AI Engineering Wiki

Ollama: Local LLMs Made Easy

Tools · 8 min

Ollama makes local Large Language Models accessible: no cloud, no API costs, no data leaving your machine. Within five minutes you can have your own AI chat running on your hardware.

What is Ollama?

Ollama is a CLI tool for running LLMs locally. It supports 132+ models (Llama, Mistral, CodeLlama, and more) and runs on macOS (with Metal GPU acceleration), Linux, and Windows (via WSL2).

Supported Models (Selection)

• Llama 3.2: 1B, 3B (text) and 11B, 90B (vision)
• Mistral: 7B; fast, efficient
• Phi-3: 3.8B; small but smart
• CodeLlama: 7B, 13B, 34B; specialized for coding
• Qwen: 0.5B to 72B; multilingual
• Gemma: 2B, 7B; by Google

Installation

macOS

brew install ollama

Linux/WSL2

curl -fsSL https://ollama.com/install.sh | sh

Docker (our recommendation)

docker run -d \
  --name ollama \
  -v ollama_data:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama:latest

Download Models

A model is downloaded automatically the first time you run it. You can also pull models explicitly:

# Download model
ollama pull llama3.2

# List available models
ollama list

# Show model info
ollama show llama3.2
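The same inventory is also exposed over the REST API via GET /api/tags. A minimal sketch that turns such a payload into readable lines; `format_models` is a hypothetical helper, and the sample dict only mimics the documented response shape (`{"models": [{"name": ..., "size": <bytes>}]}`):

```python
def format_models(tags_response: dict) -> list[str]:
    """Render an Ollama /api/tags payload as 'name  size' lines.

    Expects the documented shape: {"models": [{"name": ..., "size": <bytes>}]}.
    """
    lines = []
    for model in tags_response.get("models", []):
        size_gb = model["size"] / 1024**3  # bytes -> GiB
        lines.append(f"{model['name']:<20} {size_gb:.1f} GB")
    return lines

# Stand-in for a GET http://localhost:11434/api/tags response
sample = {"models": [{"name": "llama3.2:3b", "size": 4_080_218_931}]}
print("\n".join(format_models(sample)))
```

In a real script you would fetch the payload with any HTTP client and pass the parsed JSON to this helper.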

Recommended Starter Models

| Model       | Size   | VRAM  | Use Case          |
|-------------|--------|-------|-------------------|
| phi3:3.8b   | 2.3 GB | ~4 GB | Fast, beginner-friendly |
| llama3.2:1b | 1.3 GB | ~2 GB | Lightweight, fast |
| llama3.2:3b | 3.8 GB | ~6 GB | Balanced          |
| llama3.1:8b | 4.7 GB | ~8 GB | Advanced          |
| mistral:7b  | 4.1 GB | ~8 GB | Coding, reasoning |
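The VRAM column roughly tracks quantized model size: a 4-bit model needs about half a byte per parameter, plus headroom for the KV cache and runtime buffers. The sketch below is our own back-of-the-envelope approximation (the 30% overhead factor is an assumption, not an Ollama formula, which is why the table's figures run somewhat higher):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: int = 4,
                     overhead: float = 1.3) -> float:
    """Rough VRAM estimate for a quantized model: bytes for the weights,
    times an assumed ~30% headroom factor for KV cache and buffers."""
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 4-bit ~= 0.5 GB
    return weight_gb * overhead

# e.g. an 8B model at the common 4-bit quantization
print(f"~{estimate_vram_gb(8):.1f} GB")  # prints ~5.2 GB; long contexts need more
```

Longer context windows grow the KV cache, so treat the table's figures as the safer planning numbers.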

Using Ollama

Interactive Chat

ollama run llama3.2

Exit the chat with /bye. You can also pass a single prompt non-interactively:

ollama run llama3.2 "Explain Docker in one sentence"

REST API

Ollama exposes a REST API on port 11434. By default it streams the response as newline-delimited JSON; add "stream": false to the request body to receive a single JSON object instead:

# Chat
curl -X POST http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.2",
    "messages": [
      { "role": "user", "content": "Hello!" }
    ]
  }'

# Generate (single response)
curl -X POST http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "What is Docker?"
  }'
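When streaming, each line of the response body is one JSON chunk, with the text under message.content (chat) or response (generate) and "done": true on the final chunk. A sketch of reassembling a streamed chat response; the sample lines below stand in for a real HTTP response body:

```python
import json

def assemble_chat_stream(ndjson_lines) -> str:
    """Join the content pieces of a streamed /api/chat response.

    Each line is a JSON object; chat chunks carry text under
    message.content, and the final chunk has "done": true.
    """
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Stand-in for the body of a streamed /api/chat response
stream = [
    '{"message": {"role": "assistant", "content": "Hello"}, "done": false}',
    '{"message": {"role": "assistant", "content": " there!"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(assemble_chat_stream(stream))  # -> Hello there!
```

The same pattern works for /api/generate if you read the "response" field instead of message.content.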

GPU Configuration

Ollama automatically uses available GPUs. For Docker, the GPU needs to be passed through:

# NVIDIA GPU
docker run -d --gpus all \
  --name ollama \
  -v ollama_data:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama:latest

# Or with docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

Our Docker Swarm Setup

In our 3-node Swarm, Ollama runs on the GPU node (docker-swarm3):

services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
      placement:
        constraints:
          - node.hostname == docker-swarm3
    networks:
      - ai-network

Web Interface: Open WebUI

For a ChatGPT-like interface, we use Open WebUI:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui_data:/app/backend/data
    depends_on:
      - ollama
    networks:
      - ai-network

Next Steps

  • Set up RAG: RAG Complete Guide →
  • Compare models: Test multiple models in parallel
  • Monitoring: Enable Prometheus metrics
