Complete Guide: Setting Up Ollama with OpenClaw (Free Local AI)

Step-by-step guide to running Ollama with OpenClaw for free, private, local AI. Covers installation, model selection by hardware, configuration, performance tuning, and troubleshooting.

TL;DR: Install Ollama, pull a model that fits your hardware, point OpenClaw at localhost:11434, and you have a free, private AI assistant. Takes about 15 minutes. Zero API costs, ever.

Estimated time: 15-30 minutes
Difficulty: Beginner (if you're comfortable with a terminal)


Why Ollama + OpenClaw?

Ollama is the easiest way to run AI models on your own hardware. No cloud API keys, no monthly bills, no data leaving your machine. Combined with OpenClaw, you get a full AI assistant over Telegram, WhatsApp, or Discord — powered entirely by your own computer.

The tradeoff is quality. Local models are good and getting better fast, but they're not Claude or GPT-5.2. For many tasks — answering questions, drafting text, brainstorming, summarizing — they're more than enough. For complex reasoning or long code generation, you'll notice the gap.


Step 1: Install Ollama

Linux / macOS

curl -fsSL https://ollama.com/install.sh | sh

macOS (Homebrew)

brew install ollama

Windows

Download the installer from ollama.com/download.

Verify Installation

ollama --version

You should see something like ollama version 0.7.x. Ollama starts a background service automatically after install.


Step 2: Pick a Model for Your Hardware

This is where people overthink it. Here's the simple version:

Hardware → Model Matrix

| Your RAM | Your GPU / VRAM | Recommended Model | Performance |
|---|---|---|---|
| 8 GB | None / integrated | glm4:9b | Usable, 5-15 tok/s |
| 8 GB | 6-8 GB NVIDIA | llama3.1:8b | Good, 20-40 tok/s |
| 16 GB | None | qwen3:14b | Good, 8-15 tok/s |
| 16 GB | 8-12 GB NVIDIA | deepseek-r1:14b | Good, 15-30 tok/s |
| 32 GB | 16+ GB NVIDIA | qwen3:32b | Very good, 15-25 tok/s |
| 32 GB+ | Apple Silicon M1-M4 | qwen3:32b | Very good, 20-40 tok/s |
| 64 GB+ | 24+ GB NVIDIA | deepseek-r1:70b | Near-cloud quality, 5-15 tok/s |

If you're not sure, start with glm4:9b. It's fast, capable for its size, and runs on basically anything.
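Not sure what your machine has? On Linux, a quick check of system RAM and (if present) NVIDIA VRAM — a minimal sketch; on macOS use `sysctl hw.memsize` instead:

```shell
# Report total system RAM in GB (Linux; reads /proc/meminfo)
total_kb=$(grep MemTotal /proc/meminfo | awk '{print $2}')
total_gb=$(( total_kb / 1024 / 1024 ))
echo "System RAM: ${total_gb} GB"

# Report VRAM if an NVIDIA GPU is present (nvidia-smi ships with the driver)
if command -v nvidia-smi > /dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.total --format=csv,noheader
else
  echo "No NVIDIA GPU detected"
fi
```

Match the numbers against the table above and round down — leave a few GB of headroom for the OS.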

Model Recommendations by Use Case

GLM-4 9B — The all-rounder. Fast, follows instructions well, handles multiple languages. Best bang for buck on modest hardware.

ollama pull glm4:9b

DeepSeek-R1 (14B / 32B / 70B) — Best for reasoning tasks. It "thinks" before answering (chain-of-thought), which produces notably better results for math, logic, and analysis. Slower because of the thinking step, but the quality jump is real.

ollama pull deepseek-r1:14b    # 16 GB RAM minimum
ollama pull deepseek-r1:32b    # 32 GB RAM minimum
ollama pull deepseek-r1:70b    # 64 GB RAM minimum

Llama 3.1 (8B / 70B) — Meta's workhorse. Well-rounded, excellent instruction following, huge community support. The 8B version is the classic choice for lightweight setups.

ollama pull llama3.1:8b        # 8 GB RAM minimum
ollama pull llama3.1:70b       # 64 GB RAM minimum

Qwen 3 (8B / 14B / 32B) — Alibaba's latest. Excellent multilingual support (especially CJK languages), strong coding ability, and good instruction following. The 32B version punches above its weight.

ollama pull qwen3:8b           # 8 GB RAM minimum
ollama pull qwen3:14b          # 16 GB RAM minimum
ollama pull qwen3:32b          # 32 GB RAM minimum

Pull Your Chosen Model

# Example: pulling GLM-4 9B
ollama pull glm4:9b

This downloads the model (several GB depending on size). First pull takes a few minutes on a fast connection.

Verify it's ready:

ollama list

You should see your model listed with its size and modification date.


Step 3: Test the Model Locally

Before configuring OpenClaw, make sure the model actually runs:

ollama run glm4:9b "What's the capital of France?"

You should get a response within a few seconds. If this works, the hard part is done.

Check the API is accessible:

curl http://localhost:11434/v1/models

This should return a JSON list of your downloaded models.
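You can also exercise the same OpenAI-compatible chat endpoint that OpenClaw will use. A minimal sketch, assuming glm4:9b is pulled and the server is on the default port (`python3` is used only to sanity-check the payload):

```shell
# Build a chat-completions request for Ollama's OpenAI-compatible API
BODY='{"model": "glm4:9b", "messages": [{"role": "user", "content": "Say hi"}]}'

# Sanity-check that the payload is well-formed JSON before sending
echo "$BODY" | python3 -m json.tool > /dev/null && echo "payload ok"

# Send it; prints a fallback message if the server is not reachable
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$BODY" || echo "Ollama not reachable on :11434"
```

If this returns a JSON response with a `choices` array, OpenClaw will be able to talk to Ollama too.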


Step 4: Configure OpenClaw

Open your OpenClaw config file and add Ollama as a provider.

config.json

{
  "providers": {
    "ollama": {
      "type": "openai",
      "baseUrl": "http://localhost:11434/v1",
      "apiKey": "ollama"
    }
  },
  "defaultModel": "ollama/glm4:9b"
}

config.yaml

providers:
  ollama:
    type: openai
    baseUrl: "http://localhost:11434/v1"
    apiKey: "ollama"

defaultModel: "ollama/glm4:9b"

Important notes:

  • The type must be openai — Ollama exposes an OpenAI-compatible API
  • The apiKey must be present and non-empty, even though Ollama doesn't check it. Use any string. This is the #1 setup issue people hit
  • The baseUrl must include /v1 at the end
  • The defaultModel format is provider/model-name — the model name must exactly match what ollama list shows

Multiple Models

You can configure multiple Ollama models and switch between them:

{
  "providers": {
    "ollama": {
      "type": "openai",
      "baseUrl": "http://localhost:11434/v1",
      "apiKey": "ollama",
      "models": [
        { "id": "glm4:9b", "name": "GLM-4 Fast" },
        { "id": "deepseek-r1:14b", "name": "DeepSeek Reasoning" },
        { "id": "qwen3:14b", "name": "Qwen 3 14B" }
      ]
    }
  },
  "defaultModel": "ollama/glm4:9b"
}

Step 5: Restart OpenClaw and Test

clawdbot gateway restart

Check the logs:

clawdbot gateway logs

Look for confirmation that the provider loaded without errors. Then send a message to your bot through Telegram, WhatsApp, or whatever channel you have configured.


Performance Tuning

GPU Offloading

If you have an NVIDIA GPU, Ollama automatically uses it. Check GPU usage:

nvidia-smi

If your model is running on CPU despite having a GPU:

# Make sure NVIDIA drivers and CUDA are installed
nvidia-smi  # Should show your GPU

# Ollama auto-detects NVIDIA GPUs, but you can force GPU layers
# Create a custom Modelfile
cat > Modelfile << 'EOF'
FROM glm4:9b
PARAMETER num_gpu 999
EOF

ollama create glm4-gpu -f Modelfile

Apple Silicon (M1/M2/M3/M4): Ollama uses Metal acceleration automatically. No configuration needed. Apple Silicon is excellent for local inference — the unified memory means even large models run well.

Context Window

Default context is often 2048 tokens, which is too short for a conversation. Increase it:

cat > Modelfile << 'EOF'
FROM glm4:9b
PARAMETER num_ctx 8192
EOF

ollama create glm4-8k -f Modelfile

Then update your OpenClaw config to use glm4-8k instead.
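The config change is just the model id — same shape as the Step 4 config:

```json
{
  "defaultModel": "ollama/glm4-8k"
}
```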

Context window options:

| Setting | RAM Impact | Good For |
|---|---|---|
| 2048 | Minimal | Quick Q&A, not conversations |
| 4096 | Low | Short conversations |
| 8192 | Moderate | Normal chat sessions |
| 16384 | High | Long conversations with memory |
| 32768 | Very high | Document analysis |

Warning: Larger context windows use significantly more RAM. A 7B model with 32K context can easily use 12-16 GB.

Quantization Levels

When you pull a model with ollama pull, you get the default quantization (usually Q4_K_M). You can choose different levels:

| Quantization | Quality | Size | Speed |
|---|---|---|---|
| Q2_K | ⭐⭐ | Smallest | Fastest |
| Q4_K_M | ⭐⭐⭐⭐ | Medium | Good balance |
| Q5_K_M | ⭐⭐⭐⭐ | Larger | Slightly slower |
| Q6_K | ⭐⭐⭐⭐⭐ | Large | Slower |
| Q8_0 | ⭐⭐⭐⭐⭐ | Largest | Slowest |
| fp16 | ⭐⭐⭐⭐⭐ | Huge | Slowest |

Recommendation: Stick with Q4_K_M (the default). It's the sweet spot. Only go lower if your hardware can't handle it, and only go higher if you have RAM to spare and want marginal quality improvement.
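A useful rule of thumb when comparing quantizations: file size ≈ parameter count × bits per weight ÷ 8. A quick sketch in shell integer arithmetic (real files run roughly 10-15% larger because of embeddings and metadata):

```shell
# Rough model size: params (billions) * bits per weight / 8 = GB on disk
params_b=9     # e.g. glm4:9b
bits=4         # Q4_K_M averages ~4.5 bits; 4 gives a lower bound
echo "~$(( params_b * bits / 8 )) GB on disk (lower bound)"
```

So a 9B model at ~4 bits lands around 5 GB, while the same model at Q8_0 roughly doubles that — which is why the default Q4_K_M is the sweet spot on most hardware.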

Keep Models Loaded

By default, Ollama unloads models after 5 minutes of inactivity. For an always-ready assistant, keep the model loaded:

# Set keep-alive to 24 hours
curl http://localhost:11434/api/generate -d '{
  "model": "glm4:9b",
  "keep_alive": "24h"
}'

Or set it system-wide in your Ollama environment:

# Add to ~/.bashrc or /etc/environment
export OLLAMA_KEEP_ALIVE=24h
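On Linux systemd installs, a shell export won't reach the background service; a systemd drop-in does. A sketch — run `sudo systemctl edit ollama`, add the lines below, then `sudo systemctl restart ollama`:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"
```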

Local vs Cloud: Honest Comparison

| Aspect | Ollama (7-9B) | Ollama (32-70B) | Claude Sonnet | GPT-5.2 |
|---|---|---|---|---|
| Monthly cost | $0 | $0 | $15-40 | $15-40 |
| Response quality | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Speed | 10-40 tok/s | 5-20 tok/s | 50-100 tok/s | 50-100 tok/s |
| Complex reasoning | Fair | Good | Excellent | Excellent |
| Tool/function calling | Limited | Decent | Excellent | Excellent |
| Privacy | ✅ Total | ✅ Total | ❌ Cloud | ❌ Cloud |
| Needs internet | ❌ No | ❌ No | ✅ Yes | ✅ Yes |
| Hardware needed | 8 GB RAM | 32-64 GB RAM | None | None |

Where local wins

  • You value privacy above everything
  • You want zero ongoing costs
  • You need offline access
  • You're experimenting and learning

Where cloud wins

  • Complex multi-step tasks and coding
  • Long context windows (200K+ tokens)
  • Tool use and function calling
  • Consistent speed regardless of your hardware

The hybrid approach

Run Ollama for everyday questions and switch to a cloud model when you need heavy lifting. Configure both providers in OpenClaw and switch as needed.
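A sketch of a two-provider setup, following the same provider shape as Step 4. The cloud provider's `type` and key format depend on your OpenClaw version and vendor, so treat the `claude` block as illustrative:

```json
{
  "providers": {
    "ollama": {
      "type": "openai",
      "baseUrl": "http://localhost:11434/v1",
      "apiKey": "ollama"
    },
    "claude": {
      "type": "anthropic",
      "apiKey": "YOUR_ANTHROPIC_KEY"
    }
  },
  "defaultModel": "ollama/glm4:9b"
}
```

Keep the local model as the default and switch to the cloud provider only for the heavy tasks — that way the API bill stays close to zero.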


Common Issues and Fixes

"Connection refused" when OpenClaw tries to reach Ollama

# Check if Ollama is running
systemctl status ollama    # Linux
ollama ps                  # Any platform

# If not running, start it
ollama serve               # Foreground
systemctl start ollama     # Linux, background

Responses are extremely slow

  • No GPU detected: Check ollama ps — it shows whether GPU layers are being used
  • Model too large for your RAM: If the model doesn't fit in RAM, it pages to disk, which is 10-100x slower. Use a smaller model
  • High context window: Reduce num_ctx if you set it very high
  • Other programs using GPU: Close GPU-intensive apps (games, video editors)

"Out of memory" or system freezes

The model is too big for your hardware. Either:

  1. Use a smaller model (8b instead of 14b)
  2. Use a more aggressive quantization (Q2_K instead of Q4_K_M)
  3. Reduce the context window
  4. Close other memory-heavy applications

"apiKey is required" error in OpenClaw

Set apiKey to any non-empty string in your provider config. Ollama doesn't check it, but OpenClaw requires the field to exist.

Model outputs garbage or stops mid-sentence

  • Wrong chat template: Pull the model again — Ollama usually handles templates automatically
  • Q2 quantization: Too aggressive. Pull a Q4_K_M version instead
  • Context overflow: The conversation exceeded the model's context window. Start a new session

Ollama works locally but not from OpenClaw in Docker

If OpenClaw runs in Docker, localhost inside the container doesn't reach the host machine. Use:

{
  "providers": {
    "ollama": {
      "type": "openai",
      "baseUrl": "http://host.docker.internal:11434/v1",
      "apiKey": "ollama"
    }
  }
}

Or use the host's actual IP address.
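Note that on Linux, `host.docker.internal` is not defined inside containers by default; Docker 20.10+ can map it to the host gateway at container start. A sketch — the image name is a placeholder for however you run OpenClaw:

```shell
# Map host.docker.internal to the host's gateway IP (Linux, Docker 20.10+)
docker run --add-host=host.docker.internal:host-gateway <your-openclaw-image>
```

In docker-compose, the equivalent is an `extra_hosts` entry with the value `host.docker.internal:host-gateway`.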


The Easy Way

Setting up Ollama is straightforward, but you're trading cloud model quality for privacy and zero cost. If that tradeoff works for you, this is a great setup.

If you'd rather not manage server infrastructure, lobsterfarm provides managed OpenClaw hosting — deployment, updates, and support handled for you.

Get started with lobsterfarm →

Skip the setup. Start using your AI assistant today.

lobsterfarm gives you a fully managed OpenClaw instance — one click, your own server, running 24/7.