Fix: OpenClaw Gives Bad Responses with Ollama (Local AI Acting Weird)
Your OpenClaw bot acts dumb with Ollama? It's not the bot — it's the model config. Here's how to fix context windows, model selection, and settings.
TL;DR: Your local model's context window is probably too small, you're using a model that's too lightweight, or your config is wrong. Here's how to fix all three.
The Error
There's no single error message here. Instead, you get symptoms:
- Bot gives nonsensical, unrelated, or repetitive answers
- Bot ignores your system prompt / personality
- Bot "forgets" what you said 2 messages ago
- Bot responds with fragments or half-sentences
- Bot works great for simple questions but falls apart in real conversations
As one GitHub user put it:
> Why does my bot act like an idiot with Ollama?
Yeah. That was literally the issue title. (GitHub #4333)
Why This Happens
Local models through Ollama are fundamentally different from cloud APIs like Claude or GPT-5. Three things usually go wrong:
1. Context window is too small
OpenClaw sends the entire conversation history (system prompt + all messages) with each request. Cloud models handle 128K-200K tokens easily. Most local models default to 2048-4096 tokens — which fills up after just a few exchanges.
When the context overflows, the model either:
- Ignores older messages (including your system prompt)
- Generates garbage because it can't "see" the full conversation
- Repeats itself because it's lost track of what was already said
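To see how fast a default window fills up, here's a back-of-envelope sketch using the common ~4 characters per token heuristic (real tokenizers vary, and the message sizes here are made up for illustration):

```python
# Rough token estimate: ~4 characters per token (heuristic, not a real tokenizer).
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

system_prompt = "You are a helpful AI assistant. " * 50      # ~1600 chars, ~400 tokens
exchange = ("user question " * 40, "assistant answer " * 80)  # one back-and-forth

num_ctx = 4096                      # a typical local-model default
used = estimate_tokens(system_prompt)
turns = 0
while used < num_ctx:               # count exchanges until the window overflows
    used += sum(estimate_tokens(m) for m in exchange)
    turns += 1

print(f"A {num_ctx}-token window overflows after ~{turns} exchanges")
```

With a modest system prompt and medium-length messages, the default window is gone in well under ten exchanges, which is exactly when the "forgets what you said" symptom appears.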
2. Model is too small
Running a 3B or 7B parameter model and expecting Claude-level intelligence is like putting a bicycle engine in a truck. Small models are great for simple Q&A but fall apart with:
- Complex instructions
- Multi-turn conversations
- Tool use
- Maintaining a consistent persona
3. Config is wrong or missing
Ollama has specific settings that OpenClaw needs. Missing or wrong config means the model runs with bad defaults.
How to Fix It
Step 1: Choose the right model
Not all models are created equal. Here's a realistic guide:
| Your Hardware | Recommended Model | VRAM Needed |
|---|---|---|
| 8GB VRAM (RTX 3070) | llama3.1:8b-instruct-q4_K_M | ~5GB |
| 12GB VRAM (RTX 3080) | mistral-nemo:12b-instruct | ~8GB |
| 16GB VRAM (RTX 4080) | llama3.1:70b-instruct-q4_K_M | ~40GB* |
| 24GB+ VRAM (RTX 4090) | qwen2.5:72b-instruct-q4_K_M | ~40GB* |
| CPU only (no GPU) | llama3.2:3b | N/A (slow) |
*70B+ models need multiple GPUs or will offload to CPU (very slow).
Pull your model:
ollama pull llama3.1:8b-instruct-q4_K_M
Rule of thumb: Use the largest instruct/chat model your hardware can run at reasonable speed. If it takes more than 5 seconds per response, go smaller.
Step 2: Increase the context window
This is the #1 fix for "dumb" behavior. Tell Ollama to use a larger context:
# Create a custom model with more context
cat > ~/Modelfile << 'EOF'
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_ctx 8192
EOF
ollama create llama3.1-8k -f ~/Modelfile
For even longer conversations:
cat > ~/Modelfile << 'EOF'
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_ctx 16384
EOF
ollama create llama3.1-16k -f ~/Modelfile
Warning: Larger context windows use more VRAM. If you set num_ctx too high, Ollama will either crash or fall back to CPU (which is painfully slow). Start with 8192 and increase if your GPU can handle it.
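If you want to experiment with several context sizes, the Modelfile is simple enough to template. A small sketch (the `render_modelfile` helper is mine, not an Ollama API):

```python
def render_modelfile(base_model: str, num_ctx: int) -> str:
    """Render an Ollama Modelfile that overrides the default context window."""
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"

# One Modelfile per context size you want to try.
variants = {
    f"llama3.1-{ctx // 1024}k": render_modelfile("llama3.1:8b-instruct-q4_K_M", ctx)
    for ctx in (8192, 16384)
}
# Write each string to a file, then register it, e.g.:
#   ollama create llama3.1-8k -f Modelfile.llama3.1-8k
```

Keeping both variants registered lets you drop back to the 8K model quickly if the 16K one exhausts your VRAM.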
Step 3: Configure OpenClaw correctly
Here's a config that actually works:
{
  "providers": {
    "ollama": {
      "type": "ollama",
      "baseUrl": "http://localhost:11434",
      "model": "llama3.1-8k",
      "options": {
        "temperature": 0.7,
        "num_ctx": 8192,
        "num_predict": 1024,
        "top_p": 0.9,
        "repeat_penalty": 1.1
      }
    }
  },
  "ai": {
    "provider": "ollama",
    "maxTokens": 1024,
    "contextStrategy": "trim"
  }
}
Key settings:
- `num_ctx`: Must match what you set in the Modelfile (or Ollama ignores it)
- `num_predict`: Max tokens per response. Don't set this too high or the model will ramble
- `repeat_penalty`: Set to 1.1 to reduce repetitive responses (a common issue with small models)
- `contextStrategy: "trim"`: Tells OpenClaw to trim old messages when approaching the context limit, instead of sending everything and hoping for the best
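The trim behavior can be pictured with a simplified sketch: keep the system prompt, then keep the newest messages that still fit the token budget. The helper names and the ~4 chars/token estimate are illustrative, not OpenClaw internals:

```python
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic, not a real tokenizer

def trim_history(messages, num_ctx, reserve=1024):
    """Keep the system prompt; drop the oldest turns until the rest fits.

    `reserve` leaves room for the model's reply (cf. num_predict)."""
    budget = num_ctx - reserve
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    for msg in reversed(rest):          # walk newest-first
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break                        # everything older gets dropped
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))
```

Without a strategy like this, an overflowing request silently pushes the system prompt out of the window, which is the "ignores its personality" symptom above.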
Step 4: Simplify your system prompt
Cloud models can handle a 2000-word system prompt with detailed personality, rules, and examples. Local models can't — that system prompt eats your context window.
Keep your system prompt under 500 tokens for 8B models:
You are a helpful AI assistant. Be concise and clear in your responses.
vs. the 2000-token personality document you'd use with Claude. Save the detailed stuff for bigger models.
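A quick way to sanity-check a prompt against that budget, again with the rough ~4 chars/token estimate (this checker is a hypothetical helper, not part of OpenClaw):

```python
def approx_tokens(prompt: str) -> int:
    """Rough token count: ~4 characters per token (real tokenizers vary)."""
    return len(prompt) // 4

def check_system_prompt(prompt: str, limit: int = 500) -> bool:
    """Return True if the prompt fits a small local model's budget."""
    n = approx_tokens(prompt)
    print(f"~{n} tokens (limit {limit})")
    return n <= limit

check_system_prompt(
    "You are a helpful AI assistant. Be concise and clear in your responses."
)
```

If the check fails, cut examples and rules first; a short persona line survives trimming far better than a long one.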
Step 5: Test it
openclaw gateway restart
# Quick test
openclaw chat --provider ollama "What is 2+2?"
If the response is coherent and fast, you're good. Try a multi-turn conversation to make sure context handling works.
How to Prevent It
- Match expectations to hardware. A 7B model on a consumer GPU is impressive for what it is, but it won't match Claude Opus or GPT-5. Adjust your system prompt complexity and conversation length accordingly.
- Monitor VRAM usage. Run `nvidia-smi` while chatting. If VRAM is maxed out, reduce `num_ctx` or use a smaller quantization.
- Clear sessions regularly. Long conversations degrade quality with small models. Reset sessions when they get long.
- Keep Ollama updated. `ollama pull <model>` also updates the model. Run it periodically.
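For the VRAM check, `nvidia-smi` can emit machine-readable output. A small sketch that parses it (the `--query-gpu`/`--format` flags are standard nvidia-smi options; the 95% warning threshold is arbitrary):

```python
import subprocess

# Expected line format, one line per GPU, from:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
def parse_vram(line: str) -> tuple[int, int]:
    """Parse 'used, total' MiB values from one CSV line."""
    used, total = (int(v.strip()) for v in line.split(","))
    return used, total

def check_vram(threshold: float = 0.95) -> None:
    """Warn when any GPU is nearly out of memory."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        used, total = parse_vram(line)
        status = ("nearly full: reduce num_ctx or use a smaller quant"
                  if used / total > threshold else "OK")
        print(f"GPU memory {used}/{total} MiB: {status}")
```

Run `check_vram()` while a conversation is in flight; a maxed-out GPU is usually the point where Ollama silently spills to CPU and responses slow to a crawl.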
The Easy Way
lobsterfarm is a managed hosting service for OpenClaw — deployment, updates, and support handled for you.
Skip the setup. lobsterfarm gives you a fully managed OpenClaw instance: one click, your own server, running 24/7.