Context Windows Explained: Why Your AI 'Forgets' Mid-Conversation
What context windows are, why they matter, why your AI gets worse in long conversations, and how to work within the limits. Plain English, no PhD required.
TL;DR: AI models have a fixed-size "working memory" called a context window. Everything — your messages, the AI's responses, system prompts, and files — must fit inside it. When it fills up, things start getting dropped or degraded. Here's how it works and how to deal with it.
What Are Tokens?
Before we can talk about context windows, you need to understand tokens. Don't worry — this is simple.
A token is a chunk of text. Not quite a word, not quite a character. Roughly:
- 1 token ≈ 4 characters of English text
- 1 token ≈ ¾ of a word
- "Hello, how are you?" = about 6 tokens
- A typical paragraph = 50-100 tokens
- A full page of text = 500-700 tokens
The exact tokenization depends on the model, but these estimates are close enough for practical purposes.
Why tokens instead of words? Because language models don't process "words" — they process sub-word units that handle things like compound words, punctuation, and multiple languages more efficiently. "Unbelievable" might be 3 tokens ("un" + "believ" + "able"), while "cat" is just 1.
You don't need to think about this day-to-day. Just know that when someone says "200k tokens," they mean roughly 150,000 words or about 500 pages of text.
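The rule-of-thumb estimates above are easy to turn into code. This is a rough estimator based on the 4-characters-per-token heuristic, not a real tokenizer (exact counts depend on the model's tokenizer library):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token heuristic.

    Real tokenizers (BPE variants) vary by model; for exact counts, use the
    provider's own tokenizer library. This is only a ballpark figure.
    """
    return max(1, len(text) // 4)

print(estimate_tokens("Hello, how are you?"))  # 19 chars -> about 4-5 tokens
page = "word " * 400                           # roughly one page of text
print(estimate_tokens(page))                   # around 500 tokens
```

Good enough for budgeting, and it errs in the right direction: if your estimate says you're close to the limit, you probably are.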
The Context Window: AI's Short-Term Memory
A context window is the maximum amount of text an AI model can consider at once. Think of it as the model's desk — everything it needs to work with has to fit on the desk. If the desk gets full, something falls off.
Current Context Window Sizes
| Model | Context Window | Roughly Equals |
|---|---|---|
| Claude Sonnet/Opus | 200,000 tokens | ~500 pages |
| GPT-4o | 128,000 tokens | ~320 pages |
| GPT-4.1 | 1,000,000 tokens | ~2,500 pages |
| Gemini 1.5 Pro | 2,000,000 tokens | ~5,000 pages |
These numbers sound enormous. 500 pages! Who's going to type 500 pages into a chat?
Here's the thing: you're not the only thing taking up space.
What's Actually Inside the Context Window
When you send a message, the API doesn't just receive your message. It receives:
- System prompt — OpenClaw's instructions, personality (SOUL.md), memory files, skill definitions (~2,000-10,000 tokens)
- Conversation history — every message you've sent and every response, for the entire conversation (grows with each exchange)
- Tool results — If the AI browsed the web, read a file, or ran a command, those results are in the context (~varies wildly)
- Your new message — The thing you just typed (~50-200 tokens)
- Output space — Room for the AI's response (~200-2,000 tokens)
So a "simple" conversation that's been going for a while might look like:
```
System prompt:        5,000 tokens
Memory/personality:   3,000 tokens
20 previous messages: 4,000 tokens
20 previous responses: 12,000 tokens
3 web page reads:     15,000 tokens
Your new message:     100 tokens
─────────────────────────────────
Total:                39,100 tokens
```
That's almost 40K tokens for a relatively modest conversation. In a heavy session with lots of tool use, you can hit 100K+ easily.
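The back-of-the-envelope math above is easy to reproduce. Here's the same budget as a small script (the component names and figures are illustrative, taken from the example, not real OpenClaw telemetry):

```python
# Illustrative token budget for a mid-length session. The components and
# numbers mirror the example above; they are approximations, not measurements.
budget = {
    "system_prompt": 5_000,
    "memory_personality": 3_000,
    "previous_messages": 4_000,
    "previous_responses": 12_000,
    "web_page_reads": 15_000,
    "new_message": 100,
}

total = sum(budget.values())
print(f"Total context used: {total:,} tokens")            # 39,100
print(f"Remaining in a 200K window: {200_000 - total:,}")  # 160,900
```

Tracking a budget like this mentally, even loosely, makes it obvious why tool-heavy sessions blow through the window so fast: one line item (web page reads) dwarfs the conversation itself.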
Why Responses Degrade in Long Conversations
You've probably noticed this: after chatting with an AI for a while, the responses start getting... worse. It forgets things you said earlier. It contradicts itself. It starts ignoring instructions from the system prompt.
This isn't the model getting tired. There are three things happening:
1. The "Lost in the Middle" Problem
Research has shown that language models pay the most attention to the beginning and end of the context window. Stuff in the middle gets less attention. It's like reading a long book — you remember the first chapter and the last chapter better than chapter 17.
In a long conversation, your early messages end up in the middle of the context, sandwiched between the system prompt (beginning) and recent messages (end). The AI literally pays less attention to them.
2. Attention Dilution
With 200K tokens of context, the model's attention is spread across a lot of text. Even though the window is big enough to hold everything, the model's ability to focus on any specific part decreases as the total size grows.
Think of it like being in a room with 200 people talking. You can hear everyone, technically, but you're going to miss details compared to a room with 5 people.
3. Context Window Overflow
When the context exceeds the model's limit, something has to give. Depending on the implementation, the oldest messages get truncated, or the request simply fails. Either way, information is lost.
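The naive overflow strategy, dropping the oldest messages until everything fits, can be sketched in a few lines. This is a simplification: real implementations also protect the system prompt and may handle tool results separately.

```python
def truncate_to_fit(messages, limit, estimate=lambda m: len(m) // 4):
    """Drop the oldest messages until the estimated total fits the limit.

    `messages` is a list of strings, oldest first. Uses the rough
    4-chars-per-token estimate; a real client would count exact tokens.
    """
    kept = list(messages)
    while kept and sum(estimate(m) for m in kept) > limit:
        kept.pop(0)  # the oldest message falls off the front
    return kept

history = ["a" * 400] * 10                       # ten messages, ~100 tokens each
print(len(truncate_to_fit(history, limit=350)))  # only the 3 newest survive
```

Notice what this loses: the dropped messages are gone entirely, with no summary left behind. That's exactly the problem compaction (below) is designed to soften.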
How Compaction Works
This is where things get clever. Compaction is OpenClaw's solution to context window limits.
Instead of letting the conversation grow until it hits the wall, compaction kicks in at a configurable threshold. Here's what happens:
- The conversation reaches the threshold (say, 80,000 tokens)
- OpenClaw asks the AI to summarize the older messages
- The full conversation history gets replaced with a concise summary + the recent messages
- The context drops from 80K back down to maybe 20-30K tokens
The result: you keep the important context from earlier in the conversation, but in a compressed form that costs way less in API tokens.
Before compaction:

```
[System Prompt] [Message 1] [Response 1] ... [Message 40] [Response 40]
Total: 85,000 tokens
```

After compaction:

```
[System Prompt] [Summary of messages 1-30] [Message 31] [Response 31] ... [Message 40] [Response 40]
Total: 25,000 tokens
```
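The flow above can be sketched in code. Here `summarize` is a stub standing in for an actual model call, and the threshold and keep-count are illustrative, not OpenClaw's real implementation:

```python
def compact(messages, threshold, keep_recent=10, estimate=lambda m: len(m) // 4):
    """Replace older messages with a summary once the context passes a threshold.

    A simplified sketch of compaction: `summarize` would normally be an LLM
    call; here it is a placeholder so the structure is runnable.
    """
    def summarize(old):  # stands in for a model-generated summary
        return f"[Summary of {len(old)} earlier messages]"

    if sum(estimate(m) for m in messages) <= threshold:
        return messages  # under the threshold: nothing to do
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(old)] + recent

history = ["x" * 400] * 40                  # ~4,000 tokens of history
compacted = compact(history, threshold=2_000)
print(len(compacted))                       # 11: one summary + 10 recent messages
```

The key design choice is what to keep verbatim: recent messages carry the active thread of the conversation, so they survive untouched while older ones are collapsed.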
Compaction isn't perfect — some nuance gets lost in summarization — but it's far better than hitting the context wall or paying for enormous contexts.
Tips for Staying Within Context Limits
1. Start New Conversations
The simplest trick. After 20-30 exchanges, or when you're switching topics, start a fresh conversation. OpenClaw's persistent memory means your AI still knows who you are and what you're working on — it just doesn't carry 50 messages of old context.
2. Enable Compaction
If you're self-hosting, make sure compaction is enabled:
```json
{
  "session": {
    "compaction": {
      "enabled": true,
      "threshold": 80000
    }
  }
}
```
Compaction is available in OpenClaw and can be configured to your needs.
3. Be Specific with File Sharing
Instead of pasting an entire 2,000-line file and saying "find the bug," paste the relevant 50 lines. Every character you send takes up context space. Be surgical.
4. Summarize Before Continuing
If you've had a long conversation and want to keep going on the same topic, ask the AI: "Summarize our conversation so far." Then start a new conversation and paste the summary. You get the context without the overhead.
5. Use the Right Model
If you regularly hit context limits, consider models with larger windows. Gemini's 2M token context is massive, though quality can vary. Claude's 200K is generous for most use cases.
6. Watch for Tool-Heavy Sessions
Web browsing and file reading dump a lot of text into the context. A single web page can be 5,000-20,000 tokens. If you're asking the AI to read multiple pages or large files, the context fills up fast.
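One defensive habit for tool-heavy sessions is capping how much of each tool result enters the context. A sketch, using the same rough token estimate as before (the 2,000-token cap is an arbitrary example, not a recommended value):

```python
def cap_tool_result(text: str, max_tokens: int = 2_000) -> str:
    """Trim a tool result (web page, file contents) to a token budget.

    Uses the rough 4-characters-per-token estimate. Keeps the start of the
    text, which for a web page usually carries the most relevant content.
    """
    max_chars = max_tokens * 4
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n[... truncated to fit context budget ...]"

page = "lorem ipsum " * 5_000        # a large page, ~15,000 tokens
print(len(cap_tool_result(page)))    # capped near 8,000 chars (~2,000 tokens)
```

Cutting a 15K-token page down to 2K means five page reads cost 10K tokens instead of 75K, which is often the difference between a session that compacts gracefully and one that hits the wall.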
Context Windows vs Memory
People often confuse context windows with memory. They're different:
Context window = What the AI can see right now, in this conversation. It's short-term, limited, and resets between sessions.
Memory = What persists between conversations. In OpenClaw, this is MEMORY.md, SOUL.md, and daily notes. These files get loaded into the context window at the start of each conversation, but they're separate from the conversation itself.
Think of it this way: your context window is your desk. Your memory is your filing cabinet. You pull files from the cabinet onto the desk when you need them, but the desk still has a finite size.
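The desk/filing-cabinet split can be made concrete. Memory files live on disk and get copied into the context at session start. The file names follow the OpenClaw convention mentioned above, but the loading code itself is a sketch, not OpenClaw's actual implementation:

```python
from pathlib import Path

MEMORY_FILES = ["SOUL.md", "MEMORY.md"]  # the persistent "filing cabinet"

def build_initial_context(workdir: str) -> str:
    """Assemble the start-of-session context from persistent memory files.

    A sketch of the pattern: long-term memory lives in files on disk; each
    new session copies them onto the "desk" (the context window).
    """
    parts = []
    for name in MEMORY_FILES:
        path = Path(workdir) / name
        if path.exists():  # missing files are simply skipped
            parts.append(f"## {name}\n{path.read_text()}")
    return "\n\n".join(parts)
```

This is why a fresh conversation still "knows" you: the files are reloaded every time, while the conversation itself starts from zero.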
What's Coming
Context windows are getting bigger every year. GPT-4.1 already handles 1M tokens. Gemini does 2M. Future models will likely push into 10M+ territory.
But bigger isn't always better. Larger context windows cost more (you pay per token), and attention dilution means the model's quality can still degrade with very large contexts. Compaction and smart context management will matter even when windows are enormous.
The real future isn't infinitely large context windows — it's smarter systems that know what to load and what to leave out.
The Bottom Line
Context windows are why your AI forgets things. It's not dumb — it's working within a fixed-size constraint that gets crowded fast.
Understanding this constraint helps you work with AI more effectively: keep conversations focused, enable compaction, start fresh sessions when switching topics, and be conscious of what you're putting into the context.
Don't want to manage server infrastructure? lobsterfarm provides managed OpenClaw hosting — deployment, updates, and support handled for you.
Skip the setup. Start using your AI assistant today.
lobsterfarm gives you a fully managed OpenClaw instance — one click, your own server, running 24/7.