The context window is the maximum amount of text a language model can read in a single request. Learn how it works, current limits per model, and how to handle long conversations.
More About the Context Window
The context window is the maximum amount of text, measured in tokens, that a large language model can process in a single request. Everything the model "sees" on a given turn has to fit into this window: the system prompt, the user's question, the conversation history, any documents retrieved via retrieval-augmented generation (RAG), and the model's own output.
Because the context window is finite, designing how to use it well is one of the central skills of building an AI chatbot.
Tokens, Not Words
Models do not read words; they read tokens. A token is a chunk of text produced by the model's tokenizer, usually somewhere between a single character and a full word. In English, one token averages about four characters, or roughly 0.75 words, so a 500-word document comes to roughly 670 tokens.
This matters because context window limits are always expressed in tokens, not words. A 128K context window means about 96,000 words of usable space, minus whatever the model needs for its output.
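The rules of thumb above can be turned into a quick planning estimate. This is only the ~4-characters-per-token heuristic from this section, not a real tokenizer; actual counts depend on the model's tokenizer (for OpenAI models, for example, the `tiktoken` library gives exact counts).

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per English token.

    A planning approximation only; real counts come from the
    model's own tokenizer.
    """
    return max(1, len(text) // 4)


def estimate_tokens_from_words(word_count: int) -> int:
    """Alternative estimate: one token is ~0.75 English words."""
    return round(word_count / 0.75)
```

Useful for budgeting before you send anything: `estimate_tokens_from_words(500)` lands near the ~670-token figure above.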
Current Context Window Sizes
Common models fall into these ranges:
- OpenAI GPT-4o and GPT-4.1 class: 128K tokens
- Anthropic Claude 4 family (Opus, Sonnet, Haiku): 200K tokens standard, with 1M-token variants available
- Google Gemini 2.x Pro: 1M to 2M tokens
- Smaller open-source models (Llama, Mistral, Gemma): 8K to 128K is typical
Larger is not automatically better. Bigger windows cost more per call, latency increases, and models often pay less attention to content in the middle of a long context, a phenomenon sometimes called "lost in the middle".
Why the Context Window Matters for Chatbots
Every serious chatbot design decision touches the context window:
- Conversation history: how much chat history to include on each turn.
- RAG chunks: how many retrieved passages to stuff in, and how long each should be.
- System prompt: essential instructions that should never get truncated.
- Output budget: reserving tokens for the reply so the model does not get cut off.
- Cost: every token in the window is paid for on every turn.
The practical math: if you have a 128K window and spend 40K on retrieval, 10K on history, 2K on the system prompt, and leave 4K for output, you have about 72K of headroom. That sounds like a lot until a customer sends a long document and asks about it.
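The arithmetic above is simple enough to keep explicit in code. A minimal sketch of that budget, with the same illustrative numbers (none of these are model requirements):

```python
# Hypothetical token budget for a 128K-context model. All line items
# are illustrative allocations, not limits imposed by any provider.
CONTEXT_WINDOW = 128_000

budget = {
    "retrieval": 40_000,       # RAG chunks
    "history": 10_000,         # recent conversation turns
    "system_prompt": 2_000,    # instructions that must never be cut
    "output_reserve": 4_000,   # space for the model's reply
}

# What remains for anything unplanned, e.g. a long pasted document.
headroom = CONTEXT_WINDOW - sum(budget.values())
print(headroom)  # 72000
```

Making the budget explicit like this also gives you a natural place to fail fast: if an incoming document exceeds `headroom`, you can summarise or chunk it before the call instead of getting silently truncated.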
Strategies for Managing the Context Window
Teams use a handful of patterns to stay inside the window:
- Trim and summarise history: keep the last N turns verbatim, summarise older ones into a short paragraph.
- Retrieve, do not stuff: rather than passing the whole knowledge base, use a vector database and semantic search to retrieve only the top few most relevant chunks.
- Rank and filter: use a cross-encoder reranker to pick the highest-quality retrieved passages.
- Token-aware chunking: chunk documents at boundaries that fit cleanly under the limit, with enough overlap to preserve meaning.
- Output capping: set max output tokens so replies stay short and predictable.
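The first pattern, trim and summarise, can be sketched in a few lines. Here `summarise` is a stand-in for whatever summarisation you use (typically a model call); the function name and shape are assumptions for illustration, not a specific API.

```python
def build_history(turns: list[str], keep_last: int, summarise) -> list[str]:
    """Keep the last `keep_last` turns verbatim and collapse everything
    older into a single short summary message.

    `summarise` is any callable taking a list of turns and returning a
    string -- in practice usually an LLM call.
    """
    if len(turns) <= keep_last:
        return list(turns)
    older, recent = turns[:-keep_last], turns[-keep_last:]
    summary = "Summary of earlier conversation: " + summarise(older)
    return [summary] + recent
```

The result is bounded: however long the conversation gets, the prompt carries one summary line plus `keep_last` verbatim turns.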
SiteSpeak handles this automatically for teams. The chatbot retrieves only the most relevant passages from your indexed content, summarises long transcripts, and leaves enough headroom for a clean answer. That keeps cost low and quality high without manual tuning.
Context Window vs. Model Memory
A common confusion: the context window is temporary. Everything that fits in the window is available for one call; once the request returns, the model forgets. To give a chatbot long-term memory, you need a separate storage layer, often a vector store or a structured agent memory system, and you load relevant pieces into the context window on each turn.
The window is working memory. Persistent memory is a different problem that every production chatbot has to solve.
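The two-layer design reads as: a persistent store outside the model, plus a per-turn loader that copies only relevant items into the window. A minimal sketch, using naive keyword overlap as the relevance score purely for illustration (a real system would use embeddings and a vector store):

```python
def score(query: str, memory: str) -> int:
    """Toy relevance score: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(memory.lower().split()))


def recall(query: str, store: list[str], top_k: int = 2) -> list[str]:
    """Load the top_k stored memories most relevant to this turn.

    The store persists across requests; only what `recall` returns
    enters the context window.
    """
    return sorted(store, key=lambda m: score(query, m), reverse=True)[:top_k]


store = [
    "Customer prefers email over phone",
    "Order 1042 shipped on Monday",
    "User is on the enterprise plan",
]
context = recall("when did my order ship", store)
```

On each turn the chatbot calls `recall`, prepends the results to the prompt, and after the reply writes any new durable facts back to the store. The model itself never remembers; the loop around it does.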