
Context Window: How LLM Token Limits Work and Why They Matter

The context window is the maximum amount of text a language model can read in a single request. Learn how it works, current limits per model, and how to handle long conversations.

More about Context Window

The context window is the maximum amount of text, measured in tokens, that a large language model can process in a single request. Everything the model "sees" on a given turn has to fit into this window: the system prompt, the user's question, the conversation history, any retrieved documents from retrieval augmented generation, and the model's own output.

Because the context window is finite, designing how to use it well is one of the central skills of building an AI chatbot.

Tokens, Not Words

Models do not read words; they read tokens. A token is a chunk of text produced by the model's tokenizer, usually somewhere between a single character and a full word. In English, one token averages about four characters, or roughly 0.75 words, so a 500-word document works out to roughly 650 to 700 tokens.

This matters because context window limits are always expressed in tokens, not words. A 128K context window means about 96,000 words of usable space, minus whatever the model needs for its output.
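
If you want to see the word-to-token ratio for yourself, one way (assuming you are working against an OpenAI-style model) is the open-source tiktoken tokenizer; other providers ship their own tokenizers, so counts vary slightly between model families:

```python
# Rough sketch: compare word count with token count using tiktoken.
# cl100k_base is the tokenizer used by GPT-4-class models; other
# providers' tokenizers will give slightly different counts.
import tiktoken

text = "The context window is the maximum amount of text a model can read in one request."

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
# English prose usually lands near 0.75 words per token, so a
# 500-word document comes out around 650 to 700 tokens.
```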

Current Context Window Sizes

Common models fall into these ranges:

  • OpenAI GPT-4o and GPT-4.1 class: 128K tokens
  • Anthropic Claude 4 family (Opus, Sonnet, Haiku): 200K tokens standard, with 1M-token variants available
  • Google Gemini 2.x Pro: 1M to 2M tokens
  • Smaller open-source models (Llama, Mistral, Gemma): 8K to 128K is typical

Larger is not automatically better. Bigger windows cost more per call, latency increases, and models often pay less attention to content in the middle of a long context, a phenomenon sometimes called "lost in the middle".

Why the Context Window Matters for Chatbots

Every serious chatbot design decision touches the context window:

  • Conversation history: how much chat history to include on each turn.
  • RAG chunks: how many retrieved passages to stuff in, and how long each should be.
  • System prompt: essential instructions that should never get truncated.
  • Output budget: reserving tokens for the reply so the model does not get cut off.
  • Cost: every token in the window is paid for on every turn.

The practical math: if you have a 128K window and spend 40K on retrieval, 10K on history, 2K on the system prompt, and leave 4K for output, you have about 72K of headroom. That sounds like a lot until a customer sends a long document and asks about it.
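
Written out as a quick back-of-the-envelope script (the numbers are just the example figures from the paragraph above, not provider-specific constants):

```python
# Token budget for the 128K example above; all figures are illustrative.
CONTEXT_WINDOW = 128_000

budget = {
    "retrieved_chunks": 40_000,
    "chat_history": 10_000,
    "system_prompt": 2_000,
    "output_reserve": 4_000,
}

headroom = CONTEXT_WINDOW - sum(budget.values())
print(f"Headroom: {headroom:,} tokens")  # Headroom: 72,000 tokens
# A single long pasted document can consume most of that headroom.
```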

Strategies for Managing the Context Window

Teams use a handful of patterns to stay inside the window:

  • Trim and summarise history: keep the last N turns verbatim and summarise older ones into a short paragraph (a sketch follows this list).
  • Retrieve, do not stuff: rather than passing the whole knowledge base, use a vector database and semantic search to retrieve only the top few most relevant chunks.
  • Rank and filter: use a cross-encoder reranker to pick the highest-quality retrieved passages.
  • Token-aware chunking: chunk documents at boundaries that fit cleanly under the limit, with enough overlap to preserve meaning.
  • Output capping: set max output tokens so replies stay short and predictable.
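
Here is a minimal sketch of the first pattern. The summarise() helper is hypothetical, standing in for a cheap model call that condenses older turns into a short paragraph:

```python
# Minimal sketch of "trim and summarise history".
# summarise() is a hypothetical helper (for example, a cheap LLM call),
# not a specific provider API.
def build_history(turns: list[str], keep_last: int = 6) -> list[str]:
    if len(turns) <= keep_last:
        return turns

    older, recent = turns[:-keep_last], turns[-keep_last:]
    summary = summarise(older)  # hypothetical: condense older turns
    return [f"Summary of earlier conversation: {summary}", *recent]
```

The same idea works with token counts instead of turn counts: keep appending the most recent turns until your history budget is spent, then summarise the rest.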

SiteSpeak handles this automatically for teams. The chatbot retrieves only the most relevant passages from your indexed content, summarises long transcripts, and leaves enough headroom for a clean answer. That keeps cost low and quality high without manual tuning.

Context Window vs. Model Memory

A common confusion: the context window is temporary. Everything that fits in the window is available for one call; once the request returns, the model forgets. To give a chatbot long-term memory, you need a separate storage layer, often a vector store or a structured agent memory system, and you load relevant pieces into the context window on each turn.

The window is working memory. Persistent memory is a different problem that every production chatbot has to solve.
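
A sketch of what that per-turn loop looks like, where memory_store.search() and call_model() are placeholders for whatever vector store and model client you actually use:

```python
# Illustrative only: persistent memory lives outside the model and is
# loaded into the context window on every request.
# memory_store.search() and call_model() are placeholders, not real APIs.
def answer(user_message: str, memory_store, system_prompt: str) -> str:
    # Pull only the pieces of long-term memory relevant to this turn.
    memories = memory_store.search(user_message, top_k=3)

    prompt = [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": "Relevant memory:\n" + "\n".join(memories)},
        {"role": "user", "content": user_message},
    ]
    return call_model(prompt)  # placeholder for the real model call
```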

Frequently Asked Questions

How big a context window does a chatbot actually need?

Bigger is not automatically better. For most customer-facing chatbots, 32K to 128K tokens is plenty once you use retrieval augmented generation to pull in only relevant content. Million-token windows are useful for document analysis and long codebases, but they cost more per call, add latency, and suffer from attention dropoff in the middle of long contexts. Start small and only scale up when you hit a real limit.

What happens if a request exceeds the context window?

Most providers either reject the request with an error or silently truncate the oldest content until it fits. Neither is ideal. The right behaviour is to detect the overflow before you send the request, summarise or drop older chat history, and reduce the number of retrieved passages. Monitoring for near-limit requests in production is a good early warning that your prompt design needs tightening.
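
A pre-flight check can be as small as counting tokens before the call. This sketch assumes an OpenAI-style tokenizer via tiktoken; swap in your provider's tokenizer for exact counts:

```python
# Pre-flight overflow check (sketch); counts are approximate if the
# tokenizer does not exactly match the target model's.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fits(prompt: str, window: int = 128_000, output_reserve: int = 4_000) -> bool:
    """True if the prompt leaves room for the reserved output tokens."""
    return len(enc.encode(prompt)) <= window - output_reserve
```

If the check fails, drop the oldest history or the lowest-ranked retrieved passages and try again before sending.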

Is the context window the same as the token limit?

They are often used interchangeably. Strictly, the context window is the total space available for input plus output, while the token limit can refer either to that total or to the maximum number of output tokens in a single response. When in doubt, check the provider's docs: the two numbers can differ, especially for models that cap output at 4K or 8K tokens even when the total window is much larger.


Ready to automate your customer service with AI?

Join 1,000+ businesses, websites and startups automating their customer service and other tasks with a custom-trained AI agent.

Create Your AI Agent. No credit card required.