Every AI language model has a context window — a hard limit on how much text it can see at one time. Everything within that window, the model can reason about. Everything outside it, the model simply cannot access, as if it doesn't exist.

The context window is one of the most practically important properties of a language model, and also one of the most misunderstood. People often describe AI as "forgetting" things or "losing track" of earlier conversation. Usually, what's actually happening is structural: the content has scrolled out of the context window, and the model has no access to it whatsoever. It didn't forget; the information is simply no longer present.

What's a Token?

Context windows are measured in tokens, not words or characters. A token is roughly a word or word-fragment — the basic unit the model processes. "Hello" is one token. "Unbelievable" might be three tokens: "Un", "believ", "able". Punctuation often gets its own token, and most tokenizers fold whitespace into adjacent tokens, so formatting consumes tokens too.

A useful rule of thumb: one token is about three-quarters of a word, or about four characters. So 1,000 tokens is roughly 750 words — about two or three pages of a novel. 100,000 tokens is roughly 75,000 words, or a short book.
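Those rules of thumb are easy to encode. The sketch below is a heuristic only — real token counts vary by model and tokenizer, and the function names here are illustrative:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per token."""
    return max(1, round(len(text) / 4))

def estimate_tokens_from_words(word_count: int) -> int:
    """Rough estimate: about three-quarters of a word per token."""
    return round(word_count / 0.75)

print(estimate_tokens("Unbelievable"))   # 3
print(estimate_tokens_from_words(750))   # 1000
```

For exact counts, use the model's own tokenizer (for example, OpenAI publishes theirs as the tiktoken library).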

The token count in the context window includes everything the model sees: the system prompt (instructions given to the AI by the developer), the entire conversation history, any documents or files you've pasted in, and the AI's own previous responses.

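One way to picture this accounting: every message, no matter who wrote it, draws from the same token budget. A minimal sketch using the rough four-characters-per-token heuristic — the limit and messages here are made up for illustration:

```python
CONTEXT_LIMIT = 8_000  # hypothetical small context window, in tokens

def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # ~4 characters per token

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this report: " + "details " * 400},
    {"role": "assistant", "content": "Here is a summary of the report..."},
]

# Every entry counts against the same budget, including the model's own replies.
used = sum(rough_tokens(m["content"]) for m in messages)
print(f"{used} of {CONTEXT_LIMIT} tokens used")
```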

Why Context Windows Have Limits

The limit isn't arbitrary — it's a consequence of the transformer architecture. The key operation in a transformer is attention: every token attends to every other token to compute its contextual meaning. The computational cost of attention scales with the square of the sequence length. Double the context, and you quadruple the compute. A 200,000-token context isn't just 200x more expensive than a 1,000-token context — it's roughly 40,000x more expensive in attention compute.
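The quadratic arithmetic is worth making concrete. Assuming attention cost grows with the square of sequence length:

```python
def attention_cost_ratio(long_ctx: int, short_ctx: int) -> float:
    """Relative attention compute, assuming cost scales with length squared."""
    return (long_ctx / short_ctx) ** 2

print(attention_cost_ratio(2_000, 1_000))     # 4.0: double the context, 4x the compute
print(attention_cost_ratio(200_000, 1_000))   # 40000.0
```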

This is why longer context windows are genuinely hard to build. It's not a software problem you fix with a config setting; it requires either architectural innovations that make attention cheaper or massive increases in compute. Techniques like sliding-window attention and various sparse attention mechanisms attempt to reduce this quadratic scaling, and they've enabled the large context windows we have today — but real costs remain.

Context Sizes Today

Context windows have grown dramatically over the past few years. GPT-3 in 2020 had a 2,048-token context — roughly 1,500 words. Today's leading models have contexts orders of magnitude larger:

Model               Context Window      Approx. Word Equivalent
GPT-4o              128,000 tokens      ~96,000 words (a short novel)
Claude 3.5 / 3.7    200,000 tokens      ~150,000 words (a long novel)
Gemini 1.5 Pro      1,000,000 tokens    ~750,000 words (several books)
Gemini 2.0 Flash    1,000,000 tokens    ~750,000 words

One million tokens sounds extraordinary — and it is. It means you can, in theory, load an entire codebase, a year's worth of meeting transcripts, or a large collection of research papers into a single context and ask questions across all of it.

What Happens at the Limit

When a conversation or document exceeds the context window, something has to give. Different systems handle this differently:

Hard truncation. The oldest content gets dropped. If the system drops from the beginning, the model loses its earliest conversation turns. If it drops from the middle, the model loses whatever was there. Either way, that information is gone — the model will have no knowledge of it.

Summarization. Some systems periodically summarize older context into a compact form, then discard the original. This preserves the gist of early conversation at the cost of detail. Claude and ChatGPT both use variants of this in their products.

User-facing limits. Some interfaces simply stop the conversation when you hit the limit and require you to start a new one.
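The first strategy, hard truncation, can be sketched in a few lines. This is an illustration, not any particular product's implementation; token counts reuse the rough four-characters heuristic:

```python
def truncate_history(messages: list[dict], limit_tokens: int) -> list[dict]:
    """Drop the oldest non-system turns until the history fits the budget."""
    def rough_tokens(m: dict) -> int:
        return max(1, len(m["content"]) // 4)

    # Keep the system prompt; only conversation turns are eligible to drop.
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    while turns and sum(map(rough_tokens, system + turns)) > limit_tokens:
        turns.pop(0)  # the oldest turn scrolls out of the window

    return system + turns
```

Note that the dropped turns are gone for good — from the model's point of view, they never happened.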

The practical implication: if you're having a very long conversation with an AI and it suddenly seems to have "forgotten" something you said early on, you've likely hit the context limit. Starting fresh is often the right move.

The "Lost in the Middle" Problem

Bigger isn't always uniformly better. Research has shown that LLMs tend to pay more attention to content at the very beginning and very end of the context window than to content buried in the middle. A document stuffed in the middle of a 200,000-token context may receive less careful attention than the same document in a short, focused prompt.

This "lost in the middle" effect varies by model and is improving with each generation, but it's a real concern when working with large contexts. Practical workarounds include placing the most important content near the beginning or end of the prompt, and using RAG (retrieval-augmented generation) to retrieve only the most relevant chunks rather than stuffing everything in.
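The RAG workaround amounts to selecting relevant chunks before building the prompt. A deliberately naive sketch that scores chunks by word overlap with the question — real retrieval systems use embedding similarity, not this:

```python
def top_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Keep the k chunks sharing the most words with the question."""
    q_words = set(question.lower().split())

    def overlap(chunk: str) -> int:
        return len(q_words & set(chunk.lower().split()))

    # Highest-overlap chunks first; everything else stays out of the prompt.
    return sorted(chunks, key=overlap, reverse=True)[:k]
```

Only the selected chunks go into the context, so the model reads a short, focused prompt instead of hunting through the middle of a huge one.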

Why Context Window Size Matters for Real Use Cases

The practical impact of context size depends on what you're doing:

Casual chat: Context size rarely matters. Most conversations stay well within even small windows.

Document analysis: Context size becomes critical. A 4K context can't hold a 10-page report. A 128K context can. A 1M context can hold an entire book. If you want to ask questions across a whole document without chunking it up, you need a context large enough to hold the whole thing.

Long coding sessions: Codebases grow large quickly. A model with a small context will lose track of earlier files as the conversation grows. A model with a large context can hold multiple files simultaneously, which makes it dramatically more useful for non-trivial software work.

Multi-document research: The ability to load five research papers simultaneously and ask the model to synthesize them is genuinely new — it wasn't possible even two years ago. Gemini's 1M-token context makes this practical for the first time.
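The fit questions behind these use cases reduce to simple arithmetic, using the rough 0.75-words-per-token conversion from earlier (the word counts below are illustrative):

```python
def fits(word_count: int, context_tokens: int) -> bool:
    """Does a document of this many words fit, at ~0.75 words per token?"""
    return word_count / 0.75 <= context_tokens

report_words = 5_000   # roughly a 10-page report
book_words = 90_000    # roughly a full-length book

print(fits(report_words, 4_096))     # False: a 4K context can't hold it
print(fits(report_words, 128_000))   # True
print(fits(book_words, 128_000))     # True
print(fits(book_words, 1_000_000))   # True
```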

The Honest Limitation

Despite rapid growth in context sizes, the context window remains a fundamental constraint. A model cannot synthesize knowledge it doesn't have in front of it. And while retrieval systems like RAG can help route the right content into the context, they introduce their own failure modes — retrieved content that's slightly off, important context that doesn't get retrieved, and coordination complexity.

The context window is finite working memory. Like human working memory, it's surprisingly powerful within its limits, and surprisingly useless when those limits are exceeded. Understanding the limit — what's in it, what's been pushed out, and how to manage it — is one of the most important practical skills for getting reliable results from language models.