You type a question into ChatGPT or Claude and press Enter. Within a second or two, text starts streaming back onto your screen. It looks like magic — or at least like something happening invisibly inside a black box.
It's not magic. There's a concrete sequence of operations happening, each with a specific purpose. Understanding them doesn't require a math degree — it just requires a willingness to follow the chain. Here's the full journey, from keystroke to response.
Each token in the response is generated one at a time through this pipeline, which runs once per token: hundreds of times, in rapid sequence, for a typical response.
1 Your Text is Broken Into Tokens
The model doesn't see letters or words — it sees tokens. A tokenizer slices your input text into chunks that the model was trained to work with. Common words become single tokens; rarer words get split. "ChatGPT" might be two tokens: "Chat" and "GPT." Punctuation, spaces, and newlines are their own tokens.
The tokenization is deterministic — the same input always produces the same token sequence. It's based on a vocabulary built during the model's training, typically containing 50,000 to 100,000 possible tokens. Once tokenized, your message is a list of integers — each one an index into the vocabulary.
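To make this concrete, here's a toy greedy longest-match tokenizer in Python. The vocabulary and token IDs are invented for illustration; real tokenizers use byte-pair encoding with learned merge rules and far larger vocabularies, but the deterministic text-to-integers behavior is the same:

```python
# Toy vocabulary: maps text fragments to integer IDs. Invented for
# illustration; real vocabularies are learned during training.
TOY_VOCAB = {"Chat": 17, "GPT": 42, "Hello": 3, " world": 8, "!": 5,
             "C": 100, "h": 101, "a": 102, "t": 103}

def tokenize(text: str) -> list[int]:
    """Greedily match the longest vocabulary entry at each position."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible substring first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                ids.append(TOY_VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(tokenize("ChatGPT"))  # "Chat" + "GPT" -> [17, 42]
```

Run it twice on the same string and you get the same list of integers, which is exactly the determinism described above.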
2 Tokens Become Vectors
Each token ID is looked up in an embedding table — a large matrix where each row is a learned vector representing a token. The model has trained these vectors so that tokens used in similar contexts have similar vectors. "King" and "Queen" will be near each other in embedding space; "king" and "bicycle" will be far apart.
Your entire input sequence is now a stack of vectors — one per token — each typically 4,096 to 12,288 numbers long, depending on the model size. This stack is the actual input to the main transformer network.
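A sketch of that lookup, with an invented 3-dimensional embedding table (real embeddings are thousands of dimensions wide, and the values are learned, not hand-picked):

```python
import math

# Toy embedding table: each row is the learned vector for one token ID.
# Values are invented to illustrate the geometry.
EMBEDDINGS = {
    17: [0.9, 0.1, 0.0],   # "king"
    23: [0.8, 0.2, 0.1],   # "queen"
    99: [0.0, 0.1, 0.9],   # "bicycle"
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Token IDs index rows of the table; related tokens sit close together.
king, queen, bicycle = EMBEDDINGS[17], EMBEDDINGS[23], EMBEDDINGS[99]
print(cosine(king, queen))    # high: used in similar contexts
print(cosine(king, bicycle))  # low: unrelated
```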
3 The Transformer Layers Run
The transformer is a deep neural network with dozens to hundreds of repeated layers. Your vector stack passes through each layer in sequence. Each layer does two main things:
Attention. Every token looks at every other token and figures out which ones are relevant to understanding it. "It" as a pronoun needs to figure out which noun it refers to. "Deposits" needs to know whether the surrounding context is about banking or geology. Attention is how the model handles context and relationships across the full input.
This attention mechanism is famously expensive — it scales with the square of the sequence length — which is why longer context windows cost significantly more to process. In a 128,000-token context, the final token is attending to all 127,999 tokens before it.
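Here is a minimal sketch of scaled dot-product attention in plain Python. A single attention head over tiny 2-dimensional vectors is a simplification of the real multi-head version, but the every-query-scores-every-key double loop is exactly where the quadratic cost comes from:

```python
import math

def softmax(xs):
    m = max(xs)                     # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention. Every query scores every key:
    for n tokens that is n * n score computations, hence the n^2 cost."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)   # how much this token attends to each other token
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Three toy token vectors attending to each other (self-attention):
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(x, x, x))
```

Each output row is a weighted blend of the value vectors, where the weights say which other tokens were relevant.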
Feed-forward computation. After attention, each token's vector passes through a feed-forward network that applies another learned transformation. If attention figures out which context is relevant, the feed-forward layer works out what to do with that context, applying knowledge stored in the model's weights. This is where much of the model's factual and reasoning capability is thought to live.
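The feed-forward step can be sketched as a two-layer network applied to each token's vector independently: expand to a wider hidden layer, apply a nonlinearity, project back down. The weights below are invented placeholders; in a real model they are learned and enormous:

```python
import math

def gelu(x):
    # A common activation in transformer feed-forward blocks (tanh approximation).
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

def feed_forward(vec, w1, b1, w2, b2):
    """Position-wise feed-forward: expand, nonlinearity, project back.
    Runs on each token's vector separately (no cross-token interaction)."""
    hidden = [gelu(sum(v * w for v, w in zip(vec, row)) + b)
              for row, b in zip(w1, b1)]
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(w2, b2)]

# A 2-dim token vector expanded to a 4-dim hidden layer (real models
# typically expand by about 4x as well). Weights are invented.
vec = [0.5, -0.3]
w1 = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.0, 0.1]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[0.2, 0.1, -0.1, 0.3], [0.0, 0.2, 0.1, -0.2]]
b2 = [0.0, 0.0]
print(feed_forward(vec, w1, b1, w2, b2))
```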
After all the layers, each token has a new, richly contextualized vector — one that now encodes not just what the token is in isolation, but what it means in the context of the entire input.
4 A Probability Distribution Over the Next Token
The vector for the last token in the sequence gets passed through a final linear layer that maps it to a score for every token in the vocabulary. These raw scores (logits) are then passed through a softmax function that converts them into probabilities — numbers between 0 and 1 that sum to 1.
What you get out is a probability distribution over the full vocabulary: "the next token is 'The' with probability 18%, 'I' with probability 12%, 'Here' with probability 9%..." and so on across all 50,000+ possible tokens. Most tokens will have near-zero probability; a handful will have meaningful probabilities.
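The logits-to-probabilities conversion is a few lines of code. Here's softmax over a toy 4-token vocabulary, with invented raw scores standing in for the final layer's output:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities between 0 and 1 that sum to 1."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of 4 tokens (real models score 50,000+):
vocab = ["The", "I", "Here", "banana"]
logits = [2.1, 1.7, 1.4, -3.0]            # invented scores from the final layer
probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token!r}: {p:.1%}")
```

Notice the pattern described above: a few tokens get meaningful probability, and the implausible one ("banana") lands near zero.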
This is the model's actual output. The AI doesn't directly decide what word to say — it outputs a probability distribution, and then something picks from that distribution.
5 Sampling: Picking the Next Token
The model's output is a distribution, but the response needs an actual token. This is where sampling happens, and it's controlled by settings you've probably seen mentioned: temperature and top-p.
Temperature controls how spread out the distribution is. A temperature of 0 is treated as greedy decoding: always pick the single highest-probability token (deterministic, but often repetitive and formulaic). A temperature of 1 samples from the distribution as-is. Higher temperatures make lower-probability tokens more likely: more creative, more surprising, sometimes more nonsensical.
Top-p (nucleus sampling) restricts sampling to only the smallest set of tokens whose combined probability exceeds a threshold p. With top-p = 0.9, the model samples only from whichever tokens collectively account for 90% of the probability mass, ignoring the long tail of highly unlikely tokens.
Together, these settings let you tune the trade-off between predictability and creativity. Coding tasks tend to benefit from lower temperature (more deterministic). Creative writing benefits from higher temperature (more varied).
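Both knobs can be sketched in one function. This is a simplified version of the common scheme (real implementations differ in details like tie-breaking and how temperature 0 is special-cased):

```python
import math
import random

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Temperature + nucleus (top-p) sampling over a list of logits."""
    if temperature == 0:                        # greedy: take the argmax
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]  # temperature reshapes the distribution
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the kept tokens and draw one.
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

logits = [2.0, 1.0, 0.5, -4.0]
print(sample(logits, temperature=0))              # always the top token
print(sample(logits, temperature=0.8, top_p=0.9)) # varies run to run
```

With these logits and top_p = 0.9, the last token never gets picked: it sits in the long tail that nucleus sampling cuts off.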
6 The Sampled Token is Decoded and Returned
The selected token ID gets looked up in the vocabulary to get the actual text fragment. That fragment is streamed back to you. The token might be a full word, part of a word, a space, or a punctuation mark.
This is why responses stream in piece by piece: you're literally watching the model work one token at a time. The "thinking" cursor isn't a fake UI effect — it genuinely reflects that the model is computing one token, sending it, then computing the next.
7 Repeat Until Done
Now the cycle starts over. The newly generated token is appended to the input sequence, and the whole pipeline runs again: embed, attend, compute, sample, decode. One more token appears on your screen.
This continues until the model either generates a special "end of sequence" token (signaling it's done) or hits an output length limit set by the API. A typical response of 300 words is around 400 tokens — meaning this loop ran 400 times to produce it.
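The outer loop can be sketched in a few lines. Here `step` stands in for the entire pipeline above (embed, attend, compute, sample), and the fake model and EOS token are invented just to exercise the loop:

```python
def generate(prompt_ids, step, eos_id, max_new_tokens=100):
    """The decoding loop: run the model, append the sampled token, repeat
    until an end-of-sequence token appears or the length limit is hit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = step(ids)        # a full forward pass over the sequence so far
        if next_id == eos_id:
            break
        ids.append(next_id)
        yield next_id              # streamed to the user as soon as it exists

# A fake "model" that counts up from the last token, then stops:
EOS = 99
fake_step = lambda ids: ids[-1] + 1 if ids[-1] < 5 else EOS
print(list(generate([1, 2, 3], fake_step, EOS)))  # -> [4, 5]
```

The `yield` is why responses stream: each token leaves the loop the moment it's decoded, before the next forward pass begins.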
What This Means in Practice
A few things about AI behavior make much more sense once you understand this pipeline:
AI doesn't plan ahead. The model generates token by token. It can't "think about" the whole response and then write it — it's committing to each word before it knows what comes next. This is why models sometimes start down a path and then get stuck, or why they can contradict themselves mid-response: they're building the response left-to-right, one token at a time.
AI isn't retrieving stored answers. The model doesn't have a database of pre-written answers that it looks up. Every response is synthesized from scratch by the transformer, based on the patterns encoded in the weights during training. There is no storage cabinet; there's only the computation.
Longer inputs are genuinely more expensive. Because attention scales quadratically with sequence length, a 10,000-token context doesn't cost 10x more than a 1,000-token context — it costs roughly 100x more in attention computation. This is the real economic reason why API pricing is measured in tokens and why context length matters commercially.
Randomness is intentional. The sampling step means the model will give different responses to the same question on different runs, unless temperature is set to 0. This is a feature, not a bug — it's what allows the model to generate varied, non-formulaic text. The trade-off is that it also means occasional surprising or inconsistent outputs, even when everything else is held constant.
The pipeline from prompt to token runs in milliseconds per step across massive GPU clusters. The scale is staggering — and the result, a system that produces fluent, contextually appropriate text by iteratively sampling from a learned probability distribution, is one of the more remarkable things built in the last decade of computing.