Introduction
Today, Large Language Models (LLMs) such as ChatGPT, LLaMA, Gemini, Claude and Mistral are becoming an important part of modern AI applications. Millions of users interact with these models every day to ask questions, generate content, write code, summarize documents and solve many other tasks. Because the models can do all of this within a few seconds, the usage of LLMs is increasing rapidly across industries. From the user's perspective, the process looks very simple: the user enters a prompt, waits a few seconds and the model generates a response. However, behind this simple interaction there is a complex pipeline running inside the model.
Many people assume that ChatGPT and other LLMs work like a search engine. They think that when a question is asked, the model searches a database and returns the answer. This is not how modern LLMs work. In most cases the model does not search any database during inference. Instead, it uses the knowledge already stored in its billions of parameters and generates the response one token at a time.
For example, suppose a user enters the following prompt:
The capital of France is
The model does not know that the next word should be Paris in the same way a human knows it. Internally, the model performs a lot of mathematical calculations. It converts the text into tokens, transforms the tokens into embeddings, processes them through Transformer layers, calculates a probability for every possible next token and finally selects one token according to those probabilities.
If the model predicts:
Paris = 85%
London = 7%
Rome = 4%
Delhi = 2%
then the token Paris has the highest probability and will therefore most likely be selected as the next token.
After generating Paris, the model does not stop. The generated token becomes part of the input and the whole process starts again. The model predicts the next token, then another, and continues this cycle until the complete response is generated.
This process of generating one token at a time is known as autoregressive generation, and it is the foundation of modern decoder-only architectures such as GPT and LLaMA.
Another important thing to understand is that inference and training are completely different processes. During training, the model learns from a huge amount of data. It calculates the loss, performs backpropagation, updates its weights and gradually improves its predictions. During inference, no learning happens. The model only uses knowledge that was already learned during training.
In other words:
During Training:
- Forward Pass
- Loss Calculation
- Backpropagation
- Gradient Computation
- Weight Updates
During Inference:
- Forward Pass
- Token Prediction
- Response Generation
There are no gradients, no optimizer and no weight updates during inference.
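The difference between the two lists above can be sketched with a toy one-parameter model. This is only an illustration under simplifying assumptions: the model `prediction = w * x`, the function names, the learning rate and the data point are all made up for this example, and the gradient is derived by hand instead of by a real backpropagation library.

```python
# Toy one-parameter "model": prediction = w * x.
# Every name and number here is an illustrative assumption.

def forward(w: float, x: float) -> float:
    """Forward pass: compute the prediction from the current weight."""
    return w * x

def training_step(w: float, x: float, target: float, lr: float = 0.1) -> float:
    """One training step: forward pass, gradient of the loss, weight update."""
    prediction = forward(w, x)
    # Squared-error loss is (prediction - target) ** 2; its derivative
    # with respect to w is computed by hand below.
    grad = 2 * (prediction - target) * x
    return w - lr * grad  # gradient-descent weight update

def inference_step(w: float, x: float) -> float:
    """Inference: forward pass only; the weight is read but never changed."""
    return forward(w, x)

w = 0.0
for _ in range(50):  # training changes w on every step
    w = training_step(w, x=1.0, target=2.0)

w_before = w
prediction = inference_step(w, x=3.0)  # inference leaves w untouched
print(w == w_before, round(prediction, 3))
```

The training loop moves `w` toward 2.0, while the inference call only reads it. Real LLM inference follows the same pattern, just with billions of weights instead of one.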
The complete inference pipeline can be represented as:
User Prompt
↓
Tokenization
↓
Token IDs
↓
Embeddings
↓
Positional Information
↓
Transformer Layers
↓
Logits
↓
Softmax
↓
Sampling
↓
Next Token Prediction
↓
Generated Response
Understanding this pipeline is important because inference is the stage where users actually interact with the model. No matter how large or powerful a model is, every response generated by ChatGPT, Gemini, Claude, LLaMA or any other LLM ultimately follows the same fundamental inference process.
By understanding the inference pipeline, we can better understand how modern LLMs generate text, why they are able to answer questions and how they predict the next token again and again to create a complete response. Most of the intelligence we see in ChatGPT and other LLMs comes from this inference process combined with the knowledge stored in the model weights.
In this blog we will go through each stage of the inference pipeline in detail and see how a simple text prompt is transformed into a meaningful AI-generated response.
1. User Prompt
The inference process starts when the user enters a prompt into the model. A prompt can be a question, an instruction or a statement that acts as the input for the LLM.
Example:
User: What is the capital of France?
At this stage the model only receives raw text. It does not understand the meaning of the sentence the way humans do. For the model, the input is only a sequence of characters stored in memory. The purpose of this step is simply to provide information to the model so that it can start processing. Once the prompt is received, it is passed to the tokenizer because the model cannot directly process human language.
2. Tokenization
After receiving the prompt, the first actual processing step is tokenization. Tokenization is the process of breaking the input text into smaller pieces called tokens. These tokens can be words, subwords or even individual characters, depending on the tokenizer being used.
Example:
Input: The capital of France is
Tokens: ["The", "capital", "of", "France", "is"]
Tokenization is needed because neural networks cannot work directly with text. They only understand numbers. Therefore the text must first be converted into a structured format. After tokenization, each token is converted into a unique numerical identifier known as a Token ID.
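As a rough illustration, the word-level split from the example can be sketched in a few lines. Note the assumption: production LLM tokenizers generally use learned subword algorithms such as byte-pair encoding, so `toy_tokenize` below is a hypothetical helper that only reproduces the simplified example, not the behaviour of any real tokenizer.

```python
import re

def toy_tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens.

    Hypothetical word-level tokenizer for illustration only; real LLM
    tokenizers usually split text into learned subword units.
    """
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("The capital of France is")
print(tokens)
```

With this sketch every word becomes one token. A subword tokenizer could instead split a rare word into several pieces, which is why token counts and word counts usually differ.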
3. Token IDs
After tokenization, every token is mapped to a number from the model vocabulary.
Example:
"The" → 791
"capital" → 3139
"of" → 286
"France" → 4881
"is" → 318
Final Token IDs: [791, 3139, 286, 4881, 318]
These IDs act like labels. Just as roll numbers identify students, token IDs identify tokens. However, token IDs themselves do not carry any meaning. The number 791 does not tell the model anything about the word "The". It only acts as an identifier. The purpose of token IDs is to convert text into a numerical format so that mathematical operations can be performed.
The next step is converting these IDs into embeddings.
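The mapping in this section is essentially a dictionary lookup. The sketch below reuses the example IDs from above; the five-entry `vocab` and the `encode` / `decode` helpers are hypothetical stand-ins, since a real model vocabulary typically holds tens of thousands of entries.

```python
# Hypothetical five-entry vocabulary built from the example IDs in the text.
vocab = {"The": 791, "capital": 3139, "of": 286, "France": 4881, "is": 318}

# Reverse mapping, used to turn generated IDs back into text.
id_to_token = {token_id: token for token, token_id in vocab.items()}

def encode(tokens: list[str]) -> list[int]:
    """Map each token to its ID in the vocabulary."""
    return [vocab[token] for token in tokens]

def decode(token_ids: list[int]) -> list[str]:
    """Map each ID back to its token."""
    return [id_to_token[token_id] for token_id in token_ids]

token_ids = encode(["The", "capital", "of", "France", "is"])
print(token_ids)
print(decode(token_ids))
```

The reverse mapping matters later in the pipeline: the model produces token IDs, and they have to be decoded back into text before the user sees the response.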
4. Embeddings
Token IDs are only numbers, and numbers alone do not carry semantic meaning. Therefore they need to be converted into meaningful vector representations.
This process is performed by the Embedding Layer.
Example:
791
↓
[0.12, -0.45, 1.38, ...]
Each token is transformed into a high-dimensional vector.
The purpose of embeddings is to capture semantic relationships between words. Words having similar meanings usually get similar vector representations.
For example:
King and Queen may appear close together in vector space.
Dog and Cat may also appear close together.
Dog and Airplane would usually be far apart.
Embeddings are important because Transformer layers work on vectors, not token IDs. They provide a mathematical representation of language that the model can work with.
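One common way to measure "closeness" in vector space is cosine similarity. The sketch below uses hand-picked 3-dimensional vectors purely for illustration. Real embedding tables are learned during training, have far more dimensions and are indexed by token ID; this toy `embedding_table` is keyed by word only for readability.

```python
import math

# Hand-picked vectors, purely illustrative (an assumption, not learned values).
embedding_table = {
    "dog":      [0.9, 0.8, 0.1],
    "cat":      [0.8, 0.9, 0.2],
    "airplane": [0.1, 0.2, 0.9],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

dog_cat = cosine_similarity(embedding_table["dog"], embedding_table["cat"])
dog_airplane = cosine_similarity(embedding_table["dog"], embedding_table["airplane"])

# Related words score higher than unrelated ones in this toy table.
print(dog_cat > dog_airplane)
```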
5. Positional Information
Transformers process all tokens in parallel. Because of this, they do not naturally know the order of the words.
For example:
Dog bites man and Man bites dog
contain the same words but have completely different meanings.
Without positional information, both sentences may look the same to the model.
To solve this problem, positional information is added to the embeddings.
Final Input:
Token Embedding + Positional Information
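The purpose of positional information is to tell the model where each token appears in the sequence.
Many modern LLMs use RoPE (Rotary Positional Embeddings) because it tends to handle long contexts better than traditional positional encoding methods. Instead of adding a position vector to the embedding, RoPE encodes position by rotating the query and key vectors inside the attention layers.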
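The "Dog bites man" example can be checked with a small sketch. To keep it short, it uses the classic additive sinusoidal encoding from the original Transformer design, not RoPE, and the 4-dimensional token embeddings are hypothetical values chosen for illustration.

```python
import math

def sinusoidal_position(pos: int, dim: int) -> list[float]:
    """Classic sinusoidal encoding: sin on even indices, cos on odd indices."""
    vec = []
    for i in range(dim):
        # Each (even, odd) index pair shares one frequency.
        angle = pos / (10000 ** ((i - i % 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

# Hypothetical 4-dimensional token embeddings.
embeddings = {
    "dog":   [1.0, 0.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0, 0.0],
    "man":   [0.0, 0.0, 1.0, 0.0],
}

def embed(tokens: list[str], use_positions: bool) -> list[list[float]]:
    """Look up each token and optionally add its positional vector."""
    rows = []
    for pos, token in enumerate(tokens):
        vec = embeddings[token]
        if use_positions:
            pe = sinusoidal_position(pos, len(vec))
            vec = [e + p for e, p in zip(vec, pe)]
        rows.append(vec)
    return rows

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Ignoring order, the two sentences are the same collection of vectors...
same_without = sorted(embed(a, False)) == sorted(embed(b, False))
# ...but once positions are added, "dog" at position 0 differs from "dog" at 2.
same_with = sorted(embed(a, True)) == sorted(embed(b, True))
print(same_without, same_with)
```

Sorting the vectors deliberately throws away their order, which mimics a model that has no notion of position. Only the version with positional information can tell the two sentences apart.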
6. Transformer Layers
After embeddings are prepared, they are passed through multiple Transformer layers.
This is the stage where most of the language understanding happens.
Each Transformer layer contains:
- Self-Attention
- Feed Forward Network (FFN)
- Residual Connections
- Layer Normalization
The attention mechanism allows tokens to communicate with each other and understand context.
Example:
The animal didn't cross the street because it was tired.
Attention lets the model link:
it → animal
instead of:
it → street
The purpose of Transformer layers is to understand relationships, context and patterns present in the input sequence. This stage performs most of the reasoning inside the model.
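The core of self-attention can be sketched as scaled dot-product attention. This is a heavily simplified version under stated assumptions: a single head, no learned query/key/value projections (the inputs are used directly), no causal mask, and hand-picked 2-dimensional vectors in which "it" is deliberately placed close to "animal".

```python
import math

def softmax_scores(scores: list[float]) -> list[float]:
    """Turn raw attention scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors: list[list[float]]):
    """Scaled dot-product attention with queries = keys = values = inputs."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax_scores(scores)
        # New representation: weighted average of all token vectors.
        mixed = [
            sum(w * value[j] for w, value in zip(weights, vectors))
            for j in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors, chosen so that "it" points roughly toward "animal".
tokens = ["animal", "street", "it"]
vectors = [[2.0, 0.0], [0.0, 2.0], [1.5, 0.2]]

outputs, weights = self_attention(vectors)
it_weights = dict(zip(tokens, weights[2]))
print(max(it_weights, key=it_weights.get))
```

In a trained model the vectors and projections are learned, so the link between "it" and "animal" emerges from training data and is not hand-placed as it is here.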
7. Logits Generation
After all Transformer layers finish processing, the model generates logits.
Logits are raw prediction scores assigned to every token present in the vocabulary.
Example:
Paris = 12.8
London = 8.4
Rome = 5.1
Delhi = 2.3
At this stage these values are not probabilities; they are only scores. The purpose of logits is to indicate which tokens the model currently considers more likely to appear next. A higher score generally means a higher chance of being selected. However, logits cannot be directly interpreted as probabilities.
Therefore the next step is Softmax.
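Logits are typically produced by multiplying the final hidden vector of the last position by an output weight matrix that has one row per vocabulary token. The sketch below shows that dot product. The 3-dimensional `hidden_state`, the four-token `output_weights` and all of their values are hypothetical, so the resulting scores differ from the example numbers above.

```python
# Hypothetical final hidden state for the last position in the sequence.
hidden_state = [1.0, 2.0, 0.5]

# Hypothetical output weights: one row per vocabulary token.
output_weights = {
    "Paris":  [2.0, 1.5, 0.4],
    "London": [1.0, 1.0, 0.2],
    "Rome":   [0.5, 0.5, 1.0],
    "Delhi":  [0.2, 0.1, 0.4],
}

def compute_logits(hidden: list[float], weights: dict[str, list[float]]) -> dict[str, float]:
    """One raw score per token: dot product of the hidden state and that token's row."""
    return {
        token: sum(h * w for h, w in zip(hidden, row))
        for token, row in weights.items()
    }

logits = compute_logits(hidden_state, output_weights)
for token, score in logits.items():
    print(f"{token} = {score:.1f}")
```

The scores are unbounded and do not sum to anything in particular, which is why a normalization step is still needed.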
8. Softmax
Softmax converts logits into probabilities.
Example (illustrative values):
Paris = 85%
London = 7%
Rome = 4%
Delhi = 2%
Now the model has a probability distribution over all possible next tokens. The purpose of Softmax is to make the prediction scores easier to interpret and compare. After Softmax, every probability lies between 0 and 1 and the probabilities over the whole vocabulary sum to 100%.
The model can now decide which token should be generated.
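Softmax exponentiates every logit and divides by the sum of all the exponentials. The sketch below applies it to the four example logits from the Logits section. Because the percentages in the text are rounded illustrations and a real vocabulary contains many more tokens, the exact values printed here are not expected to match them.

```python
import math

def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum first avoids overflow and does not change the result.
    max_logit = max(logits.values())
    exps = {token: math.exp(value - max_logit) for token, value in logits.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}

logits = {"Paris": 12.8, "London": 8.4, "Rome": 5.1, "Delhi": 2.3}
probs = softmax(logits)

for token, p in probs.items():
    print(f"{token}: {p:.4f}")
print("sum:", round(sum(probs.values()), 6))
```

Over just these four tokens, Paris receives roughly 0.987 of the probability mass. Softmax is exponential, so a gap of a few points between logits turns into a very large gap between probabilities.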
9. Next Token Prediction
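In the sampling step, one token is chosen from this probability distribution. The token with the highest probability is the most likely choice, but it is not guaranteed to be picked every time.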
Example:
Input: The capital of France is
Generated Token: Paris
This generated token is appended back to the sequence.
New Input: The capital of France is Paris
The purpose of this step is to build the response gradually, one token at a time. Instead of generating the whole paragraph at once, the model repeatedly predicts the next most suitable token.
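Selecting a token from the distribution can be sketched in two ways: always taking the most probable token (often called greedy selection) or drawing a token at random in proportion to its probability. The helper names and the fixed random seed below are assumptions made for this example; the probabilities are the illustrative ones from the Softmax section.

```python
import random

# Illustrative probabilities from the text; the remaining 2% belongs to
# other tokens that are left out of this sketch.
probs = {"Paris": 0.85, "London": 0.07, "Rome": 0.04, "Delhi": 0.02}

def greedy_pick(probs: dict[str, float]) -> str:
    """Always choose the single most probable token."""
    return max(probs, key=probs.get)

def sample_pick(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    # random.choices treats weights as relative, so they need not sum to 1.
    return rng.choices(tokens, weights=weights, k=1)[0]

prompt = ["The", "capital", "of", "France", "is"]
next_token = greedy_pick(probs)
new_input = prompt + [next_token]  # append the chosen token to the sequence
print(" ".join(new_input))

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_pick(probs, rng) for _ in range(1000)]
print(samples.count("Paris"), "of 1000 samples were Paris")
```

Greedy selection always returns Paris here. Weighted sampling returns Paris most of the time and occasionally another token, which is one reason the same prompt can produce different responses.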
10. Autoregressive Generation
Modern decoder-only LLMs generate text using autoregressive generation. This means the model predicts one token, adds it to the sequence and then predicts the next token, again and again.
Example:
The capital of France is
↓
Paris
↓
.
↓
The
↓
city
↓
is
↓
…
This process keeps repeating until an End-of-Sequence token is generated or the maximum token limit is reached. The purpose of autoregressive generation is to allow the model to generate long and coherent responses while maintaining context from previously generated tokens.
This is the same technique used by modern models such as GPT, LLaMA, Claude and Gemini.
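The generation loop itself is short. In the sketch below the real model is replaced by `toy_model`, a hypothetical lookup table that only looks at the last token, so the focus stays on the loop and its two stopping conditions. The `<eos>` marker and the `max_new_tokens` parameter name are also assumptions made for this example.

```python
EOS = "<eos>"  # hypothetical End-of-Sequence marker

# Hypothetical stand-in for a real model: last token -> next token.
NEXT_TOKEN = {
    "is": "Paris",
    "Paris": ".",
    ".": EOS,
}

def toy_model(tokens: list[str]) -> str:
    """Return the next token for the sequence so far.

    A real LLM would run the full pipeline (embeddings, Transformer
    layers, logits, softmax, sampling) over the whole sequence here.
    """
    return NEXT_TOKEN.get(tokens[-1], EOS)

def generate(prompt: list[str], max_new_tokens: int = 10) -> list[str]:
    """Autoregressive loop: predict, append, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):   # stop condition 1: token limit
        next_token = toy_model(tokens)
        if next_token == EOS:         # stop condition 2: End-of-Sequence
            break
        tokens.append(next_token)     # output becomes part of the next input
    return tokens

output = generate(["The", "capital", "of", "France", "is"])
print(" ".join(output))
```

Calling `generate` with `max_new_tokens=1` stops after the single token Paris, which shows the token limit cutting generation off before the End-of-Sequence marker is reached.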