Introduction
When we talk about modern Large Language Models (LLMs) such as ChatGPT, Llama, Gemini, Claude, or Mistral, there is one key technology behind all of them: the attention mechanism. These powerful LLMs are built on the Transformer architecture, and attention is often called the heart of the Transformer.
People often think LLMs are intelligent because they are trained on a huge amount of data. Data is important, but data alone does not make an LLM intelligent: if the model cannot find the relationships between words, it will not generate meaningful responses even after being trained on billions of words.
The attention mechanism was introduced to solve this problem.
Why Did Encoder-Decoder Models Fail?
To understand attention, we first have to understand the problem it solves. Before Transformers, the encoder-decoder architecture was very popular in NLP.
The flow looked like this:
Input sentence → Encoder → Context vector → Decoder → Output sentence
Example:
English:
The capital of France is Paris
↓ Encoder
Compressed Information
↓ Decoder
French:
La capitale de la France est Paris
The encoder reads the whole sentence and stores a compressed summary of it in a single vector, known as the context vector. The decoder then uses only this context vector to generate the output.
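To make the single context vector concrete, here is a minimal sketch of a toy recurrent encoder in plain Python. The function name encode, the random weights, and the sizes embed_dim and hidden_dim are all assumptions made for illustration; a real encoder uses trained weights and a more elaborate cell. The point is only that the output has the same size no matter how long the input is.

```python
import math
import random

def encode(token_vectors, w_x, w_h):
    """Toy recurrent encoder: fold every token into one hidden state."""
    hidden_size = len(w_h)
    h = [0.0] * hidden_size
    for x in token_vectors:
        # The new state mixes the current token with the previous state.
        h = [
            math.tanh(
                sum(w_x[i][j] * x[j] for j in range(len(x)))
                + sum(w_h[i][k] * h[k] for k in range(hidden_size))
            )
            for i in range(hidden_size)
        ]
    return h  # the single context vector

random.seed(0)
embed_dim, hidden_dim = 8, 4  # hypothetical sizes

# Random stand-ins for weights that would normally be learned.
w_x = [[random.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(hidden_dim)]
w_h = [[random.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(hidden_dim)]

def random_sentence(num_tokens):
    """Random vectors standing in for the embeddings of a sentence."""
    return [[random.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(num_tokens)]

short_context = encode(random_sentence(4), w_x, w_h)
long_context = encode(random_sentence(20), w_x, w_h)

# Both sentences end up in a vector of the same length (hidden_dim).
print(len(short_context), len(long_context))
```

A 4-token sentence and a 20-token sentence are both squeezed into hidden_dim numbers, so the longer sentence has to give up more detail.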
Real Problem
This approach worked well for short sentences.
Example:
The sky is blue.
But when the sentence becomes longer:
The animal sitting near the river was watching the birds flying over the trees while the weather was changing rapidly.
Then the real problem starts. The encoder has to store all of the information in a single vector, which creates problems such as:
1. Information Bottleneck Problem
This means:
More information → Single fixed vector → Information loss
2. Long Context Problem
The long-context problem was one of the main reasons attention was invented.
Example:
The animal didn't cross the street because it was tired.
Question:
What does "it" refer to?
Options:
animal
street
In the old encoder-decoder design, it is hard to retain the context of a long sentence. The longer the input, the weaker the information from the beginning of the sentence becomes. As a result, the model gets confused and sometimes generates meaningless responses.
What is Attention?
The basic idea of attention is very simple. Instead of storing all the information in a single vector, attention lets the model look back at the whole sentence whenever it needs to understand a word.
Attention is a mechanism that helps the model decide which words in the sentence are most relevant for understanding the meaning and context of the current word. To do this, a Query (Q), a Key (K), and a Value (V) are generated for every word.
Attention Scores
Attention does not directly state whether a word is important or not. Instead, it assigns each word a numerical value.
Example:
animal = 8.2
street = 0.5
cross = 1.1
tired = 0.9
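These values are called attention scores. They tell the model how useful each word is for the current word. In practice, the raw scores are normalized with softmax into weights that sum to 1.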
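As a small illustration, the example scores above can be turned into weights with softmax, the normalization used in Transformer attention. The helper name softmax is written here for the sketch, and the scores are the illustrative numbers from the example, not values from a real model.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative attention scores for the current word "it".
scores = {"animal": 8.2, "street": 0.5, "cross": 1.1, "tired": 0.9}

weights = dict(zip(scores, softmax(list(scores.values()))))
for word, weight in weights.items():
    print(f"{word}: {weight:.4f}")
```

Because softmax is exponential, a score of 8.2 against scores near 1 gives "animal" more than 99% of the total weight.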
Why Was Attention Revolutionary?
Old model:
Sentence → compress → one vector → output
Attention model:
Sentence → store all words → access any word anytime → output
Now the model has access to the whole sentence, which leads to:
- Improved Translation
- Improved Summarization
- Improved Question Answering
- Improved Long-Context Understanding
Enter Query, Key and Value
Now the question arises: how does the model decide which words need more attention? This is where the concept of QKV comes in.
The Transformer creates three vectors for each token:
Query (Q)
Key (K)
Value (V)
Query (Q)
The query represents what the model is searching for.
Example:
Current word: it
The query asks: which words are related to "it"?
Key (K)
The key represents what information each word has to offer. Every word has its own identity.
Example:
animal → Key
street → Key
cross → Key
Value (V)
The value is the actual information that gets passed along. The higher a word's score, the more of its value is passed to the current token.
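In a Transformer, Q, K, and V are each produced by multiplying the token's embedding by its own weight matrix. The sketch below shows that step in plain Python. The names W_q, W_k, W_v, matvec, and random_matrix, the sizes, and the random numbers are all assumptions for illustration; in a trained model the matrices are learned.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    """Random stand-in for a weight matrix that would normally be learned."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
embed_dim, head_dim = 6, 4  # hypothetical sizes

# One projection matrix per role, shared by every token.
W_q = random_matrix(head_dim, embed_dim, rng)
W_k = random_matrix(head_dim, embed_dim, rng)
W_v = random_matrix(head_dim, embed_dim, rng)

# Random embeddings standing in for real ones.
tokens = ["animal", "street", "it"]
embeddings = {t: [rng.uniform(-1, 1) for _ in range(embed_dim)] for t in tokens}

# Every token gets its own query, key, and value vector.
qkv = {
    t: {"Q": matvec(W_q, e), "K": matvec(W_k, e), "V": matvec(W_v, e)}
    for t, e in embeddings.items()
}

print(sorted(qkv["it"]))     # the three roles for the token "it"
print(len(qkv["it"]["Q"]))   # each vector has head_dim entries
```

The same three matrices are reused for every token, so what differs between tokens is only the embedding that goes in.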
Simplified Flow
Token
↓
Embedding
↓
Q K V
↓
Compare Query with Keys
↓
Generate Attention Scores
↓
Turn Scores into Weights (softmax)
↓
Blend the Values by Weight
↓
Create Context-Aware Representation
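The flow above can be sketched for a single query as scaled dot-product attention. This is a minimal plain-Python version with hand-picked toy vectors; the function names softmax and attend and all the numbers are assumptions for illustration. Note that no word is strictly selected: every value contributes in proportion to its weight.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    # Compare the query with every key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Normalize the scores into weights.
    weights = softmax(scores)
    # Blend the values according to the weights.
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output

# Hand-picked toy vectors: the query points the same way as the first key.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

The first key matches the query best, so the first value dominates the output, while the other two still contribute a little.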
Example
Sentence:
The animal didn't cross the street because it was tired.
For word:
it
Attention output:
animal → 90%
street → 2%
cross → 5%
tired → 3%
Final conclusion:
it = animal
And the model correctly understands that:
The animal was tired.
Attention vs Embeddings
An embedding answers "What does this word mean?".
Example:
dog → vector
cat → vector
It represents the meaning of the word.
Attention answers "How is this word related to the other words?".
Example:
bank
Sentence 1: I deposited money in the bank.
Here attention will focus on "money" and "deposited", which points to a financial bank.
Sentence 2: I sat near the river bank.
Here attention will focus on "river" and "near", which points to a river bank.
This is how modern LLMs are able to understand context.
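The "bank" example can be sketched in a few lines. This is a deliberately simplified version that uses each word's embedding directly as its query, key, and value, skipping the learned projections. The two-dimensional embeddings are hand-picked assumptions (first number roughly "finance", second roughly "nature"), and contextualize is a name invented for this sketch.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(target, sentence, embeddings):
    """Attention-weighted mix of the sentence's embeddings, seen from target."""
    query = embeddings[target]
    scores = [sum(q * k for q, k in zip(query, embeddings[w])) for w in sentence]
    weights = softmax(scores)
    return [
        sum(wt * embeddings[w][i] for wt, w in zip(weights, sentence))
        for i in range(len(query))
    ]

# Hand-picked toy embeddings: [finance-ness, nature-ness].
embeddings = {
    "bank": [1.0, 1.0],        # ambiguous on its own
    "money": [2.0, 0.0],
    "deposited": [1.5, 0.0],
    "river": [0.0, 2.0],
    "near": [0.2, 0.5],
}

finance_bank = contextualize("bank", ["deposited", "money", "bank"], embeddings)
river_bank = contextualize("bank", ["near", "river", "bank"], embeddings)

print([round(x, 3) for x in finance_bank])
print([round(x, 3) for x in river_bank])
```

The embedding of "bank" is identical in both calls, yet the result leans toward finance in the first sentence and toward nature in the second, because the surrounding words differ.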