Attention Mechanism in LLMs


Introduction

Behind modern Large Language Models (LLMs) such as ChatGPT, Llama, Gemini, Claude, and Mistral lies an important technique called the attention mechanism. These powerful models are built on the Transformer architecture, and attention is often called the heart of the Transformer.
People often think LLMs are intelligent because they are trained on a huge amount of data. Data is important, but data alone does not make an LLM intelligent: if the model cannot find the relationships between words, it will not generate meaningful responses even after being trained on billions of examples.
The attention mechanism was introduced to solve this problem.

Why Did Encoder-Decoder Models Fail?

To understand attention, we first have to understand the problem it solves. Before Transformers, the encoder-decoder architecture was very popular in NLP.
The flow looked like this:

Input sentence → Encoder → Context vector → Decoder → output sentence

Example:

English:
The capital of France is Paris
↓ Encoder
Compressed Information
↓ Decoder
French:
La capitale de la France est Paris

The encoder reads the whole sentence and stores a compressed summary of it in a single vector, known as the context vector. The decoder then uses the context vector to generate the output.

Real Problem

This approach worked well for short sentences.

Example:

The sky is blue.

But when the sentence becomes longer:

The animal sitting near the river was watching the birds flying over the trees while the weather was changing rapidly.

Then the real problem starts. The encoder has to store all of the information in a single vector, which creates problems like:

1. Information Bottleneck Problem

In other words:

More information → Single Fixed vector → Information loss

2. Long Context Problem

The long-context problem was one of the main motivations for inventing attention.

Example:

The animal didn't cross the street because it was tired.

Question:

What does "it" refer to?

Options:

animal
street

In the old encoder-decoder design, it is difficult to retain the context of a long sentence. The longer the input, the weaker the information from the beginning becomes. As a result, the model gets confused and sometimes generates meaningless responses.
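The bottleneck can be sketched with a toy encoder. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical stand-in for learned embeddings, and simple averaging stands in for a real encoder. The point it shows is that a 4-word sentence and a 20-word sentence both end up in a vector of the same size.

```python
# Toy sketch: an encoder that squeezes any sentence into ONE fixed-size vector.
# Averaging replaces a real encoder here (an assumption made for illustration).

def toy_embed(word, dim=4):
    """Hypothetical embedding: deterministic numbers derived from the characters."""
    total = sum(ord(c) for c in word)
    return [((total * (i + 3)) % 17) / 17.0 for i in range(dim)]

def encode_to_context_vector(sentence, dim=4):
    """Average all word vectors into a single context vector of length `dim`."""
    vectors = [toy_embed(w, dim) for w in sentence.split()]
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

short_sentence = "The sky is blue."
long_sentence = ("The animal sitting near the river was watching the birds flying "
                 "over the trees while the weather was changing rapidly.")

short_ctx = encode_to_context_vector(short_sentence)
long_ctx = encode_to_context_vector(long_sentence)

# Both sentences are compressed into the same number of values.
print(len(short_sentence.split()), "words ->", len(short_ctx), "numbers")
print(len(long_sentence.split()), "words ->", len(long_ctx), "numbers")
```

The longer the sentence, the more words must share the same few numbers, which is where information gets lost.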

What is Attention?

The basic idea of attention is very simple. Instead of storing all the information in a single vector, attention lets the model look back at the whole sentence whenever it needs to understand a word.

Attention is a mechanism that helps the model decide which words in the sentence are most relevant to the meaning and context of the current word. To do this, a Query (Q), a Key (K), and a Value (V) are generated for every word.

Attention Scores

Attention does not directly state whether a word is important or not. Instead, it assigns numerical values.

Example:

animal = 8.2
street = 0.5
cross = 1.1
tired = 0.9

These values are called attention scores. They tell the model how useful each word is for the current word.
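In Transformers, raw scores like these are normally passed through a softmax function so that they become positive weights that add up to 1. A minimal sketch using the example scores above (the helper name `softmax` is our own choice):

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = {"animal": 8.2, "street": 0.5, "cross": 1.1, "tired": 0.9}
weights = dict(zip(raw_scores, softmax(list(raw_scores.values()))))

for word, weight in weights.items():
    print(f"{word:>7}: {weight:.4f}")
```

Because softmax uses exponentials, a score that is much higher than the others (here "animal") ends up with almost all of the weight.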

Why Was Attention Revolutionary?

Old model:

Sentence → compress → one vector → output

Attention model:

Sentence → store all words → access any word anytime → output

Now the model has access to the whole sentence, which leads to:

  • Improved translation
  • Improved summarization
  • Improved question answering
  • Improved long-context understanding

Enter Query, Key and Value

Now the question arises: how does the model decide which word needs more attention? This is where the concept of Query, Key, and Value (QKV) comes in.

The Transformer creates three vectors for each token:

Query (Q)
Key (K)
Value (V)
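In a Transformer, these three vectors are typically produced by multiplying the token's embedding by three separate learned weight matrices. The sketch below shows that step for one token. All numbers are made up for illustration, and the names `W_q`, `W_k`, `W_v`, and `matvec` are assumptions, not values from a real model.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

# Hypothetical 3-dimensional embedding for one token.
embedding = [1.0, 0.5, -0.5]

# Hypothetical "learned" weights; a real model learns these during training.
W_q = [[0.2, 0.1, 0.0], [0.0, 0.3, 0.1]]
W_k = [[0.1, 0.0, 0.4], [0.5, 0.1, 0.0]]
W_v = [[0.3, 0.3, 0.0], [0.0, 0.2, 0.2]]

query = matvec(W_q, embedding)   # what this token is looking for
key = matvec(W_k, embedding)     # how this token presents itself to others
value = matvec(W_v, embedding)   # the information this token passes on

print("Q:", query)
print("K:", key)
print("V:", value)
```

The same embedding gives three different vectors because each matrix plays a different role.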

Query (Q)

The Query represents what the model is searching for.

Example:

Current word: it
Query asks: which words are related to “it”?

Key (K)

The Key represents what information each word has to offer. Each word has its own identity.

Example:

animal → Key
street → Key
cross → Key

Value (V)

The Value is the actual information that will be transferred. The higher a word's score, the more of its Value is passed to the current token.

Simplified Flow

Token
 ↓
Embedding
 ↓
Q K V
 ↓
Compare Query with Keys
 ↓
Generate Attention Scores
 ↓
Select Important Words
 ↓
Take their Values
 ↓
Create Context-Aware Representation
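The flow above can be sketched as a small function. This is a minimal version of scaled dot-product attention for a single query, assuming the Q, K, and V vectors already exist. The vectors are hypothetical toy values chosen so that the query for "it" points toward the key of "animal".

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Compare the query with every key, then mix the values by the weights."""
    d_k = len(query)
    # Dot product of query and each key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors gives the context-aware representation.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

tokens = ["animal", "street", "cross"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]        # hypothetical keys
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]    # hypothetical values
query = [4.0, 0.0]                                 # hypothetical query for "it"

weights, output = attention(query, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token:>7}: {weight:.2f}")
print("context-aware vector:", [round(x, 2) for x in output])
```

Note that no word is thrown away completely: every value contributes, but words with higher weights contribute more.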

Example

Sentence:

The animal didn't cross the street because it was tired.

For word:

it

Attention output:

animal → 90%
street → 2%
cross → 5%
tired → 3%

Final conclusion:

it = animal

And the model correctly understands that:

The animal was tired.

Attention vs Embeddings

An embedding answers "What does this word mean?"

Example:

dog → vector
cat → vector

It represents the meaning of the word.

Attention answers "How is this word related to other words?"

Example:

bank
Sentence 1: I deposited money in the bank.

Here attention will focus on "money" and "deposited", which points to a financial bank.

Sentence 2: I sat near the river bank.

Here attention will focus on "river" and "near", which points to a river bank.

This is how modern LLMs are able to understand context.
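The "bank" example can be sketched in a few lines. This is a simplified illustration, not how a real model is configured: the 2-dimensional embeddings are hypothetical (first number = "finance", second number = "nature"), small words such as "I" and "the" are left out, and the embedding itself is used as Query, Key, and Value with no learned weights.

```python
import math

def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextual_vector(target, sentence, embeddings):
    """Self-attention for one word, using Q = K = V = embedding (an assumption)."""
    query = embeddings[target]
    vectors = [embeddings[word] for word in sentence]
    scores = [sum(q * k for q, k in zip(query, vec)) / math.sqrt(len(query))
              for vec in vectors]
    weights = softmax(scores)
    return [sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(len(query))]

# Hypothetical embeddings: [finance, nature]. "bank" starts out ambiguous.
embeddings = {
    "bank": [1.0, 1.0],
    "money": [2.0, 0.0],
    "deposited": [1.5, 0.0],
    "river": [0.0, 2.0],
    "near": [0.0, 1.0],
}

finance_bank = contextual_vector("bank", ["deposited", "money", "bank"], embeddings)
river_bank = contextual_vector("bank", ["near", "river", "bank"], embeddings)

print("bank (sentence 1):", [round(x, 2) for x in finance_bank])
print("bank (sentence 2):", [round(x, 2) for x in river_bank])
```

The embedding of "bank" is identical in both sentences, but after attention the first result leans toward "finance" and the second toward "nature".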
