Introduction
The attention mechanism transformed Natural Language Processing (NLP) and made today's Large Language Models (LLMs) possible. Attention lets a model find the relationships between the different words in a sentence and gather information with an understanding of context.
But modern Transformers do not rely on a single, generic attention step. Their core components are Self-Attention and Multi-Head Attention.
Self-Attention lets the model compare each token of a sentence with the other tokens of the same sentence, whereas Multi-Head Attention lets it look at the sentence from several perspectives at once.
What is Self-Attention?
Self-Attention is a mechanism in which the current token of a sentence looks at the other tokens of the same sentence and decides which of them deserve the most focus for understanding the current token in context.
Example:
The animal didn't cross the street because it was tired.
Here, when the model processes the word "it", it looks at the whole sentence and considers possible relationships with words such as:
animal
cross
street
tired
Then the model calculates an importance for each candidate:
animal → High Importance
street → Low Importance
cross → Medium Importance
and reaches a conclusion like:
it = animal
This is the basic idea of Self-Attention.
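The importance levels above can be made concrete with a small sketch. The raw scores below are hypothetical numbers chosen for illustration, not values from a trained model; the `softmax` helper simply turns them into weights that sum to 1.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the token "it" against three candidate words.
candidates = ["animal", "street", "cross"]
raw_scores = [3.0, 0.5, 1.5]

weights = softmax(raw_scores)
for word, weight in zip(candidates, weights):
    print(f"{word:>6}: {weight:.2f}")

# The candidate with the largest weight is the one "it" attends to most.
most_relevant = candidates[weights.index(max(weights))]
print("it ->", most_relevant)
```

In a real model the scores come from learned projections of the token embeddings rather than hand-picked numbers.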
Why is it Called Self-Attention?
The word "Self" means that the sentence attends to itself.
Example:
I love learning Artificial Intelligence.
Here:
I
love
learning
Artificial
Intelligence
All the words can exchange information with each other, so the sentence attends to itself.
Self-Attention Flow
Input Tokens
↓
Embeddings
↓
Query, Key, Value
↓
Attention Scores
↓
Softmax
↓
Weighted Values
↓
Context-Aware Representation
This process is repeated for each token.
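The flow above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the projection matrices `w_q`, `w_k`, `w_v` and the toy embeddings are assumed values (a real model learns them), and the division by the square root of the key dimension follows the standard scaled dot-product formulation, which the flow above does not spell out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(embeddings, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one sequence of token embeddings."""
    queries = matmul(embeddings, w_q)   # Embeddings -> Query
    keys = matmul(embeddings, w_k)      # Embeddings -> Key
    values = matmul(embeddings, w_v)    # Embeddings -> Value
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:                   # repeated for each token
        # Attention scores: query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted values: the context-aware representation of this token.
        context = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: 3 tokens with 2-dimensional embeddings (illustrative values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projection matrices
outputs, weights = self_attention(embeddings, identity, identity, identity)

for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row holds one token's attention weights over all tokens in the sequence. Real implementations compute all rows at once with matrix operations rather than a Python loop.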
Why is Self-Attention Powerful?
Without Self-Attention, a word is represented on its own, independent of its sentence; with Self-Attention, the word is understood in the context of the sentence.
Example:
Sentence 1: I deposited money in the bank.
Sentence 2: I sat near the river bank.
The word "bank" is the same in both sentences, but Self-Attention looks at the surrounding words:
"deposited" and "money" point to a financial bank,
and "near" and "river" point to a river bank.
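A small sketch can show how the same word ends up with different representations in the two sentences. Everything here is an assumption made for illustration: the 2-dimensional embeddings are hand-picked (first dimension "finance", second "nature"), only the content words of each sentence are kept, and the learned Query/Key/Value projections are skipped so that each token's embedding serves as all three.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(tokens, vectors, target):
    """Return the attention-weighted mix of all vectors, as seen from `target`."""
    query = vectors[tokens.index(target)]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
    weights = softmax(scores)
    return [sum(w * vec[d] for w, vec in zip(weights, vectors))
            for d in range(len(query))]

# Hypothetical embeddings: [finance-ness, nature-ness].
embedding = {
    "deposited": [1.5, 0.0],
    "money": [2.0, 0.0],
    "near": [0.0, 0.5],
    "river": [0.0, 2.0],
    "bank": [1.0, 1.0],  # ambiguous on its own
}

sentence_1 = ["deposited", "money", "bank"]
sentence_2 = ["near", "river", "bank"]

bank_1 = contextualize(sentence_1, [embedding[t] for t in sentence_1], "bank")
bank_2 = contextualize(sentence_2, [embedding[t] for t in sentence_2], "bank")
print("bank in sentence 1:", [round(x, 2) for x in bank_1])
print("bank in sentence 2:", [round(x, 2) for x in bank_2])
```

With these toy vectors, the representation of "bank" leans toward the finance dimension in the first sentence and toward the nature dimension in the second, even though the input embedding for "bank" is identical.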
Limitation of Single Self-Attention
A natural question is: if Self-Attention is so powerful, why do we need Multi-Head Attention? The reason is that a single Self-Attention head computes only one set of attention weights per token, so it tends to focus on a single type of relationship.
Example sentence:
The boy who studied hard passed the exam.
The sentence contains multiple relationships:
boy ↔ studied
boy ↔ passed
hard ↔ studied
exam ↔ passed
A single attention mechanism cannot capture all of these relationships equally well.
What is Multi-Head Attention?
Multi-Head Attention runs several attention mechanisms (heads) in parallel instead of a single attention layer.
Example:
Head 1
Head 2
Head 3
Head 4
Each head analyzes the sentence from a different perspective.
Different Heads Learn Different Things
Example:
The boy who studied hard passed the exam.
Head 1: Grammar Relationship
Head 2: Subject-Verb Relationship
Head 3: Long Distance Dependency
Head 4: Semantic Meaning
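Each head learns different patterns. The labels above are illustrative: heads are not assigned roles by hand, and what each one captures emerges during training.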
Multi-Head Attention Architecture
Input Embeddings
↓
Split into Heads
↓
Head 1 Attention
Head 2 Attention
Head 3 Attention
Head 4 Attention
↓
Concatenate
↓
Linear Layer
↓
Output
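The architecture above can be sketched as follows. This is a simplified illustration under stated assumptions: each head works on its own slice of the embedding and uses that slice directly as Query, Key, and Value (real models apply learned per-head projections), and `w_o` is a hypothetical output matrix standing in for the trained linear layer.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def multi_head_attention(embeddings, num_heads, w_o):
    d_model = len(embeddings[0])
    if d_model % num_heads != 0:
        raise ValueError("embedding size must be divisible by the number of heads")
    d_head = d_model // num_heads

    # Split into heads: each head sees its own slice of every embedding.
    head_outputs = []
    for h in range(num_heads):
        chunk = [row[h * d_head:(h + 1) * d_head] for row in embeddings]
        head_outputs.append(attention(chunk, chunk, chunk))

    # Concatenate the per-head results token by token.
    concatenated = []
    for t in range(len(embeddings)):
        merged = []
        for head in head_outputs:
            merged.extend(head[t])
        concatenated.append(merged)

    # Linear layer: mix information across heads.
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*w_o)]
            for row in concatenated]

# Toy input: 3 tokens, 4-dimensional embeddings, 2 heads (illustrative values).
embeddings = [[1.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0]]
w_o = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity stand-in
output = multi_head_attention(embeddings, num_heads=2, w_o=w_o)
print([[round(x, 2) for x in row] for row in output])
```

The output has the same shape as the input, one vector per token, which is what lets Transformer layers be stacked on top of each other.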
Example
Sentence: The capital of France is Paris.
Head 1: capital ↔ Paris
Head 2: France ↔ Paris
Head 3: The ↔ capital
Head 4: Sentence Structure
Together, the heads capture a richer set of relationships than any single head.
Why is Multi-Head Attention Important?
Benefits:
- Better Context Understanding
- Better Long-Range Dependencies
- Better Semantic Understanding
- Better Reasoning
Self-Attention vs Multi-Head Attention
| Self-Attention | Multi-Head Attention |
|---|---|
| Single attention mechanism | Multiple attention mechanisms |
| One perspective | Multiple perspectives |
| Limited pattern capture | Rich pattern capture |
| Simpler | More powerful |
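| Lower computation | Similar total computation when the model dimension is split across heads; higher if each head keeps the full dimension |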
Where is Multi-Head Attention Used?
Almost every modern LLM family uses Multi-Head Attention or a close variant of it (for example, grouped-query attention):
- OpenAI GPT Models
- Meta LLaMA
- Google Gemini
- Mistral AI Mistral
- Anthropic Claude