Self-Attention and Multi-Head Attention in Large Language Models (LLMs)


Introduction

The attention mechanism completely transformed Natural Language Processing (NLP) and Large Language Models (LLMs). Attention lets a model find the relationships between the words in a sentence and gather information with an understanding of context.
Modern Transformers, however, do not use attention in just any form. Their core components are self-attention and multi-head attention.

Self-attention lets the model compare each token in a sentence with the other tokens in the same sentence, whereas multi-head attention lets it view the sentence from several perspectives at once.

What is Self-Attention?

Self-attention is a mechanism in which each token in a sentence looks at the other tokens in the same sentence and decides which of them matter most for understanding its own meaning in context.

Example:

The animal didn't cross the street because it was tired.

When the model processes the word "it", it looks at the whole sentence and considers the words that "it" could relate to, such as:

animal
cross
street
tired

The model then calculates how important each word is for understanding "it":

animal → High Importance
street → Low Importance
cross → Medium Importance

and reaches the conclusion:

it = animal

This is the basic idea of self-attention.
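
The importance levels above can be sketched with a softmax, which turns raw scores into weights that sum to 1. The scores below are made-up numbers chosen only to mirror the example (a real model computes them from learned query and key vectors), so treat this as an illustration rather than actual model output.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores for the token "it" (illustrative values only).
candidates = ["animal", "cross", "street", "tired"]
raw_scores = [4.0, 2.0, 0.5, 1.0]

weights = softmax(raw_scores)
for word, weight in zip(candidates, weights):
    print(f"{word:>7}: {weight:.2f}")

# The candidate with the largest weight is the one "it" attends to most.
best = candidates[weights.index(max(weights))]
print("it ->", best)
```

With these scores, "animal" receives the largest weight, "cross" a medium one, and "street" the smallest.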

Why is it Called Self-Attention?

The word "self" means that the sentence attends to itself: the queries, keys, and values all come from the same sequence.

Example:

I love learning Artificial Intelligence.

Here, the tokens are:

I
love
learning
Artificial
Intelligence

Every word can attend to every other word, so the sentence attends to itself.

Self-Attention Flow

Input Tokens
      ↓
Embeddings
      ↓
Query, Key, Value
      ↓
Attention Scores
      ↓
Softmax
      ↓
Weighted Values
      ↓
Context-Aware Representation

This process is carried out for every token in the sentence. In practice, all tokens are processed in parallel as matrix operations.
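
The flow above can be sketched in plain Python. This is a minimal sketch under several assumptions: the embeddings and projection matrices are random stand-ins for learned values, the helper names (matmul, random_matrix, self_attention) are hypothetical, and the scores are divided by the square root of the key dimension as in the standard scaled dot-product formulation, a step the diagram does not show.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    largest = max(row)
    exps = [math.exp(v - largest) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols):
    """Random stand-in for learned parameters or embeddings."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one sequence.

    x holds one embedding row per token. Returns (context, weights).
    """
    q = matmul(x, w_q)  # Query
    k = matmul(x, w_k)  # Key
    v = matmul(x, w_v)  # Value
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    # Attention scores: every query compared with every key, then scaled.
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # Softmax, one row per token
    context = matmul(weights, v)                # Weighted values
    return context, weights

random.seed(0)
tokens = ["I", "love", "learning", "Artificial", "Intelligence"]
d_model = 4  # tiny embedding size, chosen only for readability

x = random_matrix(len(tokens), d_model)  # stand-in embeddings
w_q, w_k, w_v = (random_matrix(d_model, d_model) for _ in range(3))

context, weights = self_attention(x, w_q, w_k, w_v)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how one token distributes its attention over all five tokens, and each row of context is that token's context-aware representation.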

Why is Self-Attention Powerful?

Without self-attention, each word is represented on its own, independent of the words around it. With self-attention, each word is understood in the context of the whole sentence.

Example:

Sentence 1: I deposited money in the bank.
Sentence 2: I sat near the river bank.

The word "bank" is the same in both sentences, but self-attention looks at the other words in each sentence:

money, deposited → financial bank
river, near → river bank
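
A toy sketch of this effect is shown below. The two-dimensional embeddings are hand-made and hypothetical (first value "finance-ness", second value "nature-ness"), and the raw embeddings are used directly as queries, keys, and values instead of learned projections. The point is only that the same starting vector for "bank" becomes a different context-aware vector in each sentence.

```python
import math

# Hand-made 2-D embeddings (hypothetical): [finance-ness, nature-ness].
EMBED = {
    "money": [1.0, 0.0],
    "deposited": [0.9, 0.1],
    "river": [0.0, 1.0],
    "near": [0.1, 0.6],
    "bank": [0.5, 0.5],  # ambiguous on its own
}

def softmax(row):
    largest = max(row)
    exps = [math.exp(v - largest) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(target, sentence):
    """Attention-weighted mix of the sentence's embeddings, seen from `target`.

    Simplification: queries, keys, and values are the raw embeddings.
    """
    query = EMBED[target]
    keys = [EMBED[word] for word in sentence]
    scale = math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(len(query))]

financial = contextualize("bank", ["deposited", "money", "bank"])
river = contextualize("bank", ["near", "river", "bank"])

print("bank near money/deposited:", [round(v, 2) for v in financial])
print("bank near river/near:     ", [round(v, 2) for v in river])
```

In the first sentence the resulting vector leans toward the finance dimension, and in the second it leans toward the nature dimension, even though "bank" starts from the same embedding in both.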

Limitation of Single Self-Attention

If self-attention is so powerful, why do we need multi-head attention? The reason is that a single self-attention head produces only one set of attention weights per token, so it tends to focus on one type of relationship at a time.

Example sentence:

The boy who studied hard passed the exam.

The sentence contains multiple relationships:

boy ↔ studied
boy ↔ passed
hard ↔ studied
exam ↔ passed

A single attention head cannot capture all of these relationships well at the same time.

What is Multi-Head Attention?

Multi-head attention runs several attention heads in parallel instead of relying on a single attention layer.

Example:

Head 1
Head 2
Head 3
Head 4

Each head analyzes the sentence from a different perspective.

Different Heads Learn Different Things

Example: 

The boy who studied hard passed the exam.
Head 1: Grammar Relationship
Head 2: Subject-Verb Relationship
Head 3: Long Distance Dependency
Head 4: Semantic Meaning

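Each head learns different patterns. The roles listed here are illustrative: what a head actually learns emerges during training and is not assigned in advance.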

Multi-Head Attention Architecture

Input Embeddings
        ↓
   Split into Heads
        ↓
  Head 1 Attention
  Head 2 Attention
  Head 3 Attention
  Head 4 Attention
        ↓
     Concatenate
        ↓
    Linear Layer
        ↓
       Output
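
The architecture above can be sketched in plain Python as follows. This is a minimal sketch, not a production implementation: the weights are random stand-ins for learned parameters, the helper names are hypothetical, and it follows the common convention of projecting the input to queries, keys, and values first and then slicing each projection into heads of size d_model / num_heads.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    largest = max(row)
    exps = [math.exp(v - largest) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols):
    """Random stand-in for learned parameters or embeddings."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def split_heads(m, num_heads):
    """Slice the columns of m into num_heads equal chunks, one matrix per head."""
    d_head = len(m[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in m] for h in range(num_heads)]

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    # Each head runs attention on its own slice of the projections.
    head_outputs = [
        attention(q_h, k_h, v_h)
        for q_h, k_h, v_h in zip(
            split_heads(q, num_heads), split_heads(k, num_heads), split_heads(v, num_heads)
        )
    ]
    # Concatenate: join each token's per-head rows back into one d_model-wide row.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)  # final linear layer

random.seed(0)
tokens = "The boy who studied hard passed the exam".split()
d_model, num_heads = 8, 4  # small sizes, chosen only for readability

x = random_matrix(len(tokens), d_model)  # stand-in embeddings
w_q, w_k, w_v, w_o = (random_matrix(d_model, d_model) for _ in range(4))

output = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(len(output), "tokens x", len(output[0]), "dimensions")
```

The output has the same shape as the input, one d_model-wide vector per token, which is what allows attention layers to be stacked.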

Example

Sentence: The capital of France is Paris.
Head 1: capital ↔ Paris
Head 2: France ↔ Paris
Head 3: The ↔ capital
Head 4: Sentence Structure

Together, the heads capture a richer set of relationships than any single head could.

Why is Multi-Head Attention Important?

Benefits:

  • Better Context Understanding
  • Better Long-Range Dependencies
  • Better Semantic Understanding
  • Better Reasoning

Self-Attention vs Multi-Head Attention

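Self-Attention (single head)    Multi-Head Attention
----------------------------    --------------------
Single attention mechanism      Multiple attention mechanisms
One perspective                 Multiple perspectives
Limited pattern capture         Rich pattern capture
Simpler                         More powerful
Lower computation               Similar or higher computation

Note: in the standard Transformer design, each of the h heads works on d_model / h dimensions, so the total cost of multi-head attention is close to that of one full-width head. The cost grows only if every head keeps the full width.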

Where is Multi-Head Attention Used?

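Multi-head attention, or a close variant of it such as grouped-query attention, is used in almost every modern Transformer-based LLM family, including:

  • OpenAI GPT Models
  • Meta LLaMA
  • Google Gemini
  • Mistral AI Mistral
  • Anthropic Claude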
