Positional Encoding and RoPE in LLMs


Introduction

The foundation of modern Large Language Models (LLMs) such as ChatGPT, Llama, Gemini and Claude is the Transformer. A Transformer can process all the tokens of a sentence in parallel, instead of reading the words one by one as an RNN or LSTM does.

Processing the sentence in parallel makes Transformers extremely fast, but it also creates a new problem: the model does not know the order of the words in the sentence.

Example:

Dog bites man
and
Man bites dog

contain exactly the same words but have completely different meanings.

For humans, understanding the order of words is natural. But a Transformer only sees numerical vectors and cannot infer the position of a word automatically. Without positional information every token looks independent and the model loses the sentence structure.

To solve this problem, Transformers use Positional Encoding. Positional Encoding provides the position of each token, which helps the model understand the order of words while still processing all tokens in parallel.

As the Transformer architecture evolved, researchers introduced various methods to represent the positional information. Some common approaches are:

  • Sinusoidal Positional Encoding
  • Learned Positional Embeddings
  • Rotary Positional Embeddings (RoPE)

Among these approaches, RoPE has become one of the most popular positional techniques in modern LLMs. It helps the model handle long contexts better and improves performance on longer sequences.

Before looking at RoPE, we first need to understand why positional information is needed in Transformers and how traditional positional encoding works.

Why Do Transformers Need Positional Information?

Let's understand the problem with an example. Suppose we have this sentence:

I love AI

After tokenization:

["I", "love", "AI"]

After embeddings:

I     → [0.2, 0.5, ...]
love  → [1.1, -0.4, ...]
AI    → [0.8, 0.7, ...]

The Transformer receives only vectors.

It does not know:

I comes first
love comes second
AI comes third

For the model:

[0.2,0.5,...]
[1.1,-0.4,...]
[0.8,0.7,...]

are simply vectors; no position information exists in them.

What Problem Does This Create?

Consider two sentences:

sentence 1: Dog bites man
sentence 2: Man bites dog

Embeddings only represent meaning. They do not represent order.

Without positional information:

Dog Bites Man
and
Man Bites Dog

look very similar to the Transformer, even though their meanings are completely different. So the Transformer needs position information.
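A minimal sketch of why the two sentences look alike without positions. It assumes toy 2-dimensional embeddings and a simplified self-attention with no learned weights (the values and the helper names `softmax` and `self_attention` are illustrative assumptions, not taken from any real model):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Dot-product self-attention with Q = K = V = the inputs (no learned weights)."""
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)])
    return outputs

# Hypothetical embeddings, chosen only for illustration.
emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

s1 = ["dog", "bites", "man"]
s2 = ["man", "bites", "dog"]
out1 = dict(zip(s1, self_attention([emb[w] for w in s1])))
out2 = dict(zip(s2, self_attention([emb[w] for w in s2])))

# Each word ends up with the same output vector in both sentences,
# so the word order left no trace in the result.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for w in emb
    for a, b in zip(out1[w], out2[w])
)
print(same)  # True
```

Swapping "dog" and "man" only reorders the outputs; the vector computed for each word does not change.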

Solution: Positional Encoding

Researchers introduced Positional Encoding to fix this. The idea is simple: every token receives both its meaning (the token embedding) and its position information.

Final Input:

Final Embedding = Token Embedding + Positional Encoding

Example:

Token Embedding = [0.2, 0.4, 0.6]
Position 1 Encoding = [0.01, 0.02, 0.03]

Final:

[0.21, 0.42, 0.63]

Now the model knows both the meaning and position.
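The same arithmetic as a small sketch in Python. The vectors for position 1 come from the example above; the encoding for position 2 and the helper name `add_position` are made up for illustration:

```python
def add_position(token_embedding, positional_encoding):
    """Element-wise sum: Final Embedding = Token Embedding + Positional Encoding."""
    return [t + p for t, p in zip(token_embedding, positional_encoding)]

token_embedding = [0.2, 0.4, 0.6]
position_1_encoding = [0.01, 0.02, 0.03]
position_2_encoding = [0.02, 0.04, 0.06]  # hypothetical values

final_1 = add_position(token_embedding, position_1_encoding)
final_2 = add_position(token_embedding, position_2_encoding)

print([round(x, 2) for x in final_1])  # [0.21, 0.42, 0.63]
print(final_1 != final_2)              # True: same token, different position
```

The same token now produces a different input vector depending on where it appears in the sentence.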

Types of Positional Encoding

Over time researchers proposed different methods:

  1. Sinusoidal Positional Encoding
  2. Learned Positional Embeddings
  3. Rotary Positional Embeddings (RoPE)

1. Sinusoidal Positional Encoding

This method was introduced in the original Transformer paper, Attention Is All You Need (2017). Instead of learning the positions, the researchers generated them with mathematical functions. The encoding uses sine and cosine functions of different frequencies, so that each position receives a unique pattern.

Example:

Position 1 → Pattern A
Position 2 → Pattern B
Position 3 → Pattern C

Every position becomes mathematically unique.

Why Sine and Cosine?

Because they provide smooth patterns, periodic behaviour and relative distance information. Nearby positions receive similar representations and far positions receive different representations, which helps attention understand sequence order.
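A minimal sketch of the sinusoidal pattern, following the formula from the 2017 paper: even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cos. The size `d_model = 16` and the positions compared are arbitrary choices for illustration:

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sine on even dimensions, cosine on odd dimensions, one frequency per pair."""
    encoding = []
    for i in range(0, d_model, 2):  # i is the even dimension index (2i in the paper)
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_model = 16  # small size chosen for illustration
pe = [sinusoidal_encoding(pos, d_model) for pos in range(50)]

near = dot(pe[10], pe[11])  # neighbouring positions
far = dot(pe[10], pe[40])   # distant positions
print(near > far)  # True: position 11 is more similar to 10 than position 40 is
```

The dot product of two encodings depends only on the distance between the positions, which is where the relative distance information comes from.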

Limitations of Sinusoidal Encoding

  • Fixed Representation
  • Less Flexible
  • Long Context Issues

2. Learned Positional Embeddings

Here, instead of generating the positions mathematically, the model learns them.

Example:

Position 1 → Learned Vector
Position 2 → Learned Vector 
Position 3 → Learned Vector

These vectors are stored in a lookup table, similar to token embeddings.

During training:

Backpropagation
↓
Position Vectors Updated

The model learns position vectors that fit its training data.
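A toy sketch of that training loop, assuming a table with 4 positions, 3 dimensions and a made-up squared-error objective. The names `position_table`, `sgd_step` and `target` are hypothetical; a real model would compute the gradient from its language-modelling loss:

```python
import random

random.seed(0)
max_positions, d_model = 4, 3

# One trainable vector per position, randomly initialised like token embeddings.
position_table = [[random.uniform(-0.1, 0.1) for _ in range(d_model)]
                  for _ in range(max_positions)]
untouched = list(position_table[0])

def sgd_step(table, position, grad, lr=0.1):
    """Update only the row of the position that took part in the forward pass."""
    table[position] = [w - lr * g for w, g in zip(table[position], grad)]

# Toy objective (an assumption for illustration): pull position 1 towards a target.
target = [0.5, -0.5, 0.0]

def loss(vec):
    return 0.5 * sum((v - t) ** 2 for v, t in zip(vec, target))

before = loss(position_table[1])
for _ in range(20):
    grad = [v - t for v, t in zip(position_table[1], target)]  # d(loss)/d(vec)
    sgd_step(position_table, 1, grad)
after = loss(position_table[1])

print(after < before)                  # True: the position vector was learned
print(position_table[0] == untouched)  # True: unused positions are not updated
```

One consequence of this design is that a position never seen during training keeps its random initial vector.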

Advantages

What is RoPE?

RoPE stands for Rotary Positional Embeddings. It was introduced to provide positional information directly inside the attention mechanism.

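Instead of adding a position vector to the embedding:

Embedding + Position

RoPE rotates the

Query
Key

vectors by an angle that depends on the token's position, just before the attention scores are computed. This is the major difference from the earlier methods.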
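A minimal sketch of the rotation, assuming the adjacent-pair layout (dimensions 0 and 1 form the first pair, 2 and 3 the second, and so on) and the commonly used base of 10000. The function name `rope` and the example vectors are illustrative; production implementations are vectorised and may pair the dimensions differently:

```python
import math

def rope(vector, position, base=10000.0):
    """Rotate each pair (x[i], x[i+1]) by the angle position * base**(-i/d).

    Assumes the vector has an even number of dimensions.
    """
    d = len(vector)
    rotated = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        x, y = vector[i], vector[i + 1]
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # hypothetical query vector
k = [1.0, 0.4, -0.6, 0.9]   # hypothetical key vector

score_a = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2, offset 3
score_b = dot(rope(q, 105), rope(k, 102))  # positions 105 and 102, offset 3
print(math.isclose(score_a, score_b, abs_tol=1e-9))  # True
```

The attention score is the same for both pairs because it depends only on the offset between the two positions, not on where the tokens sit in the sequence. This is what makes RoPE a relative position method.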

Why Modern LLMs Use RoPE

RoPE provides several benefits:

  • Better Long Context Understanding
  • Relative Position Awareness
  • Better Generalization
  • Efficient Computation

Many modern LLMs use RoPE, for example:

  • Llama
  • Mistral
  • Qwen
  • DeepSeek
  • GPT-style open models such as GPT-J and GPT-NeoX

RoPE has become a de facto standard positional encoding technique in modern LLMs.

Positional Encoding vs RoPE

Positional Encoding      RoPE
-------------------      ------------------
Added to embeddings      Applied to Q and K
Absolute positions       Relative positions
Limited long context     Better long context
Less scalable            Highly scalable
Original Transformer     Modern LLMs