Tokenization Techniques in Large Language Models (LLMs)


What is tokenization?

Tokenization is the process of breaking text, such as a sentence or a prompt, into smaller units called tokens so that AI and Natural Language Processing (NLP) systems can process human language efficiently.

Example:

Original text:
Explain the concept of tokenization in AI system

After tokenization:
["Explain", "the", "concept", "of", "tokenization", "in", "AI", "system"]

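Why tokenization is important:

  1. It helps the model understand text by turning it into units that can be mapped to numbers.
  2. It makes processing faster, because the model works on a sequence of known tokens instead of raw text.
  3. It keeps the vocabulary manageable.
  4. It is a required first step for most NLP tasks.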

1. Character Tokenization

In character tokenization, each character (including spaces) is treated as a token.

Python Implementation:

text = "I like ML"
tokens = list(text)
print(tokens)

C# Implementation:

using System;
class Program
{
	static void Main()
	{
		string text= "I like ML";
		char[] tokens= text.ToCharArray();
		foreach(char token in tokens)
		{
			Console.WriteLine(token);
		}
	}
}

Output (Python; the C# version prints one character per line):

['I', ' ', 'l', 'i', 'k', 'e', ' ', 'M', 'L']
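Character tokenization involves a trade-off: the vocabulary stays very small, but the token sequence gets long. The following is a minimal sketch of that trade-off; the variable names are illustrative and not taken from any library.

```python
text = "I like ML"

# Character tokenization: every character, including spaces, is a token.
tokens = list(text)

# The vocabulary is the set of distinct characters seen in the text.
vocab = sorted(set(tokens))

print(len(tokens))  # sequence length: one token per character
print(len(vocab))   # vocabulary size: number of distinct characters

# Character tokenization is lossless: joining the tokens restores the text.
restored = "".join(tokens)
print(restored == text)
```

For this short sentence the sequence has 9 tokens drawn from 8 distinct characters, and joining the tokens gives back the original text.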

2. Word Tokenization

Word tokenization splits a sentence into words.

Python Implementation:

text = "I like Machine Learning"
tokens= text.split()
print(tokens)

C# Implementation:

using System;
class Program
{
    static void Main()
    {
        string text="I like Machine Learning";
        string[] tokens=text.Split(' ');
        foreach(string token in tokens)
        {
            Console.WriteLine(token);
        }
    }
}

Output (Python; the C# version prints one word per line):

['I', 'like', 'Machine', 'Learning']
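Whitespace splitting has two limits worth knowing: punctuation stays attached to words, and splitting on a single space character behaves differently from splitting on any whitespace. The sketch below uses a made-up sentence with a double space to show both effects.

```python
text = "Hello, World!  How are you?"  # note the two spaces after "World!"

# split() with no argument splits on any run of whitespace.
whitespace_tokens = text.split()
print(whitespace_tokens)
# ['Hello,', 'World!', 'How', 'are', 'you?']

# split(" ") splits on every single space, so a double space
# produces an empty string token.
single_space_tokens = text.split(" ")
print(single_space_tokens)
# ['Hello,', 'World!', '', 'How', 'are', 'you?']
```

In both cases "Hello," and "you?" keep their punctuation, which is why pattern-based tokenizers are often preferred for real text.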

3. Sentence Tokenization

Sentence tokenization divides a paragraph into individual sentences.

Python Implementation:

import nltk
from nltk.tokenize import sent_tokenize

# Download the Punkt sentence model once; newer NLTK releases may ask for "punkt_tab" instead.
nltk.download("punkt")

text = """I like ML.  Machine Learning is amazing.  LLMs are powerful."""
sentences = sent_tokenize(text)
print(sentences)

C# Implementation: 

using System;
class Program
{
    static void Main()
    {
        string text = "I like ML. Machine Learning is amazing. LLMs are powerful.";

        // Split('.') removes the periods, so add one back to each sentence.
        string[] sentences = text.Split('.');
        foreach (string sentence in sentences)
        {
            if (!string.IsNullOrWhiteSpace(sentence))
            {
                Console.WriteLine(sentence.Trim() + ".");
            }
        }
    }
}

Output:

['I like ML.', 'Machine Learning is amazing.', 'LLMs are powerful.']
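If installing NLTK is not an option, a regular expression gives a rough approximation. This is only a sketch: the helper name split_sentences is hypothetical, and the rule "split after ., ! or ? followed by whitespace" is a simplifying assumption that breaks on abbreviations such as "Dr.".

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

text = "I like ML.  Machine Learning is amazing.  LLMs are powerful."
sentences = split_sentences(text)
print(sentences)
# ['I like ML.', 'Machine Learning is amazing.', 'LLMs are powerful.']

# Limitation: an abbreviation is wrongly treated as a sentence end.
print(split_sentences("Dr. Smith arrived."))
# ['Dr.', 'Smith arrived.']
```

A trained sentence tokenizer exists precisely to handle cases like the second one.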

4. Regular Expression (Regex) Tokenization

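Regular expression (regex) tokenization extracts the tokens that match a pattern. Here the pattern \w+ matches runs of word characters, so punctuation is dropped and only clean word tokens remain.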

Python Implementation:

import re
text= "Hello, World! How are you?"
tokens= re.findall(r"\w+", text)
print(tokens)

C# Implementation:

using System;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        string text="Hello, World! How are you?";
        MatchCollection matches = Regex.Matches(text, @"\w+");
        foreach(Match match in matches)
        {
            Console.WriteLine(match.Value);
        }
    }
}

Output (Python; the C# version prints one token per line):

['Hello', 'World', 'How', 'are', 'you']
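Dropping punctuation is not always wanted, since a question mark or a comma can carry meaning. As a small variation, the pattern below keeps each punctuation mark as its own token; the pattern is one possible choice, not the only one.

```python
import re

text = "Hello, World! How are you?"

# \w+      matches a run of word characters (a word or number)
# [^\w\s]  matches a single character that is neither a word character
#          nor whitespace, i.e. one punctuation mark
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['Hello', ',', 'World', '!', 'How', 'are', 'you', '?']
```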

5. Converting Tokens into Token IDs

LLMs do not process text directly; each token is first converted into a numeric ID by looking it up in a vocabulary.

Python Implementation:

vocab= {"I": 0,"like": 1,"ML": 2}
tokens = ["I","like","ML"]
token_ids= [vocab[token] for token in tokens]
print(token_ids)

C# Implementation: 

using System;
using System.Collections.Generic;
class Program
{
    static void Main()
    {
        Dictionary<string, int> vocab = new Dictionary<string, int>()
        {
            {"I", 0},
            {"like", 1},
            {"ML", 2}
        };

        string[] tokens={"I", "like", "ML"};
        List<int> tokenIds = new List<int>();
        foreach (string token in tokens)
        {
            tokenIds.Add(vocab[token]);
        }
        Console.WriteLine("[" + string.Join(", ", tokenIds) + "]");
    }
}

Output:

[0, 1, 2]
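A direct dictionary lookup fails as soon as a token is missing from the vocabulary. One common way around this is to reserve an ID for unknown tokens. The sketch below assumes a hypothetical "[UNK]" entry at ID 0, so its IDs differ from the example above; build_vocab and encode are illustrative helper names.

```python
def build_vocab(tokens, unk_token="[UNK]"):
    """Assign an ID to each distinct token, reserving ID 0 for unknowns."""
    vocab = {unk_token: 0}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab, unk_token="[UNK]"):
    """Map tokens to IDs, falling back to the unknown-token ID."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]

vocab = build_vocab(["I", "like", "ML"])
print(vocab)  # {'[UNK]': 0, 'I': 1, 'like': 2, 'ML': 3}

# "NLP" is not in the vocabulary, so it maps to the unknown ID 0.
ids = encode(["I", "like", "NLP"], vocab)
print(ids)  # [1, 2, 0]

# Decoding reverses the mapping; the unknown word cannot be recovered.
id_to_token = {token_id: token for token, token_id in vocab.items()}
decoded = [id_to_token[token_id] for token_id in ids]
print(decoded)  # ['I', 'like', '[UNK]']
```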

6. BERT WordPiece Tokenization

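Modern LLMs do not rely on simple word tokenization. Instead they use subword tokenization, which splits rare or unseen words into smaller known pieces. BERT uses one such method, WordPiece, in which a piece that continues a word is prefixed with ##.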

Install for Python:

pip install transformers

Install for C#:

dotnet add package Microsoft.ML.Tokenizers

Python Implementation:

from transformers import BertTokenizer
tokenizer=BertTokenizer.from_pretrained("bert-base-uncased")
tokens=tokenizer.tokenize("unbelievable")
print(tokens)

C# Implementation:

using System;
using Microsoft.ML.Tokenizers;
class Program
{
    static void Main()
    {
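        // "vocab.txt" is the BERT vocabulary file. Method and parameter names can differ
        // between Microsoft.ML.Tokenizers versions, so check the API of the installed version.
        var tokenizer = BertTokenizer.Create("vocab.txt", toLowercase: true);
        var tokens = tokenizer.Tokenize("unbelievable");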
        foreach(var token in tokens)
        {
            Console.WriteLine(token);
        }
    }
}

Output (illustrative; the exact pieces depend on the vocabulary):

['un', '##bel', '##ie', '##vable']
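To see how the ## pieces arise, here is a minimal sketch of the greedy longest-match-first strategy that WordPiece uses at tokenization time. The tiny vocabulary is made up for illustration, so the split differs from what a real BERT vocabulary produces, and wordpiece_tokenize is a hypothetical helper rather than a library function.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary (an assumption, not BERT's real vocabulary).
toy_vocab = {"un", "##believ", "##able", "like"}

print(wordpiece_tokenize("unbelievable", toy_vocab))
# ['un', '##believ', '##able']
print(wordpiece_tokenize("like", toy_vocab))
# ['like']
print(wordpiece_tokenize("xyz", toy_vocab))
# ['[UNK]']
```

Because the match is greedy, the result depends entirely on which pieces the vocabulary contains.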

7. BERT Token IDs

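Calling the tokenizer directly returns token IDs, with the special tokens [CLS] and [SEP] added around the sentence.

Python Implementation: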

from transformers import BertTokenizer
tokenizer=BertTokenizer.from_pretrained("bert-base-uncased")
encoded=tokenizer("I like ML")
print(encoded["input_ids"])

C# Implementation: 

using System;
using Microsoft.ML.Tokenizers;
class Program
{
    static void Main()
    {
        var tokenizer = BertTokenizer.Create("vocab.txt", toLowercase: true);
        var encoding = tokenizer.EncodeToIds("I like ML");
        foreach(var id in encoding)
        {
            Console.WriteLine(id);
        }
    }
}

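Output (illustrative; the IDs between 101 and 102 depend on the vocabulary, so run the code to confirm them):

[101, 1045, 2293, 9932, 102]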

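Where:

101 = [CLS], the special token added at the start of the input
102 = [SEP], the special token added at the end of the input (and between the two sentences of a pair)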
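The way [CLS] and [SEP] wrap the sentence can be sketched without downloading a model. The vocabulary below is a toy: only the IDs 101 and 102 come from the output above, 100 for [UNK] follows the bert-base-uncased convention, and the word IDs 1, 2 and 3 are invented for illustration.

```python
# Toy vocabulary: word IDs 1-3 are hypothetical, not real BERT IDs.
toy_vocab = {"[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "i": 1, "like": 2, "ml": 3}

def encode_with_special_tokens(text, vocab):
    """Lowercase, split on whitespace, map to IDs, then add [CLS] and [SEP]."""
    tokens = text.lower().split()
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

ids = encode_with_special_tokens("I like ML", toy_vocab)
print(ids)  # [101, 1, 2, 3, 102]

# Map the IDs back to tokens to see where the special tokens sit.
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}
print([id_to_token[token_id] for token_id in ids])
# ['[CLS]', 'i', 'like', 'ml', '[SEP]']
```

A real BERT tokenizer applies WordPiece instead of a plain whitespace split, but the placement of the special tokens is the same.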