What is tokenization?
Tokenization is the process of breaking text, such as a sentence or a prompt, into smaller units called tokens so that AI and Natural Language Processing (NLP) systems can process human language more efficiently.
Example:
Original text:
Explain the concept of tokenization in AI system
After tokenization:
["Explain", "the", "concept", "of", "tokenization", "in", "AI", "system"]
Why tokenization is important:
- Helps AI models interpret text as a sequence of discrete units
- Speeds up processing
- Keeps the vocabulary manageable
- Underpins most NLP tasks
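To make the vocabulary management point concrete, here is a minimal sketch that compares character-level and word-level tokens for one sentence. The sample sentence and the variable names are illustrative assumptions, not taken from any library.

```python
# Compare sequence length and vocabulary size for two token granularities.
text = "the cat sat on the mat"

char_tokens = list(text)      # every character, including spaces
word_tokens = text.split()    # split on whitespace

# A vocabulary is the set of distinct tokens the system has to know about.
char_vocab = sorted(set(char_tokens))
word_vocab = sorted(set(word_tokens))

print("characters:", len(char_tokens), "tokens,", len(char_vocab), "distinct")
print("words:     ", len(word_tokens), "tokens,", len(word_vocab), "distinct")
```

On a sentence this small the numbers are close, but the trend matters: character tokens produce long sequences from a small, fixed vocabulary, while word tokens produce short sequences from a vocabulary that keeps growing as more text is seen.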
1. Character Tokenization
In character tokenization, each character, including spaces, is treated as a token.
Python Implementation:
text = "I like ML"
tokens = list(text)
print(tokens)
C# Implementation:
using System;
class Program
{
    static void Main()
    {
        string text = "I like ML";
        // Each character of the string becomes one token.
        char[] tokens = text.ToCharArray();
        foreach (char token in tokens)
        {
            Console.WriteLine(token);
        }
    }
}
Output (Python; the C# version prints one character per line):
['I', ' ', 'l', 'i', 'k', 'e', ' ', 'M', 'L']
2. Word Tokenization
Word tokenization divides a sentence into words.
Python Implementation:
text = "I like Machine Learning"
tokens = text.split()  # split on any run of whitespace
print(tokens)
C# Implementation:
using System;
class Program
{
    static void Main()
    {
        string text = "I like Machine Learning";
        // Split on single spaces.
        string[] tokens = text.Split(' ');
        foreach (string token in tokens)
        {
            Console.WriteLine(token);
        }
    }
}
Output (Python; the C# version prints one word per line):
['I', 'like', 'Machine', 'Learning']
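Whitespace splitting works on the sentence above because it contains no punctuation. The following sketch, using a made-up sentence, shows what happens when punctuation is present and one simple way to clean it up with the standard library.

```python
import string

# Whitespace splitting leaves punctuation attached to the neighboring word.
text = "Hello, world! I like ML."
tokens = text.split()
print(tokens)

# One simple workaround: strip punctuation from both ends of each token.
cleaned = [token.strip(string.punctuation) for token in tokens]
print(cleaned)
```

The first list contains tokens such as "Hello," and "ML." with the punctuation still attached, which a vocabulary would treat as different words from "Hello" and "ML".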
3. Sentence Tokenization
Sentence tokenization divides a paragraph into individual sentences.
Python Implementation:
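import nltk
from nltk.tokenize import sent_tokenize
# Download the Punkt sentence tokenizer data (needed only once).
# Newer NLTK releases may ask for "punkt_tab" instead.
nltk.download("punkt")
text = """I like ML. Machine Learning is amazing. LLMs are powerful."""
sentences = sent_tokenize(text)
print(sentences)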
C# Implementation:
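using System;
class Program
{
    static void Main()
    {
        string text =
            "I like ML. Machine Learning is amazing. LLMs are powerful.";
        // Naive approach: split on periods. This drops the periods and
        // would also split inside abbreviations and decimal numbers.
        string[] sentences = text.Split('.');
        foreach (string sentence in sentences)
        {
            // Skip the empty piece that follows the final period.
            if (!string.IsNullOrWhiteSpace(sentence))
            {
                Console.WriteLine(sentence.Trim());
            }
        }
    }
}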
Output (Python; the C# version prints each sentence on its own line, without the trailing period):
['I like ML.', 'Machine Learning is amazing.', 'LLMs are powerful.']
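If installing NLTK is not an option, a rough sentence splitter can be sketched with the standard `re` module. This is a simplified assumption-based alternative, not what `sent_tokenize` does internally, and the sample text is made up.

```python
import re

text = "I like ML. Machine Learning is amazing! Are LLMs powerful?"

# Split at whitespace that follows '.', '!' or '?'.
# The lookbehind keeps the punctuation attached to its sentence.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())
print(sentences)
```

A rule this simple breaks on abbreviations such as "Dr. Smith", which is the kind of case a trained sentence tokenizer is meant to handle.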
4. Regular Expression (Regex) Tokenization
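Regular expression (regex) tokenization extracts the tokens that match a pattern. With the pattern \w+, only runs of word characters (letters, digits, and underscores) are kept, so punctuation is dropped and the resulting tokens are clean.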
Python Implementation:
import re
text = "Hello, World! How are you?"
tokens = re.findall(r"\w+", text)  # every run of word characters
print(tokens)
C# Implementation:
using System;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        string text = "Hello, World! How are you?";
        // Match every run of word characters.
        MatchCollection matches = Regex.Matches(text, @"\w+");
        foreach (Match match in matches)
        {
            Console.WriteLine(match.Value);
        }
    }
}
Output (Python; the C# version prints one token per line):
['Hello', 'World', 'How', 'are', 'you']
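Dropping punctuation is not always desirable, because a question mark or a comma can carry meaning. As a small variation on the pattern above, the following sketch keeps each punctuation mark as its own token; the pattern is one possible choice, not a standard.

```python
import re

text = "Hello, World! How are you?"

# \w+      matches a run of word characters
# [^\w\s]  matches a single character that is neither a word character nor whitespace
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```

Here the comma, the exclamation mark, and the question mark each appear as separate tokens after the word they followed.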
5. Converting Tokens into Token IDs
LLMs do not process text directly. Each token is first converted into a numeric ID by looking it up in a vocabulary.
Python Implementation:
vocab = {"I": 0, "like": 1, "ML": 2}
tokens = ["I", "like", "ML"]
token_ids = [vocab[token] for token in tokens]  # look up each token's ID
print(token_ids)
C# Implementation:
using System;
using System.Collections.Generic;
class Program
{
    static void Main()
    {
        // The vocabulary maps each known token to its ID.
        Dictionary<string, int> vocab = new Dictionary<string, int>()
        {
            {"I", 0},
            {"like", 1},
            {"ML", 2}
        };
        string[] tokens = {"I", "like", "ML"};
        List<int> tokenIds = new List<int>();
        foreach (string token in tokens)
        {
            tokenIds.Add(vocab[token]);
        }
        Console.WriteLine("[" + string.Join(", ", tokenIds) + "]");
    }
}
Output:
[0, 1, 2]
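The lookup above fails with a KeyError as soon as a token is missing from the vocabulary. The sketch below shows one common way to handle that: build the vocabulary from a small corpus and reserve an ID for unknown tokens. The corpus, the "<unk>" name, and the `encode` and `decode` helpers are hypothetical choices made for illustration.

```python
corpus = ["I like ML", "I like Python"]

# Reserve ID 0 for unknown tokens, then number words in order of first appearance.
vocab = {"<unk>": 0}
for sentence in corpus:
    for token in sentence.split():
        if token not in vocab:
            vocab[token] = len(vocab)

# Reverse mapping, used to turn IDs back into tokens.
id_to_token = {token_id: token for token, token_id in vocab.items()}

def encode(text):
    """Map each whitespace-separated token to its ID, or to the <unk> ID."""
    return [vocab.get(token, vocab["<unk>"]) for token in text.split()]

def decode(ids):
    """Map IDs back to tokens and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)

ids = encode("I like Rust")
print(ids)
print(decode(ids))
```

Because "Rust" never appeared in the corpus, it is encoded as the unknown ID and decodes back to "<unk>", so the original word is lost. Subword tokenization reduces how often this happens.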
6. BERT WordPiece Tokenization
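Modern LLMs do not use simple word tokenization. Instead, they use subword tokenization, which splits rare words into smaller known pieces. BERT uses a subword method called WordPiece, in which a piece that continues a word is prefixed with ##.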
Install for Python:
pip install transformers
Install for C#:
dotnet add package Microsoft.ML.Tokenizers
Python Implementation:
from transformers import BertTokenizer
# Downloads the bert-base-uncased vocabulary on first use.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("unbelievable")
print(tokens)
C# Implementation:
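using System;
using Microsoft.ML.Tokenizers;
class Program
{
    static void Main()
    {
        // "vocab.txt" is the WordPiece vocabulary file of the BERT model.
        // Assumption: API shape of Microsoft.ML.Tokenizers 1.0 or later, where
        // lowercasing is the default. Check the version you have installed.
        var tokenizer = BertTokenizer.Create("vocab.txt");
        // EncodeToTokens returns the tokens together with their IDs.
        var tokens = tokenizer.EncodeToTokens("unbelievable", out _);
        foreach (var token in tokens)
        {
            Console.WriteLine(token.Value);
        }
    }
}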
Output (the exact pieces depend on the vocabulary in use):
['un', '##bel', '##ie', '##vable']
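To show how a word ends up split into such pieces, here is a minimal sketch of greedy longest-match-first splitting, the lookup strategy WordPiece applies at tokenization time. The `wordpiece_tokenize` function and the toy vocabulary are hypothetical and much smaller than a real BERT vocabulary, so the pieces differ from the library output.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, made up for this example.
toy_vocab = {"un", "##believ", "##bel", "##able", "play", "##ing"}

print(wordpiece_tokenize("unbelievable", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary, "unbelievable" becomes "un", "##believ", "##able": the longer piece "##believ" wins over "##bel" because the longest match is tried first. A word with no matching pieces, such as "xyz", falls back to the unknown token.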
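7. BERT Token IDs
Calling the tokenizer on a string returns the token IDs directly, with the special tokens [CLS] and [SEP] added at the start and end.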
Python Implementation:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("I like ML")  # tokenizes and converts to IDs in one call
print(encoded["input_ids"])
C# Implementation:
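using System;
using Microsoft.ML.Tokenizers;
class Program
{
    static void Main()
    {
        // "vocab.txt" is the WordPiece vocabulary file of the BERT model.
        // Assumption: API shape of Microsoft.ML.Tokenizers 1.0 or later, where
        // lowercasing is the default. Check the version you have installed.
        var tokenizer = BertTokenizer.Create("vocab.txt");
        var ids = tokenizer.EncodeToIds("I like ML");
        foreach (var id in ids)
        {
            Console.WriteLine(id);
        }
    }
}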
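Output (the IDs of ordinary words depend on the vocabulary in use):
[101, 1045, 2293, 9932, 102]
Where:
101 = [CLS], the special token that marks the start of the input
102 = [SEP], the special token that marks the end of a segment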
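The way the special tokens wrap the input can be sketched without downloading a model. In the example below, the IDs for [CLS] and [SEP] follow the values listed above, while the word IDs and the `encode_with_special_tokens` helper are made up for illustration and do not match a real BERT vocabulary.

```python
# Toy vocabulary: only the [CLS] and [SEP] IDs mirror the values above;
# the word IDs are invented for this example.
toy_vocab = {"[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "i": 1, "like": 2, "ml": 3}
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}

def encode_with_special_tokens(text, vocab):
    """Lowercase, split on whitespace, look up IDs, then add [CLS] and [SEP]."""
    tokens = text.lower().split()
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

ids = encode_with_special_tokens("I like ML", toy_vocab)
print(ids)
print([id_to_token[i] for i in ids])
```

Mapping the IDs back to tokens makes the structure visible: the sequence starts with [CLS], ends with [SEP], and the word tokens sit in between.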