Understanding Tokens, Context Windows, and CPU vs GPU in Ollama


Introduction

When we choose an AI model, we keep running into the same terms, such as:

  • 8K context window
  • 32K context window
  • 128K context window
  • CPU
  • GPU
  • RAM 
  • VRAM

Beginners are sometimes confused by these terms. In this blog we will look at what tokens and the context window are, and what roles the CPU and GPU play in Ollama.

What are tokens?

We understand human language as words and sentences. An AI model cannot work with raw text directly, so the text is first split into small chunks that the model can process. These chunks are called tokens. A token may be a whole word, part of a word, or a punctuation mark.

Example:

Hello world
This can be divided into 2 tokens (the exact split depends on the model's tokenizer), i.e.
['Hello', 'world']

In simple words, a token is the basic unit of text for an AI model.
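
The idea of splitting text into tokens can be sketched in Python. This is only a toy tokenizer that splits on words and punctuation. Real models use learned subword tokenizers, so their token boundaries and counts will usually differ. The function name `toy_tokenize` is made up for this example.

```python
import re

def toy_tokenize(text):
    """Split text into word and punctuation chunks (a simplification of real tokenizers)."""
    # \w+ matches a run of letters/digits; [^\w\s] matches a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("Hello world")
print(tokens)       # ['Hello', 'world']
print(len(tokens))  # 2
```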

Why are tokens important?

Whether an AI model is reading input or generating output, all of its processing is done in tokens.

The user's prompt is counted as input tokens.
The model's response is counted as output tokens.

The longer the conversation, the more tokens the model consumes.
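
To make the input/output split concrete, the sketch below adds up the tokens used by one exchange. It assumes a response dictionary shaped like the one Ollama's REST API returns for a generate call, where `prompt_eval_count` is the number of prompt tokens and `eval_count` is the number of generated tokens. The sample numbers are invented for illustration, and the helper name `token_usage` is hypothetical.

```python
def token_usage(response):
    """Return (input_tokens, output_tokens, total) from an Ollama-style response dict."""
    # Missing fields are treated as 0 so the helper never raises on a partial response
    input_tokens = response.get("prompt_eval_count", 0)
    output_tokens = response.get("eval_count", 0)
    return input_tokens, output_tokens, input_tokens + output_tokens

# Made-up sample values, not real measurements
sample = {"prompt_eval_count": 26, "eval_count": 290}
print(token_usage(sample))  # (26, 290, 316)
```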

What is a context window?

The context window is the maximum amount of text that an AI model can handle at one time. It defines how much information (the current input, conversation history, documents, etc.) the model can process and keep track of while generating a response, and it is measured in tokens.

Example:

Suppose an AI model has a context window of 8K tokens. This means the model can process approximately 8,000 tokens at a time.

When the conversation reaches this limit, the oldest information is typically dropped to make room for the new.
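
Dropping the oldest information first can be sketched as below. This is a simplified illustration, not how any particular runtime implements it: `trim_history` and `count_words` are hypothetical helpers, and counting one token per word is only a rough stand-in for a real tokenizer.

```python
def trim_history(messages, max_tokens, count_tokens):
    """Drop the oldest messages until the total token count fits in max_tokens."""
    kept = list(messages)  # copy so the caller's list is not modified
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # remove the oldest message first
    return kept

def count_words(message):
    """Rough stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(message.split())

history = ["hello there", "how are you today", "tell me about ollama"]
print(trim_history(history, 8, count_words))
# ['how are you today', 'tell me about ollama']
```

Here the three messages add up to 10 "tokens", which is over the limit of 8, so the oldest message is removed and the remaining two fit exactly.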

Some common context window sizes are:

  • 8K context
  • 32K context
  • 64K context
  • 128K context
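
In Ollama, the context window size can be requested per call through the `num_ctx` option. The sketch below only builds the request body and does not send it. The model name `llama3` and the helper `build_request` are placeholders for illustration, and the largest window that actually works depends on the model and on available memory.

```python
import json

def build_request(model, prompt, context_tokens):
    """Build a request body that asks for a specific context window size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one complete response instead of a stream
        "options": {"num_ctx": context_tokens},  # context window size in tokens
    }

payload = build_request("llama3", "Why is the sky blue?", 8192)
print(json.dumps(payload, indent=2))
```

A body like this could be sent as a POST request to the `/api/generate` endpoint of a locally running Ollama server.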

CPU vs GPU in Ollama

Now a question arises: which one is responsible for running AI models, the CPU or the GPU? Both can run them, but their performance differs.

Can Ollama run on a CPU?

Yes. The CPU (Central Processing Unit) is the main processor of the computer. Even if a system has no dedicated GPU (Graphics Processing Unit), Ollama can still run models on the CPU, but responses will be comparatively slow.

The GPU plays a powerful role in running AI models. GPUs can perform thousands of operations in parallel, which lets a model generate responses much faster.

Note: RAM is the system's main memory, and VRAM is the memory on the GPU.

CPU vs GPU Comparison

Feature             CPU         GPU
Speed               Slower      Faster
AI Performance      Good        Excellent
Cost                Lower       Higher
Power Consumption   Lower       Higher
Large Models        Difficult   Better

Which hardware is best for beginners?

Low-end PC: for models like Phi, Gemma, and Mistral, the recommended hardware is 8GB of RAM.

Mid-range PC: for models like Llama 3 8B, DeepSeek, and Qwen, the recommended hardware is 16GB of RAM.

High-end system: for 70B models, vision models, and advanced reasoning models, the recommended hardware is 32GB or more of RAM and a dedicated GPU.
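
Why larger models need more memory can be shown with some rough arithmetic: the model's weights have to fit in RAM or VRAM, and their size is roughly the number of parameters times the bytes used per parameter. The sketch below is only an estimate under that assumption. It ignores the memory used by the context window and the runtime itself, and the helper name `estimate_model_gb` is made up for this example.

```python
def estimate_model_gb(params_billions, bits_per_weight):
    """Rough size of the model weights in GB (ignores context and runtime overhead)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bytes_per_weight

print(estimate_model_gb(8, 16))  # 16.0 (8B parameters stored as 16-bit numbers)
print(estimate_model_gb(8, 4))   # 4.0  (the same model stored with 4 bits per weight)
```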

Read more about Ollama:

Previous topic: Building your first AI chatbot using Ollama and Python
Next topic: Understanding quantization and Ollama model storage
