What is Quantization?
Quantization is a technique for reducing the size of AI models so that they can run on ordinary computers and laptops. In simple terms, quantization makes an AI model smaller and more lightweight.
AI models contain billions of parameters, which are numerical values normally stored in high precision. This makes the model large, so it needs more RAM and powerful hardware to run. Quantization lowers the precision of these values, which shrinks the model and reduces its hardware requirements.
Example:
Suppose an AI model stores a value like this:
0.987654321
After quantization, the value becomes something like this:
0.99
Some precision is lost, but storage and memory usage drop significantly. That is why a quantized model uses less storage, consumes less RAM, loads faster, and can run on ordinary laptops.
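The rounding above is a simplification. In practice, quantization usually maps each value to a small integer plus a shared scale factor. The following is a minimal sketch of that idea using symmetric 8-bit quantization. The function names `quantize_int8` and `dequantize` are hypothetical, and the formats Ollama models actually ship in are more elaborate (for example, they apply scales to small blocks of values rather than to the whole list).

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.987654321, -0.5, 0.25, 0.0]  # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("stored integers:", q)
for original, approx in zip(weights, restored):
    error = abs(original - approx)
    print(f"{original:+.9f} -> {approx:+.9f} (error {error:.6f})")
```

Each stored integer fits in one byte instead of the four bytes of a 32-bit float, and the restored values differ from the originals by at most half of `scale`.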
Why do we need Quantization?
Original AI models are very large, and running them requires very powerful hardware.
Example:
Original Model
↓
Large Storage Requirement
↓
More RAM Usage
↓
High Hardware Requirement
These models cannot run on every system. Quantization is used to solve this problem.
Common Quantization Types
When you download a model with Ollama or from another source, you may come across terms like:
Q2
Q3
Q4
Q5
Q6
Q8
These represent the quantization level of the model. In simple terms, the level tells you how precisely the model's values are stored: the number after Q is roughly the number of bits used per value. The quantization level largely determines the size of the model, and the difference also shows up in accuracy and response quality.
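To get a feel for how the level affects size, here is a rough back-of-the-envelope sketch. It assumes that a Qn model uses about n bits per parameter and that the original model uses 16 bits. Real files are somewhat larger because they also store scale factors and metadata. The 7-billion-parameter count and the helper name `estimated_size_gb` are illustrative assumptions, not figures for any specific model.

```python
def estimated_size_gb(num_parameters, bits_per_weight):
    """Rough size in gigabytes: parameters x bits, converted to bytes."""
    return num_parameters * bits_per_weight / 8 / 1e9


PARAMS = 7_000_000_000  # hypothetical 7-billion-parameter model

levels = [("FP16", 16), ("Q8", 8), ("Q6", 6), ("Q5", 5),
          ("Q4", 4), ("Q3", 3), ("Q2", 2)]

for label, bits in levels:
    print(f"{label:>4}: ~{estimated_size_gb(PARAMS, bits):.1f} GB")
```

Under these assumptions, going from 16 bits to 4 bits cuts the estimated size to a quarter, which is the main reason a Q4 model can fit on a laptop where the original cannot.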
Q2 Quantization
This is a very aggressive quantization. Q2 models are very small and consume the least RAM, but there can be a noticeable reduction in accuracy.
Q2 Quantization is Suitable for:
- Very low-end systems
- Testing purposes
Q4 Quantization
This is one of the most popular quantization levels. It maintains a good balance between quality and performance, with lower RAM usage and faster inference. Many Ollama models are available in Q4 by default.
Q4 Quantization is Suitable for:
- Most beginners
- Normal laptops
- Everyday usage
Q5 Quantization
It sits between Q4 and Q8. It provides better accuracy than Q4 while keeping hardware requirements manageable.
Q5 Quantization is Suitable for:
- Users who want slightly better quality
- Mid-range systems
Q6 Quantization
Q6 provides higher precision and better response quality, but it consumes more RAM than Q4 and Q5.
Q6 Quantization is Suitable for:
- Advanced users
- Systems with sufficient RAM
Q8 Quantization
It is one of the highest quantization levels in common use. It provides better response quality and more accurate outputs that are closer to the original model. However, Q8 models are large and require more RAM than the other quantization levels.
Q8 Quantization is Suitable for:
- Powerful systems
- Users who prioritize quality over speed
Comparison Table
| Quantization | Model Size | RAM Usage | Speed | Accuracy |
|---|---|---|---|---|
| Q2 | Very Small | Very Low | Very Fast | Low |
| Q4 | Small | Low | Fast | Good |
| Q5 | Medium | Medium | Good | Better |
| Q6 | Larger | Higher | Moderate | Very Good |
| Q8 | Largest | Highest | Slower | Best |
Understanding Ollama Model Storage
Next, we need to understand where a model is stored when we download it. When we run the command
ollama pull llama3
Ollama downloads the model from the internet and stores it on the local system.
Where Does Ollama Store Models?
Windows
C:\Users\<username>\.ollama\models
Linux
~/.ollama/models (or /usr/share/ollama/.ollama/models when Ollama is installed as a system service)
macOS
~/.ollama/models
These locations are Ollama's default storage folders.
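As a small illustration, the sketch below resolves the per-user models folder in a way that works on Windows, Linux, and macOS. The helper name `default_models_dir` is hypothetical. It also checks the `OLLAMA_MODELS` environment variable, which to my knowledge Ollama uses to override the storage location. It does not handle installs that keep models outside the user's home folder.

```python
import os
from pathlib import Path


def default_models_dir():
    """Return the folder where Ollama is expected to keep its models."""
    # Assumption: OLLAMA_MODELS, when set, overrides the default location.
    override = os.environ.get("OLLAMA_MODELS")
    if override:
        return Path(override)
    # Path.home() resolves to C:\Users\<username> on Windows and ~ elsewhere.
    return Path.home() / ".ollama" / "models"


models_dir = default_models_dir()
status = "exists" if models_dir.is_dir() else "not found"
print(f"{models_dir} ({status})")
```

If the folder is reported as not found, either no model has been pulled yet or the installation stores models somewhere else.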
Understanding the Folder Structure
The models folder mainly consists of two important folders:
models
│
├── blobs
└── manifests
What is the Blobs Folder?
The blobs folder stores the actual model data. The files in it hold the contents of the model, such as its weights.
What is the Manifests Folder?
This folder stores information about each model, such as its name, version, configuration, and tags.
Understanding this storage layout is important because the C drive can fill up quickly when several large models are downloaded to one system.
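To see how much space the models are taking, one option is a short script like the one below. It is a sketch that assumes the default `~/.ollama/models` location and the `blobs` and `manifests` subfolders described above. The helper name `folder_size_bytes` is hypothetical.

```python
from pathlib import Path


def folder_size_bytes(folder):
    """Sum the sizes of all regular files under `folder`, recursively."""
    return sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file())


# Assumption: models live in the default per-user location.
models_dir = Path.home() / ".ollama" / "models"

for name in ("blobs", "manifests"):
    sub = models_dir / name
    if sub.is_dir():
        print(f"{name}: {folder_size_bytes(sub) / 1e9:.2f} GB")
    else:
        print(f"{name}: not found at {sub}")
```

Almost all of the space should show up under blobs, since the manifests are only small description files.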
Read more about Ollama:
Previous topic:
Understanding Tokens, Context Window, and CPU vs GPU in Ollama
Next topic:
Understanding Modelfiles in Ollama