Explain the architecture of an ML.NET pipeline.
An ML.NET pipeline is a sequence of data processing and machine learning operations that transform raw data into predictions. It follows a modular architecture where each stage performs a specific task, making it easy to build, train, evaluate, and deploy machine learning models within .NET applications.
High-Level Architecture
Raw Data
│
▼
Data Loading
│
▼
Data Preparation / Transformation
│
▼
Feature Engineering
│
▼
Model Training
│
▼
Model Evaluation
│
▼
Model Persistence
│
▼
Prediction Engine / Batch Prediction
1. Data Loading
The first step is to load data into an ML.NET data structure called IDataView.
IDataView is a tabular, lazily evaluated view over data: rows are read and transformed only when something consumes them, which lets it handle datasets that do not fit in memory.
Example:
using Microsoft.ML;
// Create ML context
var mlContext = new MLContext();
// Load data from CSV file
// (SalesData maps CSV columns to members via [LoadColumn] attributes)
IDataView data = mlContext.Data.LoadFromTextFile<SalesData>(
    path: "sales.csv",
    hasHeader: true,
    separatorChar: ',');
Responsibilities
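- Read data from delimited text files (CSV/TSV), binary files, databases, or in-memory collections (JSON has no dedicated loader; deserialize it into objects first and load those)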
- Define schema
- Enable scalable data processing
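The loading example relies on a SalesData class that the answer never shows. The sketch below is one way it might look; the column order (Price, Quantity, Revenue) and the sample values are assumptions made for illustration, not the layout of a real sales.csv. It also shows the in-memory route via LoadFromEnumerable.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input schema: the column indices are assumed, adjust to the real file.
public class SalesData
{
    [LoadColumn(0)] public float Price { get; set; }
    [LoadColumn(1)] public float Quantity { get; set; }
    [LoadColumn(2)] public float Revenue { get; set; }
}

public static class LoadingSketch
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // In-memory alternative to LoadFromTextFile; the values are made up.
        var rows = new List<SalesData>
        {
            new SalesData { Price = 100f, Quantity = 5f, Revenue = 500f },
            new SalesData { Price = 80f, Quantity = 2f, Revenue = 160f },
        };
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // The schema is inferred from the public members of SalesData.
        foreach (var column in data.Schema)
            Console.WriteLine($"{column.Name}: {column.Type}");
    }
}
```

Numeric members are declared as float because most ML.NET trainers expect Single-typed features and labels.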
2. Data Transformation Layer
Raw data usually cannot be used directly for training.
Transformers clean and convert data into a machine-learning-friendly format.
Common transformations include:
- Missing value replacement
- Normalization
- Text featurization
- One-hot encoding
- Type conversion
Example:
var dataProcessPipeline =
    mlContext.Transforms.ReplaceMissingValues("Sales")
        .Append(mlContext.Transforms.NormalizeMinMax("Sales"));
Architecture Role
Raw Data
│
▼
Transformers
│
▼
Processed Data
Each transformation creates a new IDataView without modifying the original data.
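To make the "new IDataView, original untouched" point concrete, here is a small sketch on made-up in-memory data. The SaleRow type and the output column names SalesClean and SalesScaled are hypothetical, and mean replacement is chosen here only so the effect is visible.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms;

// Hypothetical single-column row type used only for this sketch.
public class SaleRow
{
    public float Sales { get; set; }
}

public static class TransformSketch
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Made-up values; NaN stands in for a missing entry.
        var rows = new[]
        {
            new SaleRow { Sales = 10f },
            new SaleRow { Sales = float.NaN },
            new SaleRow { Sales = 30f },
        };
        IDataView raw = mlContext.Data.LoadFromEnumerable(rows);

        // Write results to new columns so the input column stays visible.
        var estimator = mlContext.Transforms
            .ReplaceMissingValues("SalesClean", "Sales",
                MissingValueReplacingEstimator.ReplacementMode.Mean)
            .Append(mlContext.Transforms.NormalizeMinMax("SalesScaled", "SalesClean"));

        // Fit learns the mean and the min/max; Transform returns a new view.
        ITransformer transformer = estimator.Fit(raw);
        IDataView processed = transformer.Transform(raw);

        // The raw view keeps its schema; the processed view has extra columns.
        Console.WriteLine($"raw columns: {raw.Schema.Count}");
        Console.WriteLine($"processed columns: {processed.Schema.Count}");

        // Rows are only computed here, when the column is enumerated.
        foreach (float value in processed.GetColumn<float>("SalesScaled"))
            Console.WriteLine(value);
    }
}
```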
3. Feature Engineering
Machine learning algorithms operate on numerical feature vectors.
Feature engineering transforms business data into features suitable for training.
Example:
var featurePipeline =
    mlContext.Transforms.Concatenate(
        "Features",
        nameof(SalesData.Price),
        nameof(SalesData.Quantity));
Output:
Price = 100
Quantity = 5
Features = [100, 5]
Common Feature Operations
- Concatenation
- Text embeddings
- Category encoding
- Feature scaling
- Feature selection
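Several of the operations listed above can be chained in one estimator. The following is a rough sketch, assuming a hypothetical ProductRow type with Region and Description columns that do not appear in the original SalesData; it combines category encoding, text featurization, scaling, and concatenation.

```csharp
using System;
using Microsoft.ML;

// Hypothetical row type; Region and Description are assumptions for illustration.
public class ProductRow
{
    public string Region { get; set; }
    public string Description { get; set; }
    public float Price { get; set; }
    public float Quantity { get; set; }
}

public static class FeatureSketch
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Made-up rows.
        var rows = new[]
        {
            new ProductRow { Region = "North", Description = "red cotton shirt", Price = 100f, Quantity = 5f },
            new ProductRow { Region = "South", Description = "blue denim jacket", Price = 250f, Quantity = 2f },
        };
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        IEstimator<ITransformer> featurePipeline =
            // Category encoding: string -> indicator vector.
            mlContext.Transforms.Categorical.OneHotEncoding("RegionEncoded", "Region")
                // Text featurization: free text -> numeric vector.
                .Append(mlContext.Transforms.Text.FeaturizeText("DescriptionFeatures", "Description"))
                // Feature scaling: rescale Price in place.
                .Append(mlContext.Transforms.NormalizeMinMax("Price"))
                // Concatenation: everything into the single Features vector.
                .Append(mlContext.Transforms.Concatenate(
                    "Features", "RegionEncoded", "DescriptionFeatures", "Price", "Quantity"));

        IDataView transformed = featurePipeline.Fit(data).Transform(data);

        // Prints the vector type of the combined Features column.
        Console.WriteLine(transformed.Schema["Features"].Type);
    }
}
```

Concatenate requires all of its inputs to share one item type, which is why every column is turned into float values before the final step.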
4. Training Layer
The trainer learns patterns from historical data.
ML.NET supports:
- Regression
- Classification
- Recommendation
- Clustering
- Anomaly detection
Example:
var trainer =
    mlContext.Regression.Trainers.Sdca(
        labelColumnName: "Revenue",
        featureColumnName: "Features");
Pipeline assembly:
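// Chain cleaning, feature creation, and the trainer into one estimator
var pipeline =
    dataProcessPipeline
        .Append(featurePipeline)
        .Append(trainer);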
Architecture
Features
│
▼
Trainer
│
▼
Trained Model
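Appending the trainer still yields an estimator (an IEstimator<ITransformer>) that only describes the steps. The trained model, an ITransformer, is produced when Fit() is called on that estimator.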
5. Model Fitting
The pipeline is trained using the Fit() method.
var model = pipeline.Fit(trainingData);
What happens internally:
Training Data
│
▼
Transformations
│
▼
Feature Extraction
│
▼
Learning Algorithm
│
▼
Trained Model
Output:
ITransformer model
This object contains:
- Transformation logic
- Learned parameters
- Prediction workflow
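The estimator-versus-transformer distinction is easier to see with the types written out. The sketch below trains on synthetic rows where Revenue is simply Price times Quantity; the data, the SalesData layout, and the row count are assumptions chosen so the example runs without a file.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hypothetical layout of SalesData, which the original answer does not show.
public class SalesData
{
    public float Price { get; set; }
    public float Quantity { get; set; }
    public float Revenue { get; set; }
}

public static class FitSketch
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Synthetic training rows: Revenue = Price * Quantity.
        var rows = Enumerable.Range(1, 50)
            .Select(i => new SalesData
            {
                Price = 10f + i,
                Quantity = i % 5 + 1,
                Revenue = (10f + i) * (i % 5 + 1),
            })
            .ToArray();
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(rows);

        // Before Fit(): an estimator chain that only describes the steps.
        IEstimator<ITransformer> pipeline =
            mlContext.Transforms.Concatenate(
                    "Features", nameof(SalesData.Price), nameof(SalesData.Quantity))
                .Append(mlContext.Regression.Trainers.Sdca(
                    labelColumnName: "Revenue", featureColumnName: "Features"));

        // After Fit(): a transformer holding the fitted steps and learned weights.
        ITransformer model = pipeline.Fit(trainingData);

        // The fitted model adds a Score column when applied to data.
        IDataView scored = model.Transform(trainingData);
        Console.WriteLine(scored.Schema.GetColumnOrNull("Score").HasValue);
    }
}
```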
6. Model Evaluation
Evaluation measures model quality.
Example for regression:
var predictions =
    model.Transform(testData);
var metrics =
    mlContext.Regression.Evaluate(
        predictions,
        labelColumnName: "Revenue");
Metrics may include:
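- R² (coefficient of determination), exposed as metrics.RSquared
- RMSE (root mean squared error), exposed as metrics.RootMeanSquaredError
- MAE (mean absolute error), exposed as metrics.MeanAbsoluteError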
Architecture:
Test Data
│
▼
Model
│
▼
Predictions
│
▼
Metrics
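The evaluation snippet assumes a testData view already exists. One common way to get it is to hold out part of the loaded data with TrainTestSplit, as sketched below. The synthetic rows, the 20% test fraction, and the SalesData layout are assumptions for illustration.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical layout of SalesData, which the original answer does not show.
public class SalesData
{
    public float Price { get; set; }
    public float Quantity { get; set; }
    public float Revenue { get; set; }
}

public static class EvaluationSketch
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Synthetic rows: Revenue = Price * Quantity.
        var rows = Enumerable.Range(1, 200)
            .Select(i => new SalesData
            {
                Price = 10f + i,
                Quantity = i % 5 + 1,
                Revenue = (10f + i) * (i % 5 + 1),
            })
            .ToArray();
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // Hold out 20% of the rows so metrics are computed on unseen data.
        var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);

        var pipeline =
            mlContext.Transforms.Concatenate(
                    "Features", nameof(SalesData.Price), nameof(SalesData.Quantity))
                .Append(mlContext.Regression.Trainers.Sdca(
                    labelColumnName: "Revenue", featureColumnName: "Features"));

        // Train on the training portion only.
        ITransformer model = pipeline.Fit(split.TrainSet);

        // Score the held-out rows and compare Score against the Revenue label.
        IDataView predictions = model.Transform(split.TestSet);
        RegressionMetrics metrics =
            mlContext.Regression.Evaluate(predictions, labelColumnName: "Revenue");

        Console.WriteLine($"R^2:  {metrics.RSquared:F3}");
        Console.WriteLine($"RMSE: {metrics.RootMeanSquaredError:F3}");
        Console.WriteLine($"MAE:  {metrics.MeanAbsoluteError:F3}");
    }
}
```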
7. Model Persistence
After training, the model can be saved for later use.
mlContext.Model.Save(
    model,
    trainingData.Schema,
    "model.zip");
Loading:
var loadedModel =
    mlContext.Model.Load(
        "model.zip",
        out var schema);
Architecture:
Trained Model
│
▼
model.zip
│
▼
Deployment
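A save/load round trip can be checked end to end with a short sketch like the one below. Writing to the temp directory, using a second MLContext to stand in for the deployed process, and the synthetic data are all assumptions for illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical layout of SalesData, which the original answer does not show.
public class SalesData
{
    public float Price { get; set; }
    public float Quantity { get; set; }
    public float Revenue { get; set; }
}

public static class PersistenceSketch
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Synthetic rows: Revenue = Price * Quantity.
        var rows = Enumerable.Range(1, 50)
            .Select(i => new SalesData
            {
                Price = 10f + i,
                Quantity = i % 5 + 1,
                Revenue = (10f + i) * (i % 5 + 1),
            })
            .ToArray();
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(rows);

        var pipeline =
            mlContext.Transforms.Concatenate(
                    "Features", nameof(SalesData.Price), nameof(SalesData.Quantity))
                .Append(mlContext.Regression.Trainers.Sdca(
                    labelColumnName: "Revenue", featureColumnName: "Features"));
        ITransformer model = pipeline.Fit(trainingData);

        // Save the fitted chain together with the schema of its input data.
        string modelPath = Path.Combine(Path.GetTempPath(), "model.zip");
        mlContext.Model.Save(model, trainingData.Schema, modelPath);

        // A fresh context stands in for the process that serves the model.
        var servingContext = new MLContext();
        ITransformer loadedModel =
            servingContext.Model.Load(modelPath, out DataViewSchema inputSchema);

        // The stored schema lists the columns the model expects as input.
        foreach (var column in inputSchema)
            Console.WriteLine($"expects {column.Name}: {column.Type}");

        // Score the same rows with both models to compare their output.
        IDataView servingData = servingContext.Data.LoadFromEnumerable(rows);
        float original = model.Transform(trainingData).GetColumn<float>("Score").First();
        float reloaded = loadedModel.Transform(servingData).GetColumn<float>("Score").First();
        Console.WriteLine($"original: {original}, reloaded: {reloaded}");
    }
}
```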
8. Prediction Layer
The trained model generates predictions on new data.
Single Prediction
var predictionEngine =
    mlContext.Model.CreatePredictionEngine<SalesData, SalesPrediction>(model);
var result =
    predictionEngine.Predict(new SalesData
    {
        Price = 100,
        Quantity = 10
    });
Batch Prediction
var predictions =
    model.Transform(newData);
Architecture:
New Data
│
▼
Transformers
│
▼
Trained Model
│
▼
Prediction
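Both prediction styles need an output class, which the answer refers to as SalesPrediction without defining it. The sketch below is one possible shape: the property name PredictedRevenue, the SalesData layout, and the data are assumptions, while mapping to the Score column reflects where ML.NET regression trainers write their output.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical layout of SalesData, which the original answer does not show.
public class SalesData
{
    public float Price { get; set; }
    public float Quantity { get; set; }
    public float Revenue { get; set; }
}

// Hypothetical output type; regression trainers emit a column named "Score".
public class SalesPrediction
{
    [ColumnName("Score")]
    public float PredictedRevenue { get; set; }
}

public static class PredictionSketch
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Synthetic rows: Revenue = Price * Quantity.
        var rows = Enumerable.Range(1, 50)
            .Select(i => new SalesData
            {
                Price = 10f + i,
                Quantity = i % 5 + 1,
                Revenue = (10f + i) * (i % 5 + 1),
            })
            .ToArray();
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(rows);

        var pipeline =
            mlContext.Transforms.Concatenate(
                    "Features", nameof(SalesData.Price), nameof(SalesData.Quantity))
                .Append(mlContext.Regression.Trainers.Sdca(
                    labelColumnName: "Revenue", featureColumnName: "Features"));
        ITransformer model = pipeline.Fit(trainingData);

        // Single prediction: one object in, one object out.
        using var engine =
            mlContext.Model.CreatePredictionEngine<SalesData, SalesPrediction>(model);
        SalesPrediction single = engine.Predict(new SalesData { Price = 100f, Quantity = 10f });
        Console.WriteLine($"single: {single.PredictedRevenue}");

        // Batch prediction: score a whole IDataView, then read the rows back.
        var newRows = new[]
        {
            new SalesData { Price = 20f, Quantity = 3f },
            new SalesData { Price = 55f, Quantity = 1f },
        };
        IDataView scored = model.Transform(mlContext.Data.LoadFromEnumerable(newRows));
        foreach (SalesPrediction p in
                 mlContext.Data.CreateEnumerable<SalesPrediction>(scored, reuseRowObject: false))
            Console.WriteLine($"batch: {p.PredictedRevenue}");
    }
}
```

PredictionEngine is not thread-safe, so a single instance should not be shared across concurrent requests; batch scoring through Transform does not have that restriction.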
Complete ML.NET Pipeline Architecture
┌──────────────────┐
│ Raw Dataset      │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ IDataView Loader │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Data Cleaning    │
│ Normalization    │
│ Encoding         │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Feature Creation │
│ Features Column  │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ ML Trainer       │
│ (SDCA/FastTree)  │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Trained Model    │
│ ITransformer     │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Evaluation       │
│ Metrics          │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Save Model       │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Predictions      │
└──────────────────┘
Key Architectural Components
| Component | Purpose |
|---|---|
| MLContext | Entry point for all ML.NET operations |
| IDataView | Data pipeline abstraction |
| Transformers | Data preparation and feature engineering |
| Estimators | Define training operations |
| Trainers | Learn patterns from data |
| ITransformer | Trained model representation |
| Evaluators | Measure model performance |
| Prediction Engine | Generates predictions |
The most important architectural concept in ML.NET is that a pipeline combines data transformations and model training into a single reusable workflow, ensuring the exact same preprocessing steps are applied during both training and prediction.