The Data Letter

Build Your First AI Data Pipeline in Python: From Raw CSV to Predictions

A step-by-step scikit-learn tutorial that turns vehicle data into CO2 predictions, and teaches you to read your model’s results honestly

Hodman Murad
May 16, 2026

Most beginner machine learning tutorials end at model.fit() and model.predict(). They skip the part where your preprocessing has to run the same way on training data and on new data, every single time, in the right order, without you remembering seven different steps. A pipeline solves all of that.

In this tutorial, you’ll build a working scikit-learn pipeline that ingests a raw CSV of vehicle data, fills in missing values, scales the numeric features, encodes the categorical text features, trains a Random Forest model to predict CO2 emissions, scores its predictions, and saves everything to disk for reuse. You’ll also learn how to read your evaluation metrics skeptically, so you know whether your model has learned anything worth shipping.

What you’ll need before you start

You’ll need Python 3.10 or newer, a code editor (I’m using VS Code), and four libraries. Open your terminal and run:

pip install pandas numpy scikit-learn joblib

For the dataset, grab the vehicle emissions CSV linked at the bottom of this article. Drop it into a folder called firstaipipeline on your desktop. Open that folder in VS Code, then create a file called aipipelinetutorial.py next to the CSV. This will be your workspace.

Why pipelines matter for any serious ML work

A pipeline is an object that holds all the steps of your data preparation and your model together in a fixed order. When you call fit(), the pipeline runs each preprocessing step on the training data, learns the parameters it needs (such as the mean of each numeric column and the set of categories in each text field), and trains the model. When you call predict() on new data, the pipeline applies the same learned parameters in the same order and then runs the model.

Without a pipeline, every time new data arrives, you have to manually fill in missing values, rescale numeric columns, and convert text columns in exactly the same way you did when you trained the model. Apply different settings to the new data and the model receives inputs it doesn’t recognize, so it returns confused predictions. A pipeline handles both sides of this for you: it applies your training settings to new data automatically, and it keeps your test data out of those settings while you’re training.

There’s a second benefit. Once your pipeline is built, swapping models becomes a one-line change, and that fast iteration is how model quality actually improves.
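To make that concrete, here’s a minimal sketch of the shape you’ll build in this tutorial. The column names and imputation strategies are placeholders to illustrate the idea, not the final configuration:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_prep = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
text_prep = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("numeric", numeric_prep, ["Engine Size"]),  # placeholder column lists
    ("text", text_prep, ["Fuel Type"]),
])
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor()),  # swap this one line to try a different model
])
# pipe.fit(X_train, y_train) learns the imputation values, scaling statistics,
# categories, and the model itself; pipe.predict(X_new) reapplies those learned
# settings to new data before the model ever sees it.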


If you like tutorials like this, I also just published a guide on creating your own AI productivity app on my other Substack, Between Thinking and Doing.


Loading the data and choosing your target

You’ll be working with this vehicle emissions dataset from Kaggle. Download the CSV, drop it in the same folder as your script, and you’re ready to load it.

Open aipipelinetutorial.py and start with the imports:

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

Each import has a job:

  • Pipeline chains the preprocessing steps together

  • ColumnTransformer routes different columns through different pipelines

  • SimpleImputer fills missing values

  • StandardScaler puts numbers on a common scale

  • OneHotEncoder turns text categories into numbers

  • RandomForestRegressor is your model

  • The three metrics at the top tell you how well it performed

Now load the data and take a look:

df = pd.read_csv("vehicle_emission_dataset.csv")
print("Dataset shape:", df.shape)
print(df.head())
df.info()

You should see a table with 10,000 rows and 19 columns. Some columns are numeric, such as engine size, mileage, and speed. Others are text-based, such as vehicle type, fuel type, and road type. A machine learning model can only read numbers, so the text columns need to be encoded before they ever reach the model. This is one of the jobs your pipeline will handle.
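If you haven’t seen one-hot encoding before, here’s a tiny standalone example with made-up fuel types (not the real dataset) that shows how text categories become 0/1 columns. The sparse_output argument assumes scikit-learn 1.2 or newer:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"Fuel Type": ["Petrol", "Diesel", "Electric", "Petrol"]})  # made-up values
encoder = OneHotEncoder(sparse_output=False)
print(encoder.fit_transform(toy))        # one 0/1 column per category
print(encoder.get_feature_names_out())   # which output column is which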

You’re predicting CO2 Emissions. That column has to come out of your feature set, because if the model sees the answer during training, it isn’t learning anything useful.

target = "CO2 Emissions"
leakage_cols = [
    "NOx Emissions",
    "PM2.5 Emissions",
    "VOC Emissions",
    "SO2 Emissions",
    "Emission Level",
]
X = df.drop(columns=[target] + leakage_cols)
y = df[target]

Why drop the other emission columns, too? Because all the emissions in this dataset come from the same source: the engine burning fuel. CO2, NOx, and PM2.5 rise and fall together because they’re produced by the same event. If you let your model see NOx while it’s trying to predict CO2, it’s not really predicting anything. It’s looking up the answer in a different column. Researchers call this data leakage, and it’s one of the easiest ways to fool yourself into thinking you’ve built a great model when you haven’t. Drop these columns now, and you force the model to predict CO2 from the things you’d realistically know about a vehicle before it ever started its engine.
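If you want to see that relationship for yourself, one quick check is to correlate the other emission columns with the target. This assumes those four columns load as numbers in your copy of the CSV; Emission Level is text, so it’s left out of the check:

# Optional sanity check: how closely do the other emission columns track CO2?
# Correlations near 1.0 are the signature of leakage-prone features.
numeric_emissions = ["NOx Emissions", "PM2.5 Emissions", "VOC Emissions", "SO2 Emissions"]
print(df[numeric_emissions + [target]].corr()[target].sort_values(ascending=False))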
