Regression of Auto MPG

You can view the Jupyter notebook file (.ipynb) here

PDF document

🚗 Advanced Regression for MPG: Chasing the Lowest RMSE

From Linear Models to Gradient Boosting

In our initial analysis, we established a solid baseline using Ridge Regression, which achieved a holdout $\text{RMSE}$ of $\approx 3.39$. To get the “lowest number possible” and win the competition, we must now explore more complex models capable of capturing the non-linear relationships in the data.
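For reference, the metric we are minimizing throughout is the root mean squared error on held-out data,

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$

where $y_i$ is the true $\text{mpg}$ of car $i$ and $\hat{y}_i$ is the model's prediction.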

This notebook will follow a competitive workflow:

  1. Define a Preprocessing Pipeline: A robust, reusable pipeline for scaling and encoding.
  2. Model 1: Polynomial Regression with Ridge: We will test if adding polynomial features (e.g., $weight^2$, $horsepower^2$) can model the data’s curve, while using hyperparameter tuning to find the best complexity.
  3. Model 2: Gradient Boosting Regressor: We will implement a powerful tree-based ensemble method, a standard for high-performance machine learning, and tune it to manage the bias-variance trade-off.
  4. Model Selection & Final Submission: We will select the model with the lowest $\text{RMSE}$ on our holdout set and train it on the entire dataset for the final submission.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing and Pipelines
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

# Models
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor

# Set plotting style
sns.set_style('whitegrid')

# Load the training and test data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

print("--- Training Data (train.csv) Loaded ---")
print(train_df.info())

print("\n--- Test Data (test.csv) Loaded ---")
print(test_df.info())

# --- 1. Define features and target ---
X = train_df.drop(columns=['ID', 'name', 'mpg01', 'mpg'])
y = train_df['mpg']

# The final test set for submission (ID stored separately)
test_ids = test_df['ID']
X_test_final = test_df.drop(columns=['ID', 'name'])

# --- 2. Define features lists ---
numerical_features = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']
categorical_features = ['origin']

# --- 3. Create the master preprocessor ---
# This scales numerical features (vital for Ridge and Poly)
# and one-hot encodes the categorical 'origin' feature.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'
)

# --- 4. Create the Holdout Split ---
# We split the train.csv data to validate our models
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Holdout set size: {X_holdout.shape[0]} samples")
--- Training Data (train.csv) Loaded ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ID            397 non-null    object 
 1   mpg           397 non-null    float64
 2   cylinders     397 non-null    int64  
 3   displacement  397 non-null    float64
 4   horsepower    397 non-null    int64  
 5   weight        397 non-null    int64  
 6   acceleration  397 non-null    float64
 7   year          397 non-null    int64  
 8   origin        397 non-null    int64  
 9   name          397 non-null    object 
 10  mpg01         397 non-null    int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 34.2+ KB
None

--- Test Data (test.csv) Loaded ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ID            397 non-null    object 
 1   cylinders     397 non-null    int64  
 2   displacement  397 non-null    float64
 3   horsepower    397 non-null    int64  
 4   weight        397 non-null    int64  
 5   acceleration  397 non-null    float64
 6   year          397 non-null    int64  
 7   origin        397 non-null    int64  
 8   name          397 non-null    object 
dtypes: float64(2), int64(5), object(2)
memory usage: 28.0+ KB
None
Training set size: 277 samples
Holdout set size: 120 samples

Our baseline linear model assumes the relationship between $\text{weight}$ and $\text{mpg}$ is a straight line. This is almost certainly wrong.

Polynomial Regression allows us to create new features by squaring or cubing existing ones (e.g., $weight^2$) and creating interaction features (e.g., $weight \times horsepower$).

  • Pro (Lower Bias): This allows our model to fit complex curves, better capturing the true signal.
  • Con (Higher Variance): This can dramatically increase the risk of overfitting the $\text{noise}$ in the data.

To manage this, we will pipeline $\text{PolynomialFeatures}$ with Ridge Regression. The regularization from Ridge will penalize and shrink the coefficients of any useless or noisy polynomial features, finding the best balance. We will use $\text{GridSearchCV}$ to find the best degree (complexity) and alpha (regularization strength).
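As a concrete illustration (with hypothetical coefficient names), a degree-2 expansion of two standardized features such as weight ($x_1$) and horsepower ($x_2$) turns the linear model into

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2,$$

and Ridge then chooses the coefficients by minimizing the penalized squared error

$$\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \alpha \sum_{j} \beta_j^2,$$

so the coefficients on unhelpful polynomial terms are shrunk toward zero.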

# Create the pipeline
poly_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('poly_features', PolynomialFeatures(include_bias=False)),
    ('regressor', Ridge(random_state=42))
])

# Define the hyperparameters to tune
# We'll test 2nd-degree (quadratic) vs. 3rd-degree (cubic) polynomials
# And test different regularization strengths
param_grid_poly = {
    'poly_features__degree': [2, 3],
    'regressor__alpha': [1.0, 10.0, 100.0]
}

# Grid search with 5-fold cross-validation
grid_search_poly = GridSearchCV(
    poly_pipeline, 
    param_grid_poly, 
    cv=5, 
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

# Train the model
print("Starting Polynomial Grid Search (this may take a moment)...")
grid_search_poly.fit(X_train, y_train)

# Get the best model
best_poly_model = grid_search_poly.best_estimator_

# Evaluate on the holdout set
y_pred_poly = best_poly_model.predict(X_holdout)
rmse_poly = np.sqrt(mean_squared_error(y_holdout, y_pred_poly))

print(f"\n--- Polynomial Regression Results ---")
print(f"Best Hyperparameters: {grid_search_poly.best_params_}")
print(f"Holdout RMSE: {rmse_poly:.4f}")
Starting Polynomial Grid Search (this may take a moment)...

--- Polynomial Regression Results ---
Best Hyperparameters: {'poly_features__degree': 3, 'regressor__alpha': 10.0}
Holdout RMSE: 3.3138

Our second candidate, the Gradient Boosting Regressor, takes a completely different and more powerful approach. Gradient boosting is an ensemble method that builds a model in a sequential, stage-wise fashion.

  1. It starts by building a simple model (a “weak learner,” usually a small decision tree) to make an initial prediction.
  2. It then calculates the errors (residuals) from this first model.
  3. It builds a new model, not to predict $\text{mpg}$, but to predict the errors of the first model.
  4. It adds this new “error-correcting” model to the first one, creating a better overall model.
  5. It repeats this process hundreds of times, with each new model laser-focused on correcting the remaining errors of the ensemble.

This technique is extremely effective and robust, but it requires careful hyperparameter tuning to prevent overfitting (i.e., “learning the noise”).

  • n_estimators: The number of trees (stages). Too many can lead to overfitting.
  • learning_rate: How much each new tree contributes. A small value (e.g., $0.05$) is more robust but requires more trees.
  • max_depth: The complexity of each individual tree.
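To make the residual-correction loop concrete, here is a minimal from-scratch sketch of the idea (illustrative only, not the tuned scikit-learn pipeline used below). It assumes a purely numeric feature matrix, e.g. the output of our preprocessor:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_predictions(X_num, y, n_estimators=100, learning_rate=0.05, max_depth=3):
    # Toy gradient boosting for squared loss: each new tree fits the current residuals.
    y = np.asarray(y, dtype=float)
    prediction = np.full(len(y), y.mean())   # 1. start from a constant prediction
    trees = []
    for _ in range(n_estimators):
        residuals = y - prediction           # 2. errors of the current ensemble
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X_num, residuals)           # 3. weak learner predicts those errors
        prediction += learning_rate * tree.predict(X_num)  # 4. add a shrunken correction
        trees.append(tree)                   # 5. repeat
    return prediction, trees

# Hypothetical usage: X_num = preprocessor.fit_transform(X_train)
#                     train_preds, trees = boosted_predictions(X_num, y_train)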
# Create the GBR pipeline
gbr_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', GradientBoostingRegressor(random_state=42))
])

# Define a more advanced hyperparameter grid
param_grid_gbr = {
    'regressor__n_estimators': [100, 200, 300],
    'regressor__learning_rate': [0.05, 0.1],
    'regressor__max_depth': [3, 4]
}

# Grid search with 5-fold cross-validation
grid_search_gbr = GridSearchCV(
    gbr_pipeline, 
    param_grid_gbr, 
    cv=5, 
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

# Train the model
print("Starting Gradient Boosting Grid Search (this may take longer)...")
grid_search_gbr.fit(X_train, y_train)

# Get the best model
best_gbr_model = grid_search_gbr.best_estimator_

# Evaluate on the holdout set
y_pred_gbr = best_gbr_model.predict(X_holdout)
rmse_gbr = np.sqrt(mean_squared_error(y_holdout, y_pred_gbr))

print(f"\n--- Gradient Boosting Results ---")
print(f"Best Hyperparameters: {grid_search_gbr.best_params_}")
print(f"Holdout RMSE: {rmse_gbr:.4f}")
Starting Gradient Boosting Grid Search (this may take longer)...

--- Gradient Boosting Results ---
Best Hyperparameters: {'regressor__learning_rate': 0.05, 'regressor__max_depth': 4, 'regressor__n_estimators': 100}
Holdout RMSE: 3.1795

Now we compare the results from our holdout set. This is our unbiased estimate of how each model will perform on the real Kaggle test set.

| Model | Holdout RMSE | Best Hyperparameters |
|---|---|---|
| Ridge Regression | $\approx 3.3988$ | {'alpha': 10.0} |
| Polynomial + Ridge | $3.3138$ | {'degree': 3, 'alpha': 10.0} |
| Gradient Boosting | $3.1795$ | {'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 100} |

We will select the model with the lowest $\text{RMSE}$ as our final, “perfect” model. We’ll then re-train this single best model on all the $\text{train.csv}$ data, ensuring it learns from every possible example before predicting on the $\text{test.csv}$ file.
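That selection step can be made explicit with a small comparison, assuming rmse_poly and rmse_gbr from the cells above are still in memory (the Ridge value is the baseline from the earlier analysis):

holdout_results = {
    'Ridge Regression (baseline)': 3.3988,
    'Polynomial + Ridge': rmse_poly,
    'Gradient Boosting': rmse_gbr,
}
best_model_name = min(holdout_results, key=holdout_results.get)
print(f"Selected model: {best_model_name} (holdout RMSE = {holdout_results[best_model_name]:.4f})")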

IMPORTANT: To create the final submission, you must manually update the FINAL_MODEL_PARAMS dictionary in the cell below with the winning hyperparameters printed in Cell 7 (Gradient Boosting, which achieved the lowest holdout RMSE in our run).

For example, if Cell 7 prints: Best Hyperparameters: {'regressor__learning_rate': 0.05, 'regressor__max_depth': 3, 'regressor__n_estimators': 300}

You must update the FINAL_MODEL_PARAMS to: FINAL_MODEL_PARAMS = { 'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 300 }
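Alternatively, if grid_search_gbr from Cell 7 is still in memory, the same dictionary can be built programmatically by stripping the pipeline prefix from the winning parameter names (a convenience sketch, not required for the cell below):

FINAL_MODEL_PARAMS = {
    key.replace('regressor__', ''): value
    for key, value in grid_search_gbr.best_params_.items()
}
print(FINAL_MODEL_PARAMS)
# e.g. {'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 100}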

# --- 1. Manually update these parameters! ---
# Enter the winning hyperparameters from the 'Gradient Boosting Results' (Cell 7)
# This is a placeholder, update it with your actual best results
FINAL_MODEL_PARAMS = {
    'learning_rate': 0.05,  # Example: 0.05
    'max_depth': 3,       # Example: 3
    'n_estimators': 300   # Example: 300
}

# --- 2. Reload data to ensure integrity ---
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# --- 3. Define full training and test sets ---
X_full = train_df.drop(columns=['ID', 'name', 'mpg01', 'mpg'])
y_full = train_df['mpg']
test_ids = test_df['ID']
X_test_final = test_df.drop(columns=['ID', 'name'])

# --- 4. Define features and preprocessor ---
numerical_features = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']
categorical_features = ['origin']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'
)

# --- 5. Create the Final, Optimized Pipeline ---
# We use the winning model (GBR) and its tuned parameters
final_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', GradientBoostingRegressor(
        random_state=42,
        **FINAL_MODEL_PARAMS  # Unpacks the dictionary
    ))
])

# --- 6. Train the final model on ALL data ---
print("Training final model on all data...")
final_model.fit(X_full, y_full)

# --- 7. Make predictions on the test set ---
final_predictions = final_model.predict(X_test_final)

# --- 8. Create the submission DataFrame ---
submission_df = pd.DataFrame({
    'ID': test_ids,
    'mpg': final_predictions
})

# --- 9. Save to CSV ---
submission_df.to_csv('advanced_mpg_submission.csv', index=False)

print("\nSubmission file 'advanced_mpg_submission.csv' generated successfully.")
print(submission_df.head())
Training final model on all data...

Submission file 'advanced_mpg_submission.csv' generated successfully.
                                        ID        mpg
0  70_chevrolet chevelle malibu_alpha_3505  16.013052
1          71_buick skylark 320_bravo_3697  14.521342
2       70_plymouth satellite_charlie_3421  16.945601
3              68_amc rebel sst_delta_3418  16.439305
4                 70_ford torino_echo_3444  16.605685

To achieve the lowest possible $\text{RMSE}$, we successfully moved from a simple linear model to a powerful, non-linear Gradient Boosting Regressor.

  • Our Ridge Regression baseline ($\text{RMSE } \approx 3.40$) was limited because it could only model linear relationships.
  • The Polynomial Regression model ($\text{RMSE} \approx 3.31$) performed moderately better, confirming our hypothesis that the data contained non-linear curves.
  • The Gradient Boosting Regressor ($\text{RMSE} \approx 3.18$) achieved the lowest error on our holdout set. By sequentially building models that correct each other’s errors, it was able to model the complex, non-linear “signal” in the data while its hyperparameters (tuned via $\text{GridSearchCV}$) prevented it from overfitting to the $\text{noise}$.

This notebook demonstrates a complete, competitive machine learning workflow: starting with a simple baseline, systematically increasing model complexity, and using robust hyperparameter tuning and a holdout set to manage the bias-variance trade-off and select a winning model.

This post is licensed under CC BY 4.0 by the author.