This post distills 12 Jupyter notebooks into a single actionable reference. Whether you're building your first regression model or debugging a stubborn overfit, every pattern you need is here — with runnable code, not vague advice.
Master the Pipeline pattern first. Everything else is a plugin.
01 · Foundation
The Universal Pipeline Pattern
Every single model in this guide follows the same skeleton. Sklearn's Pipeline chains preprocessing and modeling steps together so that data transformations are always fit on training data only — never leaking test information during cross-validation.
Raw Data
→
Preprocessor
→
Estimator
→
cross_validate
→
Report Errors
python · universal template
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
# ① Build: chain preprocessing + model
pipeline = Pipeline([
("feature_scaling", StandardScaler()),
("model", YourRegressor())
])
# ② Train with cross-validation
cv_results = cross_validate(
pipeline, X_train, y_train,
cv=cv,
scoring="neg_mean_absolute_error",
return_train_score=True,
return_estimator=True # keep trained models per fold
)
# ③ Extract errors (scores are negative by sklearn convention)
train_error = -1 * cv_results['train_score']
test_error = -1 * cv_results['test_score']
print(f"Train: {train_error.mean():.3f} ± {train_error.std():.3f}")
print(f"Test: {test_error.mean():.3f} ± {test_error.std():.3f}")
# ④ Pick best model from CV folds
best_idx = test_error.argmin()
best_model = cv_results['estimator'][best_idx]
✓ Why Pipeline prevents leakage
Inside each CV fold, pipeline.fit(X_fold_train, y_fold_train) fits the scaler on that fold's training data only. The test fold is transformed with those fitted parameters but never seen during fitting. Without a Pipeline, you'd risk fitting the scaler on all the data, including the test fold, and leaking information.
⚠ The -1 trick
sklearn uses "higher is better" universally. MAE is negative because minimizing error = maximizing negative error. Always multiply by -1 before reporting.
Access pipeline steps with pipeline[-1] (last step / model) or pipeline['step_name'] by name.
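Both access styles can be checked on a toy pipeline. A minimal sketch (the data here is made up for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X = np.arange(10).reshape(-1, 1).astype(float)
y = 2 * X.ravel() + 1  # exact linear relationship

pipe = Pipeline([
    ("feature_scaling", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X, y)

# Integer indexing and name lookup return the same fitted step object
assert pipe[-1] is pipe["model"]
assert pipe[0] is pipe["feature_scaling"]
print(pipe[-1].coef_)
```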
02 · Setup
Data Loading & Splitting
python · data setup
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, ShuffleSplit
# Load
features, labels = fetch_california_housing(as_frame=True, return_X_y=True)
labels *= 100 # scale to $k (optional, for readability)
# Split 1 — hold out test set. Touch it ONLY for final evaluation.
com_train_features, test_features, com_train_labels, test_labels = \
train_test_split(features, labels, random_state=42)
# Split 2 — dev set for quick sanity checks during experiments
train_features, dev_features, train_labels, dev_labels = \
train_test_split(com_train_features, com_train_labels, random_state=42)
# Cross-validation strategy (use com_train_* for all CV)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
🚫 Golden Rule
Never touch test_features during model development. It exists only for the single final report. All tuning, CV, and diagnostics go on com_train_*.
10-Step Dataset Exploration Checklist
| # | Step | Code |
| 1 | Description | print(data.DESCR) |
| 2 | Feature shape | features.shape |
| 3 | Label shape | labels.shape |
| 4 | Feature names | features.columns |
| 5 | Sample rows | data.frame.head() |
| 6 | Dtypes / nulls | data.frame.info() |
| 7 | Statistics | data.frame.describe() |
| 8 | Histograms | data.frame.hist(figsize=(12,10), bins=30) |
| 9 | Pairplot | sns.pairplot(data.frame, hue='Target') |
| 10 | Outlier check | Compare 75% vs max in .describe() |
03 · Sanity Check
Build Baseline Models First, Always
Before any real model, establish a floor. If your Linear Regression can't beat a dummy that always predicts the median, something is fundamentally broken in your pipeline.
python · dummy + permutation baselines
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import permutation_test_score
# Dummy baseline — predicts a constant (median/mean/quantile)
baseline = DummyRegressor(strategy='median')
baseline_cv = cross_validate(baseline, X_train, y_train,
cv=cv, scoring="neg_mean_absolute_error")
baseline_error = -1 * baseline_cv['test_score']
# Permutation baseline — shuffled labels, tests if features matter at all
score, perm_scores, pvalue = permutation_test_score(
my_pipeline, X_train, y_train,
cv=cv, scoring="neg_mean_absolute_error", n_permutations=30)
| Strategy | Predicts | When to use |
| 'median' | Median of training labels | Default go-to baseline |
| 'mean' | Mean of training labels | When using MSE as metric |
| 'constant' | Fixed value you specify | Domain-specific floor |
| 'quantile' | Any percentile | Asymmetric cost functions |
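The first three strategies are easy to verify on a toy label vector (values chosen purely for illustration, since DummyRegressor ignores the features entirely):

```python
import numpy as np
from sklearn.dummy import DummyRegressor

X = np.zeros((5, 1))  # features are ignored by DummyRegressor
y = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

median_pred = DummyRegressor(strategy="median").fit(X, y).predict(X[:1])
mean_pred = DummyRegressor(strategy="mean").fit(X, y).predict(X[:1])
const_pred = DummyRegressor(strategy="constant", constant=7.0).fit(X, y).predict(X[:1])

print(median_pred, mean_pred, const_pred)  # [3.] [4.] [7.]
```

Note how the outlier 10.0 pulls the mean to 4 but leaves the median at 3, which is why 'median' is the more robust default baseline.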
04 · Linear Models
Linear Regression & the Normal Equation
python · linear regression
from sklearn.linear_model import LinearRegression
lin_reg_pipeline = Pipeline([
("feature_scaling", StandardScaler()),
("lin_reg", LinearRegression())
])
cv_results = cross_validate(lin_reg_pipeline, X_train, y_train,
cv=cv, scoring="neg_mean_absolute_error",
return_train_score=True, return_estimator=True)
# Inspect learned weights (access via pipeline step name)
print(cv_results['estimator'][0]['lin_reg'].intercept_) # w_0
print(cv_results['estimator'][0]['lin_reg'].coef_) # w_1 ... w_m
# Weight stability across CV folds (low std = stable model)
coefs = [est['lin_reg'].coef_ for est in cv_results['estimator']]
weights_df = pd.DataFrame(coefs, columns=X_train.columns)
weights_df.plot.box(vert=False) # narrow boxes = stable
# OOF predictions for scatter plot diagnosis
from sklearn.model_selection import cross_val_predict
preds = cross_val_predict(lin_reg_pipeline, X_train, y_train, cv=cv)
plt.scatter(y_train, preds)
plt.plot(y_train, y_train, 'r-') # diagonal = perfect
Typical Diagnosis
LinearRegression on raw features usually underfits: train error ≈ test error, and both are high. The fix is polynomial features or switching to a tree-based model.
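The underfit-then-fix pattern is easy to reproduce on synthetic data. A sketch, assuming a quadratic target (the dataset here is invented for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.1, size=300)  # quadratic target + small noise

# A straight line cannot represent x^2: underfit
linear = LinearRegression().fit(X, y)

# Degree-2 features let the same linear model capture the curvature
poly = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("scaler", StandardScaler()),
    ("lin_reg", LinearRegression()),
]).fit(X, y)

lin_mae = mean_absolute_error(y, linear.predict(X))
poly_mae = mean_absolute_error(y, poly.predict(X))
print(f"linear MAE={lin_mae:.3f}, poly MAE={poly_mae:.3f}")
```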
05 · Iterative Optimization
SGD Regressor — When Data is Large
SGDRegressor minimizes the same loss as LinearRegression but using gradient descent instead of the normal equation. It scales to massive datasets but needs careful configuration.
python · production SGD config
from sklearn.linear_model import SGDRegressor
import numpy as np
max_iter = int(np.ceil(1e6 / X_train.shape[0]))  # rule of thumb: target ~1e6 total weight updates
sgd_pipeline = Pipeline([
("feature_scaling", StandardScaler()),
("sgd", SGDRegressor(
max_iter=max_iter,
early_stopping=True, # halt when val loss stops improving
eta0=1e-3, # tune this first via validation_curve
learning_rate='constant', # most reliable starting point
tol=1e-3,
validation_fraction=0.2,
n_iter_no_change=5,
average=10, # averaged SGD — more stable weights
random_state=42
))
])
Debug divergence: step-by-step training
python · loss curve diagnosis
# If loss explodes, eta0 is too large. If loss barely moves, too small.
sgd_debug = Pipeline([
("scaler", StandardScaler()),
("SGD", SGDRegressor(max_iter=1, tol=None,  # tol=None: never stop on tolerance
warm_start=True, eta0=1e-3))
])
from sklearn.metrics import mean_squared_error
loss = []
for _ in range(100):
    sgd_debug.fit(X_train, y_train)  # warm_start=True: continues from previous weights
    loss.append(mean_squared_error(y_train, sgd_debug.predict(X_train)))
plt.plot(loss)  # ideal: monotonically decreasing to plateau
Find the right learning rate with validation_curve
python · tune eta0
from sklearn.model_selection import validation_curve
eta_range = [1e-5, 1e-4, 1e-3, 1e-2]
train_s, test_s = validation_curve(
sgd_pipeline, X_train, y_train,
param_name="sgd__eta0",  # "step_name__param": must match the "sgd" step in sgd_pipeline
param_range=eta_range, cv=cv,
scoring="neg_mean_squared_error")
plt.plot(eta_range, -test_s.mean(axis=1))
# Pick eta0 at the minimum test error
| learning_rate | Schedule | Best for |
| 'constant' | Fixed eta0 throughout | Most reliable — start here |
| 'adaptive' | Divides eta0 by 5 on plateau | Want automatic decay |
| 'invscaling' | eta = eta0 / t^power_t | Theoretical convergence guarantees |
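The 'invscaling' row can be checked numerically. A sketch of the schedule eta_t = eta0 / t**power_t (power_t defaults to 0.25 in SGDRegressor):

```python
eta0, power_t = 1e-2, 0.25  # SGDRegressor defaults: eta0=0.01, power_t=0.25

def invscaling_eta(t: int) -> float:
    """Learning rate at update step t under the 'invscaling' schedule."""
    return eta0 / t ** power_t

etas = [invscaling_eta(t) for t in (1, 16, 10_000)]
print(etas)  # roughly [0.01, 0.005, 0.001]: slow polynomial decay
```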
06 · Feature Engineering
Polynomial Regression — Adding Curvature
python · polynomial pipeline
from sklearn.preprocessing import PolynomialFeatures
# Full polynomial (includes x1^2, x2^2, x1*x2, ...)
poly_pipeline = Pipeline([
("poly", PolynomialFeatures(degree=2)),
("scaler", StandardScaler()),
("lin_reg", LinearRegression())
])
# Interaction-only (just cross terms: x1*x2, skips x1^2)
poly_int = Pipeline([
("poly", PolynomialFeatures(degree=2, interaction_only=True)),
("scaler", StandardScaler()),
("lin_reg", LinearRegression())
])
# Find optimal degree with validation curve
degree_range = [1, 2, 3, 4, 5]
train_s, test_s = validation_curve(
poly_pipeline, X_train, y_train,
param_name="poly__degree",
param_range=degree_range,
cv=cv, scoring="neg_mean_absolute_error")
# Pick degree where test error is minimum — usually degree=2
⚠ Feature explosion warning
With 8 features: degree=2 → 45 features, degree=3 → 165 features. Always pair with Ridge or Lasso to prevent overfitting at higher degrees.
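These counts can be verified directly via the fitted transformer's n_output_features_ attribute:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.zeros((1, 8))  # 8 input features; the values are irrelevant for the count

counts = {d: PolynomialFeatures(degree=d).fit(X).n_output_features_
          for d in (2, 3)}
print(counts)  # {2: 45, 3: 165} (bias column included by default)
```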
07 · Regularization
Ridge & Lasso — Taming Overfitting
Ridge — L2 (shrinks all weights)
Ridge
from sklearn.linear_model import Ridge, RidgeCV
# Manual alpha
Pipeline([
("poly", PolynomialFeatures(2)),
("scaler", StandardScaler()),
("ridge", Ridge(alpha=0.5))
])
# Auto-tune alpha
alpha_list = np.logspace(-4, 0, 20)
RidgeCV(alphas=alpha_list, store_cv_values=True)  # sklearn >= 1.5: use store_cv_results=True
Lasso — L1 (zeros out features)
Lasso
from sklearn.linear_model import Lasso, LassoCV
# Manual alpha
Pipeline([
("poly", PolynomialFeatures(2)),
("scaler", StandardScaler()),
("lasso", Lasso(alpha=0.01))
])
# Auto-tune alpha
alpha_list = np.logspace(-6, 0, 20)
LassoCV(alphas=alpha_list, cv=cv)
GridSearchCV — tune degree and alpha together
python · joint hyperparameter search
from sklearn.model_selection import GridSearchCV
param_grid = {
    'poly__degree': (1, 2, 3),
    'ridge__alpha': np.logspace(-4, 0, 20)  # 3 × 20 = 60 combinations total
}
ridge_search = GridSearchCV(ridge_pipeline, param_grid,
cv=cv, scoring="neg_mean_absolute_error",
n_jobs=2, return_train_score=True)
ridge_search.fit(X_train, y_train)
print(ridge_search.best_params_) # {'poly__degree': 2, 'ridge__alpha': 0.01}
print(-ridge_search.best_score_) # best CV error
| | Ridge (L2) | Lasso (L1) |
| Penalty term | Σ w² | Σ |w| |
| Effect | Shrinks all weights toward 0 | Zeros out irrelevant weights |
| Feature selection? | No — keeps all features | Yes — built-in sparse model |
| Alpha = 0 | = LinearRegression | = LinearRegression |
| Alpha → ∞ | All weights → 0 | All weights → 0 |
| Use when | All features likely matter | Many features, want selection |
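The feature-selection difference is directly observable: on synthetic data where only 2 of 10 features matter, Lasso zeroes irrelevant weights while Ridge merely shrinks them. A sketch (alpha values chosen for illustration, not tuned):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# 10 features, but only 2 actually drive the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=2,
                       noise=0.0, random_state=42)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print(n_zero_ridge, n_zero_lasso)  # Ridge keeps every weight; Lasso zeroes the noise ones
```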
08 · Instance-Based
KNN Regressor — Nearest Neighbor Averaging
python · KNN with polynomial features
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler # use MinMax for distance-based models
# Basic KNN
knn_pipeline = Pipeline([
('scaler', MinMaxScaler()),
('knn', KNeighborsRegressor(n_neighbors=9))
])
# Find best K via GridSearchCV
params = {'knn__n_neighbors': list(range(1, 31))}
gs = GridSearchCV(knn_pipeline, params, cv=10).fit(X_train, y_train)
# Poly features + KNN (often helps a lot)
poly_knn = Pipeline([
('poly', PolynomialFeatures()),
('scaler', MinMaxScaler()),
('knn', KNeighborsRegressor())
])
params = {'poly__degree': [1,2,3], 'knn__n_neighbors': range(6,12)}
gs = GridSearchCV(poly_knn, params, cv=5).fit(X_train, y_train)
The K tradeoff
Small K → overfitting (too sensitive to individual points). Large K → underfitting (too smooth). Plot RMSE vs K on a validation set and look for the elbow — usually around K=8–12 for tabular data.
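The small-K extreme is easy to demonstrate: with K=1 every training point is its own nearest neighbor, so train error collapses to zero (assuming no duplicate feature rows). A sketch on synthetic data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(0, 0.5, size=200)

# K=1 memorizes the training set; K=50 oversmooths it
mae_k1 = mean_absolute_error(
    y, KNeighborsRegressor(n_neighbors=1).fit(X, y).predict(X))
mae_k50 = mean_absolute_error(
    y, KNeighborsRegressor(n_neighbors=50).fit(X, y).predict(X))
print(mae_k1, mae_k50)  # train error: 0 for K=1, clearly positive for K=50
```

Zero train error at K=1 says nothing about test error, which is exactly why the elbow must be found on a validation set.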
09 · Tree Models
Decision Tree Regressor — Interpretable Splits
python · decision tree + HPT + visualization
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
from sklearn.tree import export_text
dt_pipeline = Pipeline([
("scaler", StandardScaler()),
("dt_reg", DecisionTreeRegressor(max_depth=3, random_state=42))
])
# Tune max_depth + min_samples_split jointly
param_grid = {
'dt_reg__max_depth': range(1, 20),
'dt_reg__min_samples_split': range(2, 8)
}
dt_search = GridSearchCV(dt_pipeline, param_grid, cv=cv,
scoring="neg_mean_absolute_error")
dt_search.fit(X_train, y_train)
print(dt_search.best_params_)
# Retrain in place with the best params found by the search
dt_pipeline.set_params(**dt_search.best_params_).fit(X_train, y_train)
# Visualize — tree diagram
plt.figure(figsize=(28, 8))
tree.plot_tree(dt_pipeline[-1], feature_names=X_train.columns,
rounded=True, filled=True, fontsize=12)
# Visualize — text rules (for copy-paste into reports)
print(export_text(dt_pipeline[-1]))
10 · Ensemble · Bagging
Bagging & Random Forests — Averaging Out Variance
Bagging trains multiple models on random subsets and averages predictions. Random Forest goes further by also randomizing features at each split, creating more decorrelated trees.
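The variance-reduction claim can be sanity-checked on a synthetic benchmark: one deep tree versus a forest of them on the same split. A sketch, using make_friedman1 purely as a convenient nonlinear testbed:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X, y = make_friedman1(n_samples=600, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Single unpruned tree: low bias, high variance
tree_mae = mean_absolute_error(
    y_te, DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr).predict(X_te))

# Averaging 100 decorrelated trees cancels much of that variance
forest_mae = mean_absolute_error(
    y_te, RandomForestRegressor(n_estimators=100,
                                random_state=0).fit(X_tr, y_tr).predict(X_te))
print(f"single tree MAE={tree_mae:.2f}, forest MAE={forest_mae:.2f}")
```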
BaggingRegressor
Bagging
from sklearn.ensemble import BaggingRegressor
BaggingRegressor(
n_estimators=100,
max_features=8,
max_samples=0.5,
bootstrap=False,
bootstrap_features=False
)
RandomForestRegressor ⭐
RandomForest
from sklearn.ensemble import RandomForestRegressor
RandomForestRegressor(
n_estimators=100,
max_features='sqrt',
min_samples_split=2,
bootstrap=True,
n_jobs=-1
)
RandomizedSearchCV for Random Forest
python · random search
from sklearn.model_selection import RandomizedSearchCV
param_distributions = {
    "n_estimators": [1, 2, 5, 10, 20, 50, 100, 200, 500],
    "max_leaf_nodes": [2, 5, 10, 20, 50, 100],
}
search = RandomizedSearchCV(
RandomForestRegressor(n_jobs=2), param_distributions,
n_iter=10, scoring="neg_mean_absolute_error", random_state=0
)
search.fit(X_train, y_train)
final_error = -search.score(X_test, y_test) # only now use test set
VotingRegressor — blend multiple model types
python · voting ensemble
from sklearn.ensemble import VotingRegressor
vr = VotingRegressor(estimators=[
('lr', LinearRegression()),
('dt', DecisionTreeRegressor()),
('knn', KNeighborsRegressor())
])
pipeline = Pipeline([("preprocessor", preprocessor), ("vr", vr)])
11 · Ensemble · Boosting
Boosting — Sequential Error Correction
Unlike bagging (parallel), boosting trains models sequentially where each model corrects the errors of the previous. This typically achieves the highest accuracy on tabular data.
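The "corrects the errors of the previous" idea can be sketched in a few lines: each new shallow tree is trained on the residuals of the running prediction. This is a hand-rolled toy to show the mechanism, not the library implementation:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=400, noise=0.5, random_state=1)

prediction = np.full_like(y, y.mean())  # stage 0: constant prediction
learning_rate = 0.5
train_mse = [mean_squared_error(y, prediction)]

for _ in range(20):
    residuals = y - prediction  # what the ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=2, random_state=1).fit(X, residuals)
    prediction = prediction + learning_rate * stump.predict(X)
    train_mse.append(mean_squared_error(y, prediction))

print(train_mse[0], train_mse[-1])  # training MSE drops stage after stage
```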
python · all three boosting models
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor # pip install xgboost
# AdaBoost — reweights hard examples
AdaBoostRegressor(n_estimators=100, learning_rate=0.1, loss='linear')
# GradientBoosting — fits residuals
GradientBoostingRegressor(n_estimators=100, max_depth=5, learning_rate=0.1)
# XGBoost — regularized GBM, usually wins
XGBRegressor(
objective='reg:squarederror',
max_depth=5,
alpha=10, # L1 regularization
n_estimators=2000,
learning_rate=0.1,
colsample_bytree=1
)
Reusable helper for comparing all models
python · train_regressor helper
def train_regressor(estimator, X_train, y_train, cv, name):
    cv_results = cross_validate(
        estimator, X_train, y_train, cv=cv,
        scoring="neg_mean_absolute_error", return_train_score=True)
    train_err = -1 * cv_results['train_score']
    test_err = -1 * cv_results['test_score']
    print(f"[{name}]")
    print(f"  Train: {train_err.mean():.3f}k ± {train_err.std():.3f}k")
    print(f"  Test:  {test_err.mean():.3f}k ± {test_err.std():.3f}k")
# Compare all boosters at once
for name, model in [
    ("AdaBoost", AdaBoostRegressor()),
    ("GradientBoost", GradientBoostingRegressor()),
    ("XGBoost", XGBRegressor(objective='reg:squarederror')),
]:
    train_regressor(model, X_train, y_train, cv, name)
12 · Neural Network
MLP Regressor — Sklearn Neural Network
python · MLPRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_percentage_error
pipe = Pipeline([
('scaler', StandardScaler()),
('regressor', MLPRegressor(
hidden_layer_sizes=(32, 32, 32), # 3 hidden layers, 32 neurons each
activation='relu',
solver='adam',
max_iter=500,
random_state=42
))
])
cv_results = cross_validate(pipe, X_train, y_train, cv=cv,
scoring="neg_mean_absolute_percentage_error",
return_train_score=True)
# Scatter plot: predicted vs actual
pipe.fit(X_train, y_train)
plt.plot(y_test, pipe.predict(X_test), 'b*')
plt.plot(y_test, y_test, 'g-') # perfect prediction line
Key hyperparameters
hidden_layer_sizes (architecture), alpha (L2 regularization strength), learning_rate_init, activation (relu/tanh/logistic). For serious neural networks, use PyTorch — MLPRegressor is great for quick experimentation.
13 · Optimization
Hyperparameter Tuning — The Right Tool for Each Job
GridSearchCV — exhaustive
Grid Search
from sklearn.model_selection import GridSearchCV
param_grid = {
'poly__degree': [1, 2, 3],
'ridge__alpha': np.logspace(-4, 0, 20)
}
search = GridSearchCV(
pipeline, param_grid,
cv=cv, n_jobs=2,
scoring="neg_mean_absolute_error"
)
search.fit(X_train, y_train)
print(search.best_params_)
print(-search.best_score_)
RandomizedSearchCV — efficient
Random Search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, uniform
param_dist = {
    'poly__degree': [1, 2, 3],         # "step__param" format, assuming steps
    'sgd__eta0': loguniform(1e-5, 1),  # named "poly" and "sgd" in the pipeline
    'sgd__l1_ratio': uniform(0, 1)
}
search = RandomizedSearchCV(
pipeline, param_dist,
n_iter=10, cv=cv
)
search.fit(X_train, y_train)
Extract results from any search object
python · result extraction
best_idx = search.best_index_
train_err = -1 * search.cv_results_['mean_train_score'][best_idx]
test_err = -1 * search.cv_results_['mean_test_score'][best_idx]
std_train = search.cv_results_['std_train_score'][best_idx]
std_test = search.cv_results_['std_test_score'][best_idx]
best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
final_err = -search.score(X_test, y_test) # call this ONCE, at the very end
| Method | Best for | Notes |
| GridSearchCV | Small, discrete param grids | Exhaustive, all combinations |
| RandomizedSearchCV | Large / continuous spaces | Budget: set n_iter |
| RidgeCV / LassoCV | Only tuning alpha for linear models | Fastest — built-in CV |
| validation_curve | Understanding a single param's effect | Best for visualization and debugging |
14 · Evaluation
Cross Validation — Three Levels of Detail
python · choose your CV API
from sklearn.model_selection import cross_val_score, cross_validate, cross_val_predict
# Level 1: just the test scores
scores = cross_val_score(pipeline, X_train, y_train,
cv=cv, scoring='neg_mean_squared_error')
mse = -scores
# Level 2: test + train scores + estimators (use this most often)
results = cross_validate(pipeline, X_train, y_train, cv=cv,
scoring="neg_mean_absolute_error",
return_train_score=True,
return_estimator=True)
# Level 3: OOF predictions for scatter plots (no leakage)
oof_preds = cross_val_predict(pipeline, X_train, y_train)
plt.scatter(y_train, oof_preds) # residuals / actual vs predicted
| API | Returns | When to use |
| cross_val_score | Array of test scores per fold | Quick performance check |
| cross_validate | Dict: scores + estimators + times | Full diagnostic info |
| cross_val_predict | OOF predictions (no train leakage) | Scatter plots, residual analysis |
Mixed data preprocessing with ColumnTransformer
python · for datasets with numeric + categorical columns
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
numeric_transformer = Pipeline([
("imputer", SimpleImputer(missing_values=-1, strategy="mean")),
("scaler", StandardScaler())
])
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),  # don't crash on unseen categories
    ("ord", OrdinalEncoder(), ordinal_cols),
])
# Plug into any model pipeline
Pipeline([("preprocessor", preprocessor), ("model", RandomForestRegressor())])
15 · Metrics
All Regression Metrics — In One Place
python · all metrics
from sklearn.metrics import (
mean_absolute_error, # MAE – interpretable in target units
mean_squared_error, # MSE – penalizes large errors more
mean_absolute_percentage_error, # MAPE – scale-free percentage
r2_score # R² – fraction of variance explained
)
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred, squared=False)  # newer sklearn: root_mean_squared_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
| Metric | Scoring string | Best value | Sensitive to outliers |
| MAE | neg_mean_absolute_error | 0 | No |
| MSE | neg_mean_squared_error | 0 | Yes |
| RMSE | neg_root_mean_squared_error | 0 | Yes |
| MAPE | neg_mean_absolute_percentage_error | 0 | No |
| R² | r2 | 1.0 | Medium |
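The metrics and their relationships can be checked by hand on a tiny example with absolute errors 1, 0, 2:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]  # absolute errors: 1, 0, 2

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 4/2)... = (1+0+2)/3 = 1.0
mse = mean_squared_error(y_true, y_pred)   # (1+0+4)/3 ≈ 1.667
rmse = mse ** 0.5                          # ≈ 1.291; RMSE ≥ MAE always
r2 = r2_score(y_true, y_pred)              # 1 - SSres/SStot = 1 - 5/8 = 0.375
print(mae, mse, rmse, r2)
```

RMSE exceeding MAE on the same predictions is the "sensitive to outliers" row in action: squaring weights the error of 2 more heavily than the error of 1.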
16 · Diagnosis
Model Diagnosis — Reading the Signals
| What you see | Diagnosis | Fix |
| Train error ≈ Test error, both high | Underfitting | Add poly features, use more expressive model, reduce regularization |
| Train error low, Test error much higher | Overfitting | Add Ridge/Lasso, reduce tree depth, use ensemble, add data |
| Train error ≈ Test error, both low | Good fit | Ship it — then try XGBoost for marginal gains |
| High std in CV scores | Unstable | More CV folds, more data, ensemble methods |
Learning curves (the gold standard diagnostic)
python · learning curve
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
pipeline, X_train, y_train,
cv=cv, scoring='neg_mean_squared_error',
train_sizes=np.linspace(0.2, 1.0, 10),
return_times=True, n_jobs=-1)
plt.plot(train_sizes, -train_scores.mean(axis=1), 'r-o', label='Train error')
plt.plot(train_sizes, -test_scores.mean(axis=1), 'g-o', label='Val error')
plt.legend(); plt.xlabel('Training samples'); plt.ylabel('MSE')
Which model to reach for first?
| Situation | Start here |
| Small dataset, linear relationships | LinearRegression → Ridge |
| Need human-readable decision rules | DecisionTreeRegressor (shallow) |
| Large dataset, complex patterns | RandomForest → XGBoost |
| Kaggle / competition / best accuracy | XGBoost with RandomizedSearchCV |
| Mixed numeric + categorical features | Any model + ColumnTransformer |
| Very large N, can't fit in memory | SGDRegressor |
Always build a dummy baseline. Always use Pipeline. Always report train and test error together.