This post distills 12 Jupyter notebooks into a single actionable reference. Whether you're building your first regression model or debugging a stubborn overfit, every pattern you need is here — with runnable code, not vague advice.
Master the Pipeline pattern first. Everything else is a plugin.
01 · Foundation
The Universal Pipeline Pattern
Every single model in this guide follows the same skeleton. Sklearn's Pipeline chains preprocessing and modeling steps together so that data transformations are always fit on training data only — never leaking test information during cross-validation.
Raw Data
→
Preprocessor
→
Estimator
→
cross_validate
→
Report Errors
python · universal template
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
# ① Build: chain preprocessing + model
pipeline = Pipeline([
("feature_scaling", StandardScaler()),
("model", YourRegressor())
])
# ② Train with cross-validation
cv_results = cross_validate(
pipeline, X_train, y_train,
cv=cv,
scoring="neg_mean_absolute_error",
return_train_score=True,
return_estimator=True # keep trained models per fold
)
# ③ Extract errors (scores are negative by sklearn convention)
train_error = -1 * cv_results['train_score']
test_error = -1 * cv_results['test_score']
print(f"Train: {train_error.mean():.3f} ± {train_error.std():.3f}")
print(f"Test: {test_error.mean():.3f} ± {test_error.std():.3f}")
# ④ Pick best model from CV folds
best_idx = test_error.argmin()
best_model = cv_results['estimator'][best_idx]
✓ Why Pipeline prevents leakage
Inside each CV fold, pipeline.fit(X_fold_train, y_fold_train) fits the scaler on that fold's training data only. The test fold is transformed with those fitted parameters but never seen during fitting. Without a Pipeline, you'd risk fitting the scaler on all the data, including the test fold, and leaking information.
⚠ The -1 trick
sklearn uses "higher is better" universally. MAE is negative because minimizing error = maximizing negative error. Always multiply by -1 before reporting.
Access pipeline steps with pipeline[-1] (last step / model) or pipeline['step_name'] by name.
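Both access styles can be checked on a toy pipeline. A minimal sketch (the data here is made up for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X = np.arange(10).reshape(-1, 1).astype(float)
y = 2 * X.ravel() + 1  # exact linear relationship

pipe = Pipeline([
    ("feature_scaling", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X, y)

# Integer indexing and name lookup return the same fitted step object
assert pipe[-1] is pipe["model"]
assert pipe[0] is pipe["feature_scaling"]
print(pipe[-1].coef_)
```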
02 · Setup
Data Loading & Splitting
python · data setup
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, ShuffleSplit
# Load
features, labels = fetch_california_housing(as_frame=True, return_X_y=True)
labels *= 100 # scale to $k (optional, for readability)
# Split 1 — hold out test set. Touch it ONLY for final evaluation.
com_train_features, test_features, com_train_labels, test_labels = \
train_test_split(features, labels, random_state=42)
# Split 2 — dev set for quick sanity checks during experiments
train_features, dev_features, train_labels, dev_labels = \
train_test_split(com_train_features, com_train_labels, random_state=42)
# Cross-validation strategy (use com_train_* for all CV)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
🚫 Golden Rule
Never touch test_features during model development. It exists only for the single final report. All tuning, CV, and diagnostics go on com_train_*.
10-Step Dataset Exploration Checklist
| # | Step | Code |
| 1 | Description | print(data.DESCR) |
| 2 | Feature shape | features.shape |
| 3 | Label shape | labels.shape |
| 4 | Feature names | features.columns |
| 5 | Sample rows | data.frame.head() |
| 6 | Dtypes / nulls | data.frame.info() |
| 7 | Statistics | data.frame.describe() |
| 8 | Histograms | data.frame.hist(figsize=(12,10), bins=30) |
| 9 | Pairplot | sns.pairplot(data.frame, hue='Target') |
| 10 | Outlier check | Compare 75% vs max in .describe() |
03 · Sanity Check
Build Baseline Models First, Always
Before any real model, establish a floor. If your Linear Regression can't beat a dummy that always predicts the median, something is fundamentally broken in your pipeline.
python · dummy + permutation baselines
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import permutation_test_score
# Dummy baseline — predicts a constant (median/mean/quantile)
baseline = DummyRegressor(strategy='median')
baseline_cv = cross_validate(baseline, X_train, y_train,
cv=cv, scoring="neg_mean_absolute_error")
baseline_error = -1 * baseline_cv['test_score']
# Permutation baseline — shuffled labels, tests if features matter at all
score, perm_scores, pvalue = permutation_test_score(
my_pipeline, X_train, y_train,
cv=cv, scoring="neg_mean_absolute_error", n_permutations=30)
| Strategy | Predicts | When to use |
| 'median' | Median of training labels | Default go-to baseline |
| 'mean' | Mean of training labels | When using MSE as metric |
| 'constant' | Fixed value you specify | Domain-specific floor |
| 'quantile' | Any percentile | Asymmetric cost functions |
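The first three strategies are easy to verify on a toy label vector (values chosen purely for illustration, since DummyRegressor ignores the features entirely):

```python
import numpy as np
from sklearn.dummy import DummyRegressor

X = np.zeros((5, 1))  # features are ignored by DummyRegressor
y = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

median_pred = DummyRegressor(strategy="median").fit(X, y).predict(X[:1])
mean_pred = DummyRegressor(strategy="mean").fit(X, y).predict(X[:1])
const_pred = DummyRegressor(strategy="constant", constant=7.0).fit(X, y).predict(X[:1])

print(median_pred, mean_pred, const_pred)  # [3.] [4.] [7.]
```

Note how the outlier 10.0 pulls the mean to 4 but leaves the median at 3, which is why 'median' is the more robust default baseline.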
04 · Linear Models
Linear Regression & the Normal Equation
python · linear regression
from sklearn.linear_model import LinearRegression
lin_reg_pipeline = Pipeline([
("feature_scaling", StandardScaler()),
("lin_reg", LinearRegression())
])
cv_results = cross_validate(lin_reg_pipeline, X_train, y_train,
cv=cv, scoring="neg_mean_absolute_error",
return_train_score=True, return_estimator=True)
# Inspect learned weights (access via pipeline step name)
print(cv_results['estimator'][0]['lin_reg'].intercept_) # w_0
print(cv_results['estimator'][0]['lin_reg'].coef_) # w_1 ... w_m
# Weight stability across CV folds (low std = stable model)
coefs = [est['lin_reg'].coef_ for est in cv_results['estimator']]
weights_df = pd.DataFrame(coefs, columns=X_train.columns)
weights_df.plot.box(vert=False) # narrow boxes = stable
# OOF predictions for scatter plot diagnosis
from sklearn.model_selection import cross_val_predict
preds = cross_val_predict(lin_reg_pipeline, X_train, y_train, cv=cv)
plt.scatter(y_train, preds)
plt.plot(y_train, y_train, 'r-') # diagonal = perfect
Typical Diagnosis
LinearRegression on raw features usually underfits: train error ≈ test error, and both are high. The fix is polynomial features or switching to a tree-based model.
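The underfit-then-fix pattern is easy to reproduce on synthetic data. A sketch, assuming a quadratic target (the dataset here is invented for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.1, size=300)  # quadratic target + small noise

# A straight line cannot represent x^2: underfit
linear = LinearRegression().fit(X, y)

# Degree-2 features let the same linear model capture the curvature
poly = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("scaler", StandardScaler()),
    ("lin_reg", LinearRegression()),
]).fit(X, y)

lin_mae = mean_absolute_error(y, linear.predict(X))
poly_mae = mean_absolute_error(y, poly.predict(X))
print(f"linear MAE={lin_mae:.3f}, poly MAE={poly_mae:.3f}")
```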
05 · Iterative Optimization
SGD Regressor — When Data is Large
SGDRegressor minimizes the same loss as LinearRegression but using gradient descent instead of the normal equation. It scales to massive datasets but needs careful configuration.
python · production SGD config
from sklearn.linear_model import SGDRegressor
import numpy as np
max_iter = int(np.ceil(1e6 / X_train.shape[0]))  # rule of thumb: target ~1e6 total weight updates
sgd_pipeline = Pipeline([
("feature_scaling", StandardScaler()),
("sgd", SGDRegressor(
max_iter=max_iter,
early_stopping=True, # halt when val loss stops improving
eta0=1e-3, # tune this first via validation_curve
learning_rate='constant', # most reliable starting point
tol=1e-3,
validation_fraction=0.2,
n_iter_no_change=5,
average=10, # averaged SGD — more stable weights
random_state=42
))
])
Debug divergence: step-by-step training
python · loss curve diagnosis
# If loss explodes, eta0 is too large. If loss barely moves, too small.
sgd_debug = Pipeline([
("scaler", StandardScaler()),
("SGD", SGDRegressor(max_iter=1, tol=None,  # tol=None: never stop on tolerance
warm_start=True, eta0=1e-3))
])
from sklearn.metrics import mean_squared_error
loss = []
for _ in range(100):
    sgd_debug.fit(X_train, y_train)  # warm_start=True: continues from previous weights
    loss.append(mean_squared_error(y_train, sgd_debug.predict(X_train)))
plt.plot(loss)  # ideal: monotonically decreasing to plateau
Find the right learning rate with validation_curve
python · tune eta0
from sklearn.model_selection import validation_curve
eta_range = [1e-5, 1e-4, 1e-3, 1e-2]
train_s, test_s = validation_curve(
sgd_pipeline, X_train, y_train,
param_name="sgd__eta0",  # "step_name__param": must match the "sgd" step in sgd_pipeline
param_range=eta_range, cv=cv,
scoring="neg_mean_squared_error")
plt.plot(eta_range, -test_s.mean(axis=1))
# Pick eta0 at the minimum test error
| learning_rate | Schedule | Best for |
| 'constant' | Fixed eta0 throughout | Most reliable — start here |
| 'adaptive' | Divides eta0 by 5 on plateau | Want automatic decay |
| 'invscaling' | eta = eta0 / t^power_t | Theoretical convergence guarantees |
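The 'invscaling' row can be checked numerically. A sketch of the schedule eta_t = eta0 / t**power_t (power_t defaults to 0.25 in SGDRegressor):

```python
eta0, power_t = 1e-2, 0.25  # SGDRegressor defaults: eta0=0.01, power_t=0.25

def invscaling_eta(t: int) -> float:
    """Learning rate at update step t under the 'invscaling' schedule."""
    return eta0 / t ** power_t

etas = [invscaling_eta(t) for t in (1, 16, 10_000)]
print(etas)  # roughly [0.01, 0.005, 0.001]: slow polynomial decay
```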
06 · Feature Engineering
Polynomial Regression — Adding Curvature
python · polynomial pipeline
from sklearn.preprocessing import PolynomialFeatures
# Full polynomial (includes x1^2, x2^2, x1*x2, ...)
poly_pipeline = Pipeline([
("poly", PolynomialFeatures(degree=2)),
("scaler", StandardScaler()),
("lin_reg", LinearRegression())
])
# Interaction-only (just cross terms: x1*x2, skips x1^2)
poly_int = Pipeline([
("poly", PolynomialFeatures(degree=2, interaction_only=True)),
("scaler", StandardScaler()),
("lin_reg", LinearRegression())
])
# Find optimal degree with validation curve
degree_range = [1, 2, 3, 4, 5]
train_s, test_s = validation_curve(
poly_pipeline, X_train, y_train,
param_name="poly__degree",
param_range=degree_range,
cv=cv, scoring="neg_mean_absolute_error")
# Pick degree where test error is minimum — usually degree=2
⚠ Feature explosion warning
With 8 features: degree=2 → 45 features, degree=3 → 165 features. Always pair with Ridge or Lasso to prevent overfitting at higher degrees.
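These counts can be verified directly via the fitted transformer's n_output_features_ attribute:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.zeros((1, 8))  # 8 input features; the values are irrelevant for the count

counts = {d: PolynomialFeatures(degree=d).fit(X).n_output_features_
          for d in (2, 3)}
print(counts)  # {2: 45, 3: 165} (bias column included by default)
```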
07 · Regularization
Ridge & Lasso — Taming Overfitting
Ridge — L2 (shrinks all weights)
Ridge
from sklearn.linear_model import Ridge, RidgeCV
# Manual alpha
Pipeline([
("poly", PolynomialFeatures(2)),
("scaler", StandardScaler()),
("ridge", Ridge(alpha=0.5))
])
# Auto-tune alpha
alpha_list = np.logspace(-4, 0, 20)
RidgeCV(alphas=alpha_list, store_cv_values=True)  # sklearn >= 1.5: use store_cv_results=True
Lasso — L1 (zeros out features)
Lasso
from sklearn.linear_model import Lasso, LassoCV
# Manual alpha
Pipeline([
("poly", PolynomialFeatures(2)),
("scaler", StandardScaler()),
("lasso", Lasso(alpha=0.01))
])
# Auto-tune alpha
alpha_list = np.logspace(-6, 0, 20)
LassoCV(alphas=alpha_list, cv=cv)
GridSearchCV — tune degree and alpha together
python · joint hyperparameter search
from sklearn.model_selection import GridSearchCV
param_grid = {
    'poly__degree': (1, 2, 3),
    'ridge__alpha': np.logspace(-4, 0, 20)  # 3 × 20 = 60 combinations total
}
ridge_search = GridSearchCV(ridge_pipeline, param_grid,
cv=cv, scoring="neg_mean_absolute_error",
n_jobs=2, return_train_score=True)
ridge_search.fit(X_train, y_train)
print(ridge_search.best_params_) # {'poly__degree': 2, 'ridge__alpha': 0.01}
print(-ridge_search.best_score_) # best CV error
| | Ridge (L2) | Lasso (L1) |
| Penalty term | Σ w² | Σ |w| |
| Effect | Shrinks all weights toward 0 | Zeros out irrelevant weights |
| Feature selection? | No — keeps all features | Yes — built-in sparse model |
| Alpha = 0 | = LinearRegression | = LinearRegression |
| Alpha → ∞ | All weights → 0 | All weights → 0 |
| Use when | All features likely matter | Many features, want selection |
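The feature-selection difference is directly observable: on synthetic data where only 2 of 10 features matter, Lasso zeroes irrelevant weights while Ridge merely shrinks them. A sketch (alpha values chosen for illustration, not tuned):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# 10 features, but only 2 actually drive the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=2,
                       noise=0.0, random_state=42)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print(n_zero_ridge, n_zero_lasso)  # Ridge keeps every weight; Lasso zeroes the noise ones
```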
08 · Instance-Based
KNN Regressor — Nearest Neighbor Averaging
python · KNN with polynomial features
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler # use MinMax for distance-based models
# Basic KNN
knn_pipeline = Pipeline([
('scaler', MinMaxScaler()),
('knn', KNeighborsRegressor(n_neighbors=9))
])
# Find best K via GridSearchCV
params = {'knn__n_neighbors': list(range(1, 31))}
gs = GridSearchCV(knn_pipeline, params, cv=10).fit(X_train, y_train)
# Poly features + KNN (often helps a lot)
poly_knn = Pipeline([
('poly', PolynomialFeatures()),
('scaler', MinMaxScaler()),
('knn', KNeighborsRegressor())
])
params = {'poly__degree': [1,2,3], 'knn__n_neighbors': range(6,12)}
gs = GridSearchCV(poly_knn, params, cv=5).fit(X_train, y_train)
The K tradeoff
Small K → overfitting (too sensitive to individual points). Large K → underfitting (too smooth). Plot RMSE vs K on a validation set and look for the elbow — usually around K=8–12 for tabular data.
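The small-K extreme is easy to demonstrate: with K=1 every training point is its own nearest neighbor, so train error collapses to zero (assuming no duplicate feature rows). A sketch on synthetic data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(0, 0.5, size=200)

# K=1 memorizes the training set; K=50 oversmooths it
mae_k1 = mean_absolute_error(
    y, KNeighborsRegressor(n_neighbors=1).fit(X, y).predict(X))
mae_k50 = mean_absolute_error(
    y, KNeighborsRegressor(n_neighbors=50).fit(X, y).predict(X))
print(mae_k1, mae_k50)  # train error: 0 for K=1, clearly positive for K=50
```

Zero train error at K=1 says nothing about test error, which is exactly why the elbow must be found on a validation set.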
09 · Tree Models
Decision Tree Regressor — Interpretable Splits
python · decision tree + HPT + visualization
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
from sklearn.tree import export_text
dt_pipeline = Pipeline([
("scaler", StandardScaler()),
("dt_reg", DecisionTreeRegressor(max_depth=3, random_state=42))
])
# Tune max_depth + min_samples_split jointly
param_grid = {
'dt_reg__max_depth': range(1, 20),
'dt_reg__min_samples_split': range(2, 8)
}
dt_search = GridSearchCV(dt_pipeline, param_grid, cv=cv,
scoring="neg_mean_absolute_error")
dt_search.fit(X_train, y_train)
print(dt_search.best_params_)
# Retrain in place with the best params found by the search
dt_pipeline.set_params(**dt_search.best_params_).fit(X_train, y_train)
# Visualize — tree diagram
plt.figure(figsize=(28, 8))
tree.plot_tree(dt_pipeline[-1], feature_names=X_train.columns,
rounded=True, filled=True, fontsize=12)
# Visualize — text rules (for copy-paste into reports)
print(export_text(dt_pipeline[-1]))
10 · Ensemble · Bagging
Bagging & Random Forests — Averaging Out Variance
Bagging trains multiple models on random subsets and averages predictions. Random Forest goes further by also randomizing features at each split, creating more decorrelated trees.
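The variance-reduction claim can be sanity-checked on a synthetic benchmark: one deep tree versus a forest of them on the same split. A sketch, using make_friedman1 purely as a convenient nonlinear testbed:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X, y = make_friedman1(n_samples=600, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Single unpruned tree: low bias, high variance
tree_mae = mean_absolute_error(
    y_te, DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr).predict(X_te))

# Averaging 100 decorrelated trees cancels much of that variance
forest_mae = mean_absolute_error(
    y_te, RandomForestRegressor(n_estimators=100,
                                random_state=0).fit(X_tr, y_tr).predict(X_te))
print(f"single tree MAE={tree_mae:.2f}, forest MAE={forest_mae:.2f}")
```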
BaggingRegressor
Bagging
from sklearn.ensemble import BaggingRegressor
BaggingRegressor(
n_estimators=100,
max_features=8,
max_samples=0.5,
bootstrap=False,
bootstrap_features=False
)
RandomForestRegressor ⭐
RandomForest
from sklearn.ensemble import RandomForestRegressor
RandomForestRegressor(
n_estimators=100,
max_features='sqrt',
min_samples_split=2,
bootstrap=True,
n_jobs=-1
)
RandomizedSearchCV for Random Forest
python · random search
from sklearn.model_selection import RandomizedSearchCV
param_distributions = {
    "n_estimators": [1, 2, 5, 10, 20, 50, 100, 200, 500],
    "max_leaf_nodes": [2, 5, 10, 20, 50, 100],
}
search = RandomizedSearchCV(
RandomForestRegressor(n_jobs=2), param_distributions,
n_iter=10, scoring="neg_mean_absolute_error", random_state=0
)
search.fit(X_train, y_train)
final_error = -search.score(X_test, y_test) # only now use test set
VotingRegressor — blend multiple model types
python · voting ensemble
from sklearn.ensemble import VotingRegressor
vr = VotingRegressor(estimators=[
('lr', LinearRegression()),
('dt', DecisionTreeRegressor()),
('knn', KNeighborsRegressor())
])
pipeline = Pipeline([("preprocessor", preprocessor), ("vr", vr)])
11 · Ensemble · Boosting
Boosting — Sequential Error Correction
Unlike bagging (parallel), boosting trains models sequentially where each model corrects the errors of the previous. This typically achieves the highest accuracy on tabular data.
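The "corrects the errors of the previous" idea can be sketched in a few lines: each new shallow tree is trained on the residuals of the running prediction. This is a hand-rolled toy to show the mechanism, not the library implementation:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=400, noise=0.5, random_state=1)

prediction = np.full_like(y, y.mean())  # stage 0: constant prediction
learning_rate = 0.5
train_mse = [mean_squared_error(y, prediction)]

for _ in range(20):
    residuals = y - prediction  # what the ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=2, random_state=1).fit(X, residuals)
    prediction = prediction + learning_rate * stump.predict(X)
    train_mse.append(mean_squared_error(y, prediction))

print(train_mse[0], train_mse[-1])  # training MSE drops stage after stage
```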
python · all three boosting models
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor # pip install xgboost
# AdaBoost — reweights hard examples
AdaBoostRegressor(n_estimators=100, learning_rate=0.1, loss='linear')
# GradientBoosting — fits residuals
GradientBoostingRegressor(n_estimators=100, max_depth=5, learning_rate=0.1)
# XGBoost — regularized GBM, usually wins
XGBRegressor(
objective='reg:squarederror',
max_depth=5,
alpha=10, # L1 regularization
n_estimators=2000,
learning_rate=0.1,
colsample_bytree=1
)
Reusable helper for comparing all models
python · train_regressor helper
def train_regressor(estimator, X_train, y_train, cv, name):
    cv_results = cross_validate(
        estimator, X_train, y_train, cv=cv,
        scoring="neg_mean_absolute_error", return_train_score=True)
    train_err = -1 * cv_results['train_score']
    test_err = -1 * cv_results['test_score']
    print(f"[{name}]")
    print(f"  Train: {train_err.mean():.3f}k ± {train_err.std():.3f}k")
    print(f"  Test:  {test_err.mean():.3f}k ± {test_err.std():.3f}k")
# Compare all boosters at once
for name, model in [
    ("AdaBoost", AdaBoostRegressor()),
    ("GradientBoost", GradientBoostingRegressor()),
    ("XGBoost", XGBRegressor(objective='reg:squarederror')),
]:
    train_regressor(model, X_train, y_train, cv, name)
12 · Neural Network
MLP Regressor — Sklearn Neural Network
python · MLPRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_percentage_error
pipe = Pipeline([
('scaler', StandardScaler()),
('regressor', MLPRegressor(
hidden_layer_sizes=(32, 32, 32), # 3 hidden layers, 32 neurons each
activation='relu',
solver='adam',
max_iter=500,
random_state=42
))
])
cv_results = cross_validate(pipe, X_train, y_train, cv=cv,
scoring="neg_mean_absolute_percentage_error",
return_train_score=True)
# Scatter plot: predicted vs actual
pipe.fit(X_train, y_train)
plt.plot(y_test, pipe.predict(X_test), 'b*')
plt.plot(y_test, y_test, 'g-') # perfect prediction line
Key hyperparameters
hidden_layer_sizes (architecture), alpha (L2 regularization strength), learning_rate_init, activation (relu/tanh/logistic). For serious neural networks, use PyTorch — MLPRegressor is great for quick experimentation.
13 · Optimization
Hyperparameter Tuning — The Right Tool for Each Job
GridSearchCV — exhaustive
Grid Search
from sklearn.model_selection import GridSearchCV
param_grid = {
'poly__degree': [1, 2, 3],
'ridge__alpha': np.logspace(-4, 0, 20)
}
search = GridSearchCV(
pipeline, param_grid,
cv=cv, n_jobs=2,
scoring="neg_mean_absolute_error"
)
search.fit(X_train, y_train)
print(search.best_params_)
print(-search.best_score_)
RandomizedSearchCV — efficient
Random Search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, uniform
param_dist = {
    'poly__degree': [1, 2, 3],         # "step__param" format, assuming steps
    'sgd__eta0': loguniform(1e-5, 1),  # named "poly" and "sgd" in the pipeline
    'sgd__l1_ratio': uniform(0, 1)
}
search = RandomizedSearchCV(
pipeline, param_dist,
n_iter=10, cv=cv
)
search.fit(X_train, y_train)
Extract results from any search object
python · result extraction
best_idx = search.best_index_
train_err = -1 * search.cv_results_['mean_train_score'][best_idx]
test_err = -1 * search.cv_results_['mean_test_score'][best_idx]
std_train = search.cv_results_['std_train_score'][best_idx]
std_test = search.cv_results_['std_test_score'][best_idx]
best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
final_err = -search.score(X_test, y_test) # call this ONCE, at the very end
| Method | Best for | Notes |
| GridSearchCV | Small, discrete param grids | Exhaustive, all combinations |
| RandomizedSearchCV | Large / continuous spaces | Budget: set n_iter |
| RidgeCV / LassoCV | Only tuning alpha for linear models | Fastest — built-in CV |
| validation_curve | Understanding a single param's effect | Best for visualization and debugging |
14 · Evaluation
Cross Validation — Three Levels of Detail
python · choose your CV API
from sklearn.model_selection import cross_val_score, cross_validate, cross_val_predict
# Level 1: just the test scores
scores = cross_val_score(pipeline, X_train, y_train,
cv=cv, scoring='neg_mean_squared_error')
mse = -scores
# Level 2: test + train scores + estimators (use this most often)
results = cross_validate(pipeline, X_train, y_train, cv=cv,
scoring="neg_mean_absolute_error",
return_train_score=True,
return_estimator=True)
# Level 3: OOF predictions for scatter plots (no leakage)
oof_preds = cross_val_predict(pipeline, X_train, y_train)
plt.scatter(y_train, oof_preds) # residuals / actual vs predicted
| API | Returns | When to use |
| cross_val_score | Array of test scores per fold | Quick performance check |
| cross_validate | Dict: scores + estimators + times | Full diagnostic info |
| cross_val_predict | OOF predictions (no train leakage) | Scatter plots, residual analysis |
Mixed data preprocessing with ColumnTransformer
python · for datasets with numeric + categorical columns
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
numeric_transformer = Pipeline([
("imputer", SimpleImputer(missing_values=-1, strategy="mean")),
("scaler", StandardScaler())
])
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),  # don't crash on unseen categories
    ("ord", OrdinalEncoder(), ordinal_cols),
])
# Plug into any model pipeline
Pipeline([("preprocessor", preprocessor), ("model", RandomForestRegressor())])
15 · Metrics
All Regression Metrics — In One Place
python · all metrics
from sklearn.metrics import (
mean_absolute_error, # MAE – interpretable in target units
mean_squared_error, # MSE – penalizes large errors more
mean_absolute_percentage_error, # MAPE – scale-free percentage
r2_score # R² – fraction of variance explained
)
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred, squared=False)  # newer sklearn: root_mean_squared_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
| Metric | Scoring string | Best value | Sensitive to outliers |
| MAE | neg_mean_absolute_error | 0 | No |
| MSE | neg_mean_squared_error | 0 | Yes |
| RMSE | neg_root_mean_squared_error | 0 | Yes |
| MAPE | neg_mean_absolute_percentage_error | 0 | No |
| R² | r2 | 1.0 | Medium |
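The metrics and their relationships can be checked by hand on a tiny example with absolute errors 1, 0, 2:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]  # absolute errors: 1, 0, 2

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 4/2)... = (1+0+2)/3 = 1.0
mse = mean_squared_error(y_true, y_pred)   # (1+0+4)/3 ≈ 1.667
rmse = mse ** 0.5                          # ≈ 1.291; RMSE ≥ MAE always
r2 = r2_score(y_true, y_pred)              # 1 - SSres/SStot = 1 - 5/8 = 0.375
print(mae, mse, rmse, r2)
```

RMSE exceeding MAE on the same predictions is the "sensitive to outliers" row in action: squaring weights the error of 2 more heavily than the error of 1.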
16 · Diagnosis
Model Diagnosis — Reading the Signals
| What you see | Diagnosis | Fix |
| Train error ≈ Test error, both high | Underfitting | Add poly features, use more expressive model, reduce regularization |
| Train error low, Test error much higher | Overfitting | Add Ridge/Lasso, reduce tree depth, use ensemble, add data |
| Train error ≈ Test error, both low | Good fit | Ship it — then try XGBoost for marginal gains |
| High std in CV scores | Unstable | More CV folds, more data, ensemble methods |
Learning curves (the gold standard diagnostic)
python · learning curve
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
pipeline, X_train, y_train,
cv=cv, scoring='neg_mean_squared_error',
train_sizes=np.linspace(0.2, 1.0, 10),
return_times=True, n_jobs=-1)
plt.plot(train_sizes, -train_scores.mean(axis=1), 'r-o', label='Train error')
plt.plot(train_sizes, -test_scores.mean(axis=1), 'g-o', label='Val error')
plt.legend(); plt.xlabel('Training samples'); plt.ylabel('MSE')
Which model to reach for first?
| Situation | Start here |
| Small dataset, linear relationships | LinearRegression → Ridge |
| Need human-readable decision rules | DecisionTreeRegressor (shallow) |
| Large dataset, complex patterns | RandomForest → XGBoost |
| Kaggle / competition / best accuracy | XGBoost with RandomizedSearchCV |
| Mixed numeric + categorical features | Any model + ColumnTransformer |
| Very large N, can't fit in memory | SGDRegressor |
Always build a dummy baseline. Always use Pipeline. Always report train and test error together.