Data preprocessing is often said to consume around 80% of a data scientist's time, and for good reason: it is the foundation of any successful machine learning project. Poor data quality leads to poor model performance, regardless of the algorithm you train afterwards.
The Complete Data Preprocessing Pipeline
- Data Collection & Understanding
- Data Cleaning
- Feature Engineering
- Feature Scaling
- Feature Selection
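Before walking through the individual steps, it helps to see how they fit together. The sketch below chains imputation, encoding, and scaling with scikit-learn's Pipeline and ColumnTransformer so that every preprocessor is fit inside the training fold; it is a minimal illustration, not a prescription, and the column names ('age', 'price', 'category') and the RandomForestClassifier are assumptions borrowed from the examples later in this post.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

numeric_features = ['age', 'price']      # assumed numeric columns
categorical_features = ['category']      # assumed categorical column

# Impute then scale numeric columns; impute then one-hot encode categoricals
numeric_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
categorical_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features),
])

# Fitting preprocessing and model together means cross-validation re-fits the
# preprocessors on each training fold and never sees held-out data
model = Pipeline([
    ('preprocess', preprocessor),
    ('classify', RandomForestClassifier(random_state=0)),
])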
1. Handling Missing Values
import pandas as pd
import numpy as np
# Check missing values
print(df.isnull().sum())
# Drop rows with missing values
df_clean = df.dropna()
# Fill with mean/median/mode
df['age'] = df['age'].fillna(df['age'].median())
# Forward fill for time series (fillna(method='ffill') is deprecated; use .ffill())
df['price'] = df['price'].ffill()
# Advanced: Multiple Imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Note: IterativeImputer works on numeric columns only; encode or drop categoricals first
imputer = IterativeImputer(random_state=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
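Whichever imputer you use, the statistics it learns (medians, category frequencies, regression fits) should come from the training split only. A minimal sketch of the fit/transform pattern, assuming the numeric columns 'age' and 'price' and an 80/20 split (the split itself is an assumption for illustration):
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Assumed numeric feature subset and an 80/20 train/test split
X_train, X_test = train_test_split(df[['age', 'price']], test_size=0.2, random_state=0)

imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train)  # medians learned from the training split only
X_test_imputed = imputer.transform(X_test)        # the same medians reused on the test split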
2. Handling Outliers
# IQR Method
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df_clean = df[(df['price'] >= lower_bound) & (df['price'] <= upper_bound)]
# Z-Score Method
from scipy import stats
z_scores = np.abs(stats.zscore(df['price']))
# Keep rows within 3 standard deviations (assumes an approximately normal distribution)
df_clean = df[z_scores < 3]
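Dropping rows is not always acceptable, for example when every record must be scored. A common alternative is to cap (winsorize) extreme values at the IQR bounds computed above instead of removing them; the 'price_capped' column name below is just an illustrative choice:
# Cap instead of drop: clip values to the IQR bounds computed earlier
df['price_capped'] = df['price'].clip(lower=lower_bound, upper=upper_bound)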
3. Feature Encoding
# Label Encoding (maps each category to an arbitrary integer; best suited to tree models or truly ordinal data)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])
# One-Hot Encoding (get_dummies is convenient offline; sklearn's OneHotEncoder,
# as in the pipeline sketch above, also handles unseen categories in production)
df_encoded = pd.get_dummies(df, columns=['category'])
# Target Encoding (for high-cardinality features; naive version, see the leakage-safe variant below)
category_means = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(category_means)
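The naive version above computes each category's mean from every row, including rows you later evaluate on, which leaks label information. A minimal leakage-safe sketch, assuming hypothetical train_df and valid_df splits of the original data:
# Learn the encoding from the training split only
train_means = train_df.groupby('category')['target'].mean()
global_mean = train_df['target'].mean()

# Apply to both splits; categories unseen in training fall back to the global mean
train_df['category_encoded'] = train_df['category'].map(train_means)
valid_df['category_encoded'] = valid_df['category'].map(train_means).fillna(global_mean)
For stronger protection, out-of-fold encoding (computing each fold's encoding from the other folds) or smoothing toward the global mean is commonly used.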
4. Feature Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Standardization (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Normalization (range 0-1)
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
# Robust Scaling (uses median, robust to outliers)
scaler = RobustScaler()
X_robust = scaler.fit_transform(X)
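Whichever scaler you pick, fit it on the training data only and reuse the fitted object everywhere else; otherwise the test distribution leaks into the scaling parameters. A short sketch, assuming X_train and X_test come from an earlier train/test split:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std estimated from training data only
X_test_scaled = scaler.transform(X_test)        # the same parameters applied to test data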
Common Pitfalls to Avoid
- Data Leakage: Never let test-set information influence preprocessing; fit imputers, encoders, and scalers on the training split only, as in the fit/transform examples above
- Overfitting: Don't create large numbers of engineered features without checking that they actually improve validation performance
- Ignoring Class Imbalance: Use class weights, oversampling (e.g. SMOTE), or undersampling so the model doesn't default to the majority class (sketched after this list)
- Not Saving Preprocessors: Always persist fitted scalers and encoders so production data is transformed exactly as the training data was (sketched after this list)
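Persisting fitted preprocessors is straightforward with joblib; a minimal sketch, where the file name and the X_new variable (incoming production data) are illustrative assumptions:
import joblib

# Save the fitted scaler (or encoder/imputer) after training...
joblib.dump(scaler, 'scaler.joblib')

# ...and load the exact same object in the inference service
scaler = joblib.load('scaler.joblib')
X_new_scaled = scaler.transform(X_new)  # X_new: incoming production data (assumed)
For class imbalance, the lightest-weight option is often class weighting, which requires no resampling at all; a sketch assuming a scikit-learn classifier and the scaled training data from earlier:
from sklearn.linear_model import LogisticRegression

# 'balanced' reweights classes inversely to their frequency in y_train
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train_scaled, y_train)
# Oversampling alternatives such as SMOTE (from the imbalanced-learn package)
# should resample the training split only, never the test split.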