Back to Portfolio

Data Preprocessing Best Practices for ML

Data Science Python November 2024

Data preprocessing is often said to consume 80% of a data scientist's time, yet it's the foundation of any successful machine learning project. Poor data quality leads to poor model performance.

Data Preprocessing Pipeline

The Complete Data Preprocessing Pipeline

  1. Data Collection & Understanding
  2. Data Cleaning
  3. Feature Engineering
  4. Feature Scaling
  5. Feature Selection

1. Handling Missing Values

import pandas as pd
import numpy as np

# Check missing values
print(df.isnull().sum())

# Drop rows with missing values
df_clean = df.dropna()

# Fill with mean/median/mode
df['age'].fillna(df['age'].median(), inplace=True)

# Forward fill for time series
df['price'].fillna(method='ffill', inplace=True)

# Advanced: Multiple Imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(random_state=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

2. Handling Outliers

# IQR Method
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df_clean = df[(df['price'] >= lower_bound) & (df['price'] <= upper_bound)]

# Z-Score Method
from scipy import stats
z_scores = np.abs(stats.zscore(df['price']))
df_clean = df[z_scores < 3]

3. Feature Encoding

# Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])

# One-Hot Encoding
df_encoded = pd.get_dummies(df, columns=['category'])

# Target Encoding (for high cardinality)
category_means = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(category_means)

4. Feature Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Standardization (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Normalization (range 0-1)
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)

# Robust Scaling (uses median, robust to outliers)
scaler = RobustScaler()
X_robust = scaler.fit_transform(X)

Common Pitfalls to Avoid

Back to Portfolio