Pandas and Machine Learning: Preprocessing Techniques

Getting your data ready for machine learning can feel like gearing up for a space mission with pandas as your trusty spaceship. Let’s blast through the essential preprocessing steps.

Encoding Categorical Variables

Most ML algorithms expect numeric input, so text categories have to be converted to numbers first. Pandas makes this easy with get_dummies, which one-hot encodes a column: each category gets its own indicator column.

import pandas as pd

# Sample DataFrame with a categorical feature
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']
})

# Convert categorical variable into dummy/indicator variables
encoded_df = pd.get_dummies(df, columns=['Color'])
print(encoded_df)
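
One thing to watch with get_dummies is that it only creates columns for the categories it sees, so new data can end up with a different set of columns than the training data. A minimal sketch of one way to line them up, assuming two small made-up frames (train and new are illustrative names, not part of the example above):

```python
import pandas as pd

# Hypothetical training data and new data; 'Purple' never appears in training
train = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
new = pd.DataFrame({'Color': ['Blue', 'Purple']})

# Encode each frame separately; dtype=int gives 0/1 instead of booleans
train_enc = pd.get_dummies(train, columns=['Color'], dtype=int)
new_enc = pd.get_dummies(new, columns=['Color'], dtype=int)

# Force the new data onto the training columns:
# missing categories are filled with 0, unseen ones are dropped
new_enc = new_enc.reindex(columns=train_enc.columns, fill_value=0)
print(new_enc)
```

After the reindex, the 'Purple' row is all zeros, which is roughly what OneHotEncoder(handle_unknown='ignore') does in the pipeline section further down.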

Feature Scaling: Keeping Everything in Proportion

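Feature scaling stops a feature with a large numeric range (like Salary) from overshadowing one with a small range (like Age), which matters most for distance-based and gradient-based models. Use Pandas alongside sklearn for this: StandardScaler rescales each column to zero mean and unit variance.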

from sklearn.preprocessing import StandardScaler
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000]
})

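# Apply StandardScaler; fit_transform returns a NumPy array,
# so rebuild the DataFrame with the original column names and index
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)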
print(df_scaled)
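
To see what the scaler is actually doing, here is a small sketch that reproduces it by hand in pandas and compares the two results. It assumes the same Age/Salary frame as above; manual and sk_scaled are just illustrative names. Note that StandardScaler divides by the population standard deviation, so the pandas version needs ddof=0.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000]
})

# Standardize by hand: subtract the column mean, divide by the column std.
# ddof=0 matches StandardScaler (pandas defaults to ddof=1).
manual = (df - df.mean()) / df.std(ddof=0)

# Same thing via sklearn
sk_scaled = StandardScaler().fit_transform(df)

# Check that both approaches agree
print(np.allclose(manual.to_numpy(), sk_scaled))
```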

Handling Missing Values: Fill Them or Kill Them

Missing data can throw a wrench in your models, and many estimators simply refuse to train on NaN values. Fill in the gaps with fillna or drop the affected rows with dropna, depending on how much data you can afford to lose.

# Fill missing values with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Or drop rows where 'Age' is missing (reassigning is usually clearer than inplace=True)
df = df.dropna(subset=['Age'])
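
The frame from the scaling example has no gaps, so here is a self-contained sketch with a made-up frame (people is a hypothetical name) that actually contains missing ages, showing both strategies side by side:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing ages
people = pd.DataFrame({
    'Age': [25, np.nan, 35, 40, np.nan],
    'Salary': [50000, 60000, 70000, 80000, 90000],
})

# Count the gaps before deciding what to do
print(people['Age'].isna().sum())

# Strategy 1: fill with the mean of the observed ages (NaN is skipped by mean())
filled = people.assign(Age=people['Age'].fillna(people['Age'].mean()))

# Strategy 2: drop the rows where Age is missing
dropped = people.dropna(subset=['Age'])

print(len(filled), len(dropped))
```

Filling keeps all five rows, while dropping leaves three. Neither call modifies people, so you can compare both results before committing to one.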

Automating with Pipelines

Once you’ve got the hang of preprocessing, automate the process with sklearn’s Pipeline to streamline your workflow from raw data to ready-to-train.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Define preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, ['CategoricalColumn'])
    ])

# Create the preprocessing pipeline (append an estimator step after it to also train a model)
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])
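
To tie the earlier steps together, here is a rough sketch of the same idea run end to end on a tiny made-up frame. The data, the 'Color' and 'Age' column names, and the extra numeric branch are assumptions for illustration; the snippet above only defines the categorical branch.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with a gap in each column
data = pd.DataFrame({
    'Color': ['Red', 'Blue', np.nan, 'Blue'],
    'Age': [25, np.nan, 35, 40],
})

# Numeric branch: fill gaps with the mean, then standardize
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

# Categorical branch: fill gaps with a placeholder, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Route each column to its branch
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, ['Age']),
    ('cat', categorical_transformer, ['Color']),
])

# Learn the means, scales, and categories from the data, then transform it
features = preprocessor.fit_transform(data)
print(features.shape)

# A category never seen during fit is encoded as all zeros, not an error
new_features = preprocessor.transform(pd.DataFrame({'Color': ['Purple'], 'Age': [30]}))
print(new_features.shape)
```

The output has one scaled Age column plus one indicator column per category found during fit, including the 'missing' placeholder. In a real project you would call fit_transform on the training split only and transform on everything else.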
