Getting your data ready for machine learning can feel like gearing up for a space mission with pandas as your trusty spaceship. Let’s blast through the essential preprocessing steps.
Encoding Categorical Variables
Most ML algorithms love numbers, so those text categories need to become digits. Pandas makes this easy with get_dummies.
```python
import pandas as pd

# Sample DataFrame with a categorical feature
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']
})

# Convert the categorical variable into dummy/indicator variables
encoded_df = pd.get_dummies(df, columns=['Color'])
print(encoded_df)
```
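One wrinkle worth knowing: get_dummies creates one column per category, and those columns are perfectly collinear (they always sum to 1), which can trip up linear models. Passing drop_first=True drops the first category as a baseline. A quick sketch using the same Color column:

```python
import pandas as pd

# Same categorical column as above
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# drop_first=True drops the alphabetically first category (Blue),
# leaving Color_Green and Color_Red; Blue is implied when both are 0
encoded = pd.get_dummies(df, columns=['Color'], drop_first=True)
print(encoded.columns.tolist())
```

Tree-based models don't care about this redundancy, but for regression-style models the baseline version is the safer default.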
Feature Scaling: Keeping Everything in Proportion
Feature scaling stops a large-scale feature (like salary in the tens of thousands) from overshadowing a small-scale one (like age), so your model weighs features by signal rather than by unit. Use pandas alongside scikit-learn for this.
```python
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000]
})

# Apply StandardScaler: each column gets zero mean and unit variance
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_scaled)
```
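One caveat: in a real project you fit the scaler on the training data only and reuse its statistics on the test set, so no information leaks from test to train. A minimal sketch with made-up numbers:

```python
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Hypothetical train/test split
train = pd.DataFrame({'Age': [25, 30, 35], 'Salary': [50000, 60000, 70000]})
test = pd.DataFrame({'Age': [40], 'Salary': [80000]})

scaler = StandardScaler()
# fit_transform learns mean/std from train and scales it
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns)
# transform (no fit!) reuses the train statistics on test data
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns)
```

Calling fit_transform on the test set instead would silently leak test-set statistics into your preprocessing.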
Handling Missing Values: Fill Them or Kill Them
Missing data can throw a wrench in your models. Fill in the gaps with fillna or drop them with dropna, depending on your strategy.
```python
# Fill missing values with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Or drop rows with missing values in that column
df.dropna(subset=['Age'], inplace=True)
```
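The mean only works for numeric columns; for text categories, a common fallback is the most frequent value (the mode). A sketch with hypothetical data containing both kinds of gap:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with a numeric gap and a categorical gap
df = pd.DataFrame({
    'Age': [25, np.nan, 35, 40],
    'City': ['Oslo', 'Oslo', None, 'Bergen']
})

# Numeric column: fill with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Categorical column: fill with the most frequent value
df['City'] = df['City'].fillna(df['City'].mode()[0])
```

Which strategy is right depends on why the data is missing; mean and mode imputation are reasonable defaults, not universal answers.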
Automating with Pipelines
Once you’ve got the hang of preprocessing, automate the process with sklearn’s Pipeline to streamline your workflow from raw data to ready-to-train.
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Define preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, ['CategoricalColumn'])
    ])

# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])
```
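To see the pipeline in action, here is a sketch that fits it on a small hypothetical DataFrame (column name and values invented for illustration): the imputer replaces the missing entry with the string 'missing', and the encoder then one-hot encodes all three resulting categories.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data with one missing entry
df = pd.DataFrame({'CategoricalColumn': ['A', 'B', np.nan, 'A']})

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, ['CategoricalColumn'])
])
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Fit and transform in one call: 4 rows, one column per
# category ('A', 'B', 'missing')
result = pipeline.fit_transform(df)
print(result.shape)
```

Because handle_unknown='ignore' is set, categories that appear only at prediction time are encoded as all zeros instead of raising an error, and you can later append an estimator as a final pipeline step to go from raw data straight to a fitted model.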