Pandas apply: Transform Data with Functions (Complete Guide)

What is apply()?

The apply() method applies a function along an axis (rows or columns) of a DataFrame. It’s a powerful tool for data transformation when built-in methods aren’t sufficient.

When to use apply():

  • Transform data with custom logic that pandas doesn’t provide
  • Apply the same operation to every row or column
  • Conditional transformations based on multiple columns
  • Convert data types or formats
  • Create new calculated columns

Key variants:

  • apply(): Apply function to rows or columns of DataFrame
  • applymap(): Apply function to each element of a DataFrame (deprecated since pandas 2.1; use DataFrame.map())
  • map(): Apply function to Series elements
⚠️ Performance Warning: apply() can be slow on large datasets. Vectorization is almost always faster!
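As a quick orientation, the sketch below runs each kind of call on a tiny made-up DataFrame (the column names a and b are illustrative only). The element-wise step is written as apply() plus Series.map(), an assumption made here so the example does not depend on which pandas version is installed.

```python
import pandas as pd

# Tiny illustrative frame; 'a' and 'b' are made-up column names
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# apply(): the function receives a whole column (axis=0 is the default)
col_totals = df.apply(lambda col: col.sum())

# Series.map(): the function receives one element at a time
doubled = df['a'].map(lambda v: v * 2)

# Element-wise over the whole frame: map each column in turn
squared = df.apply(lambda col: col.map(lambda v: v ** 2))

print(col_totals, doubled, squared, sep='\n\n')
```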

 

Basic Syntax and Usage

import pandas as pd
import numpy as np

# Create sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 75000, 60000, 55000]
})

print("Original DataFrame:")
print(df)

# Basic syntax:
# df.apply(function, axis=0) # axis=0: columns, axis=1: rows

Understanding axis parameter

# axis=0: Apply function to each column (top to bottom)
# axis=1: Apply function to each row (left to right)

# Example: Get column maximums
df.apply(lambda x: x.max(), axis=0)

# Example: Get row sums (numeric columns only; summing a row that includes the string 'Name' fails)
df[['Age', 'Salary']].apply(lambda x: x.sum(), axis=1)
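One way to keep the two axes straight is to look at how the result is indexed. The sketch below uses only the numeric columns of the sample data and is meant as an illustration, not the only way to check this.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 75000, 60000, 55000]
})

# axis=0: the function sees each column, so the result is indexed by column name
per_column = df.apply(lambda col: col.max(), axis=0)

# axis=1: the function sees each row, so the result is indexed like the rows
per_row = df.apply(lambda row: row.sum(), axis=1)

print(per_column.index.tolist())  # ['Age', 'Salary']
print(per_row.tolist())           # [50025, 75030, 60035, 55040]
```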

Apply to Columns (axis=0)

Simple Column Operation

# Get the maximum value for each column
result = df.apply(lambda x: x.max())
print(result)

Output:

Name             David
Age                 40
Salary          75000
dtype: object

Multiple Statistics

# Get the range (max - min) for numeric columns, 'N/A' otherwise
df.apply(lambda x: x.max() - x.min() if pd.api.types.is_numeric_dtype(x) else 'N/A')
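If you want several statistics per column at once, one possible approach is to return a Series from the function; each label in that Series becomes a row of the result. The labels min, max, and range below are arbitrary names chosen for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 75000, 60000, 55000]
})

# Keep only numeric columns so min/max arithmetic is well defined
numeric = df.select_dtypes(include='number')

# Returning a Series from the function yields one result row per label
stats = numeric.apply(
    lambda col: pd.Series({
        'min': col.min(),
        'max': col.max(),
        'range': col.max() - col.min(),
    })
)
print(stats)
```

For plain min and max, the built-in numeric.agg(['min', 'max']) gives the same rows without a custom function.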

Column-Specific Functions

# Different function per column
def process_column(series):
    if pd.api.types.is_numeric_dtype(series): # Numeric column
        return series * 1.1 # Increase by 10%
    else: # String column
        return series.str.upper()

result = df.apply(process_column)
print(result)

Apply to Rows (axis=1)

Simple Row Operation

# Add Age and Salary for each row
df['Total'] = df.apply(lambda row: row['Age'] + row['Salary'], axis=1)
print(df)

Create New Columns from Multiple Existing Columns

# Create 'Senior' column based on Age
df['Senior'] = df.apply(lambda row: row['Age'] > 35, axis=1)

# Create 'Status' based on multiple conditions
df['Status'] = df.apply(
    lambda row: 'High Earner' if row['Salary'] > 60000 else 'Standard',
    axis=1
)
print(df[['Name', 'Age', 'Salary', 'Senior', 'Status']])

Output:

     Name  Age  Salary  Senior       Status
0   Alice   25   50000   False      Standard
1     Bob   30   75000   False   High Earner
2 Charlie   35   60000   False      Standard
3   David   40   55000    True      Standard
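When several new columns come from the same row logic, a row function can return a Series so that all of them are produced in one pass. This is a minimal sketch; the helper name flags and the column name High_Earner are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 75000, 60000, 55000]
})

def flags(row):
    # Hypothetical helper: the Series labels become the new column names
    return pd.Series({
        'Senior': row['Age'] > 35,
        'High_Earner': row['Salary'] > 60000,
    })

new_cols = df.apply(flags, axis=1)
df = pd.concat([df, new_cols], axis=1)
print(df)
```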

Row Maximum Value

# Find maximum value in each row (numeric columns only)
df_numeric = df[['Age', 'Salary']]
max_per_row = df_numeric.apply(lambda row: row.max(), axis=1)
print(max_per_row)

Lambda Functions Explained

A lambda is a shorthand way to create an anonymous function, which makes it a natural fit for short apply() calls.

Lambda Syntax

# Basic lambda
lambda x: x * 2

# Lambda with multiple arguments
lambda x, y: x + y

# Lambda with conditions
lambda x: 'Even' if x % 2 == 0 else 'Odd'

# In apply() context:
df['Age'].apply(lambda x: x * 2) # Double each age

Lambda vs Named Function

# Using lambda (concise)
df['Age'].apply(lambda x: x ** 2)

# Using named function (clearer for complex logic)
def square(x):
    return x ** 2

df['Age'].apply(square) # Same result

Lambda with Conditions

# Convert age to age group
df['Age_Group'] = df['Age'].apply(
    lambda x: 'Young' if x < 30 else ('Middle' if x < 40 else 'Senior')
)
print(df)

Lambda with Multiple Conditions

# Complex categorization
df['Category'] = df.apply(
    lambda row: 'High Pay Senior' if row['Salary'] > 60000 and row['Age'] > 35
    else ('High Pay Junior' if row['Salary'] > 60000 else 'Standard'),
    axis=1
)

💡 Lambda Tips:

  • Lambda is great for short, simple operations
  • Use def for complex multi-line logic
  • Lambda functions have access to variables in scope
  • Can’t include statements, only expressions
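The last two tips can be seen in a few lines. In this sketch, threshold is a made-up variable defined outside the lambda, and the commented-out line shows the kind of statement a lambda cannot contain.

```python
import pandas as pd

df = pd.DataFrame({'Salary': [50000, 75000, 60000, 55000]})

threshold = 60000  # defined outside the lambda, looked up when the lambda runs

df['Above'] = df['Salary'].apply(lambda s: s > threshold)
print(df['Above'].tolist())  # [False, True, False, False]

# A lambda body must be a single expression; a statement does not parse:
# lambda s: if s > threshold: return 'High'   # SyntaxError
```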

Custom Functions with apply()

Single Parameter Functions

# Define custom function
def classify_age(age):
    if age < 25:
        return 'Young'
    elif age < 35:
        return 'Mid-Career'
    else:
        return 'Senior'

# Apply to column
df['Age_Category'] = df['Age'].apply(classify_age)
print(df)

Multi-Parameter Functions with args

# Function with additional parameters
def scale_salary(salary, multiplier=1.1):
    return salary * multiplier

# Apply with args parameter
df['Scaled_Salary'] = df['Salary'].apply(scale_salary, args=(1.2,))

# Or with kwargs
df['Scaled_Salary'] = df['Salary'].apply(scale_salary, multiplier=1.15)
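The same args and keyword mechanism is available on DataFrame.apply() for row functions. The sketch below assumes a hypothetical raise_if_older function with made-up parameters min_age and pct.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 75000, 60000, 55000]
})

def raise_if_older(row, min_age, pct=0.05):
    # min_age and pct are hypothetical parameters for this example
    if row['Age'] >= min_age:
        return row['Salary'] * (1 + pct)
    return row['Salary']

# args fills the positional parameters after the row;
# extra keyword arguments are passed through to the function
df['New_Salary'] = df.apply(raise_if_older, axis=1, args=(35,), pct=0.10)
print(df)
```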

Row-Based Custom Function

# Function using entire row
def calculate_bonus(row):
    base_bonus = row['Salary'] * 0.1 # 10% base
    if row['Age'] > 30:
        base_bonus *= 1.2 # 20% extra for experience
    return base_bonus

df['Bonus'] = df.apply(calculate_bonus, axis=1)
print(df[['Name', 'Salary', 'Bonus']])
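Row functions like this one can often be rewritten without apply(). As a hedged comparison, the sketch below computes the same bonus rule with np.where and checks that both versions agree on the sample data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 75000, 60000, 55000]
})

def calculate_bonus(row):
    base_bonus = row['Salary'] * 0.1  # 10% base
    if row['Age'] > 30:
        base_bonus *= 1.2  # 20% extra
    return base_bonus

bonus_apply = df.apply(calculate_bonus, axis=1)

# Same rule without a Python-level loop: pick the multiplier per row, then multiply
multiplier = np.where(df['Age'] > 30, 1.2, 1.0)
bonus_vectorized = df['Salary'] * 0.1 * multiplier

print(np.allclose(bonus_apply, bonus_vectorized))  # True
```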

Apply with Multiple Columns

Access Multiple Columns in apply()

# Create salary-to-age ratio
df['Salary_Per_Year_Age'] = df.apply(
    lambda row: row['Salary'] / row['Age'],
    axis=1
)

# String concatenation from multiple columns
df['Full_Info'] = df.apply(
    lambda row: f"{row['Name']} ({row['Age']} years old)",
    axis=1
)
print(df)

Conditional Logic Across Columns

# Complex logic using multiple columns
df['Performance'] = df.apply(
    lambda row: 'Excellent' if row['Salary'] > 70000 and row['Age'] > 30
    else ('Good' if row['Salary'] > 50000 else 'Average'),
    axis=1
)
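For comparison, the same categorization can be sketched with boolean masks and np.select, which checks the conditions in order and uses the first match. This is one possible vectorized form, shown on the sample data from this guide.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 75000, 60000, 55000]
})

# Row-wise version, as in the apply() example
via_apply = df.apply(
    lambda row: 'Excellent' if row['Salary'] > 70000 and row['Age'] > 30
    else ('Good' if row['Salary'] > 50000 else 'Average'),
    axis=1
)

# Vectorized version: combine masks with & and wrap each comparison in parentheses
conditions = [
    (df['Salary'] > 70000) & (df['Age'] > 30),
    df['Salary'] > 50000,
]
labels = ['Excellent', 'Good']
via_select = pd.Series(np.select(conditions, labels, default='Average'), index=df.index)

print(via_select.tolist())  # ['Average', 'Good', 'Good', 'Good']
```

Note that Bob earns more than 70000 but is exactly 30, so he lands in 'Good' under both versions.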

Apply to Subset of Columns

# Apply only to numeric columns
numeric_cols = df.select_dtypes(include=['number'])
result = numeric_cols.apply(lambda x: x.mean())
print(result)

applymap vs apply vs map

Method      Target              Use Case                      Example
apply()     Rows/Columns        Transform along axis          df.apply(lambda x: x.sum())
map()       Series elements     Transform individual values   df['col'].map(lambda x: x*2)
applymap()  DataFrame elements  Format all values             df.applymap(lambda x: f'{x:.2f}')

map() for Series

# Map with a dictionary (ages missing from the dict, such as 40, become NaN)
status_map = {25: 'Young', 30: 'Mid', 35: 'Senior'}
df['Status'] = df['Age'].map(status_map)

# Map with function
df['Age_Double'] = df['Age'].map(lambda x: x * 2)
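A dictionary passed to map() only covers the keys it contains. The sketch below shows what happens to an unmapped value and one possible way to supply a fallback; the label 'Unknown' is an arbitrary choice.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40])
status_map = {25: 'Young', 30: 'Mid', 35: 'Senior'}  # no entry for 40

mapped = ages.map(status_map)
print(mapped.isna().tolist())  # [False, False, False, True]

# Fill the gaps left by missing keys with a fallback label
filled = ages.map(status_map).fillna('Unknown')
print(filled.tolist())  # ['Young', 'Mid', 'Senior', 'Unknown']
```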

applymap() for DataFrame (Element-wise)

# Format all numeric values to 2 decimal places
# (bools are excluded explicitly because bool is a subclass of int)
df_formatted = df.applymap(
    lambda x: f'{x:.2f}' if isinstance(x, (int, float)) and not isinstance(x, bool) else x
)
💡 Note: In pandas 2.1+, applymap() is deprecated. Use map() for Series or DataFrame.map() for DataFrames instead.
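If the same code has to run on both older and newer pandas versions, one possible pattern is to pick whichever element-wise method is available at runtime. The variable name elementwise is made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.5], 'y': [3.14159, 4.0]})

# Prefer DataFrame.map (pandas 2.1+); fall back to applymap on older versions
elementwise = getattr(df, 'map', None) or df.applymap

formatted = elementwise(lambda v: f'{v:.2f}')
print(formatted)
```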

Vectorization – The Better Alternative

For most cases, vectorized operations are much faster than apply(), often in the range of 10-100x on large datasets.

Simple Arithmetic

# ❌ SLOW - using apply with lambda
df['Salary_Double'] = df['Salary'].apply(lambda x: x * 2)

# ✅ FAST - vectorized operation
df['Salary_Double'] = df['Salary'] * 2

String Operations

# ❌ SLOW - using apply
df['Name_Upper'] = df['Name'].apply(lambda x: x.upper())

# ✅ FAST - vectorized
df['Name_Upper'] = df['Name'].str.upper()

Conditional Logic with np.where() and np.select()

import numpy as np

# ❌ SLOW - using apply
df['Status'] = df['Salary'].apply(
    lambda x: 'High' if x > 60000 else 'Low'
)

# ✅ FAST - vectorized with np.where
df['Status'] = np.where(df['Salary'] > 60000, 'High', 'Low')

# ✅ Even better for multiple conditions - np.select (the first matching condition wins)
conditions = [df['Salary'] > 70000, df['Salary'] > 50000]
values = ['Very High', 'High']
df['Status'] = np.select(conditions, values, default='Low') # String default keeps the result dtype consistent
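When the conditions are just value ranges of a single column, pd.cut is another option worth knowing. It is a different technique from np.select, sketched here with the same salary bands; the bin edges follow the thresholds used above.

```python
import numpy as np
import pandas as pd

salary = pd.Series([50000, 75000, 60000, 55000])

# Bins are right-inclusive by default: (-inf, 50000], (50000, 70000], (70000, inf)
bands = pd.cut(
    salary,
    bins=[-np.inf, 50000, 70000, np.inf],
    labels=['Low', 'High', 'Very High'],
)
print(bands.astype(str).tolist())  # ['Low', 'Very High', 'High', 'High']
```

The result is a categorical Series, which also keeps the band order for sorting.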

When to Avoid apply()

# ❌ DON'T use apply for these (too slow):
df.apply(lambda x: x * 2) # Use df * 2
df.apply(lambda x: x.sum()) # Use df.sum()
df.apply(lambda x: x > 5) # Use df > 5
df['Col'].apply(lambda x: x.upper()) # Use df['Col'].str.upper()

# ✅ OK to use apply() for:
df.apply(custom_complex_function) # When custom logic can't be vectorized
df.apply(lambda row: row_specific_logic, axis=1) # Row-specific calculations

Performance Optimization

🚀 Speed Up apply()

1. Use Vectorization First

# Illustrative benchmark, 1M rows (timings vary with hardware, data, and the function applied):
# apply(): ~50 seconds
# Vectorized: ~0.5 seconds

# Roughly 100x faster with vectorization in this case
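Numbers like these depend heavily on the machine and the function, so it is worth measuring on your own data. The sketch below is one simple way to do that with time.perf_counter; the 100,000-row size and the synthetic salaries are arbitrary assumptions.

```python
import time

import numpy as np
import pandas as pd

# 100,000 synthetic salaries; the size is an arbitrary choice for a quick check
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(30_000, 120_000, size=100_000))

start = time.perf_counter()
via_apply = s.apply(lambda x: x * 2)  # Python function called once per element
apply_seconds = time.perf_counter() - start

start = time.perf_counter()
via_vector = s * 2  # single vectorized operation
vector_seconds = time.perf_counter() - start

print(f'apply: {apply_seconds:.4f}s, vectorized: {vector_seconds:.4f}s')
```

In a notebook, %timeit gives more stable measurements than a single run.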

2. Use numba for Complex Calculations

from numba import jit

@jit(nopython=True)
def complex_calc(x):
    result = 0
    for i in range(x):
        result += i ** 2
    return result

# The compiled function runs much faster than the same loop in pure Python

3. Use Cython for Critical Code

# For functions called millions of times,
# consider Cython compilation (advanced)

4. Minimize I/O in Functions

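# ❌ SLOW - does I/O on every call (rates.txt is a hypothetical file)
def slow_func(x):
    with open('rates.txt') as f:
        return x * float(f.read())

# ✅ FAST - read once outside the function, then reuse the value
with open('rates.txt') as f:
    rate = float(f.read())
df['Salary'].apply(lambda x: x * rate)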

Common Mistakes to Avoid

⚠️ Mistake #1: Using apply() When Vectorization Works

# ❌ WRONG - far slower on large data
df['Result'] = df['Salary'].apply(lambda x: x * 1.1)

# ✅ CORRECT - vectorized
df['Result'] = df['Salary'] * 1.1

⚠️ Mistake #2: Forgetting axis Parameter

# ❌ WRONG - applies to columns (confusing result)
result = df.apply(lambda x: x + 1) # axis=0 by default

# ✅ CORRECT - specify axis explicitly
result = df.apply(lambda x: x + 1, axis=0) # Columns
result = df.apply(lambda row: row['Col1'] + row['Col2'], axis=1) # Rows

⚠️ Mistake #3: Not Returning Values

# ❌ WRONG - function doesn't return
def update_salary(salary):
    salary * 1.1 # Missing return!

# ✅ CORRECT
def update_salary(salary):
    return salary * 1.1
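The symptom of a missing return is easy to spot once you know what to look for: every element of the result is None. A minimal sketch, with the function names changed so both versions can live side by side:

```python
import pandas as pd

salaries = pd.Series([50000, 75000])

def update_salary_no_return(salary):
    salary * 1.1  # the result is computed and then discarded

def update_salary(salary):
    return salary * 1.1

broken = salaries.apply(update_salary_no_return)
fixed = salaries.apply(update_salary)

print(broken.isna().all())  # True: every element is None
print(fixed)
```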

⚠️ Mistake #4: Not Handling Different Data Types

# ❌ WRONG - object-dtype string columns are silently repeated ('Alice' becomes 'AliceAlice')
df.apply(lambda x: x * 2) # Unintended result on string columns

# ✅ CORRECT - check data type
df.apply(lambda x: x * 2 if pd.api.types.is_numeric_dtype(x) else x)
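An alternative that avoids apply() altogether is to select the numeric columns first and operate on them directly. This is a sketch of that idea on a two-row version of the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'Salary': [50000, 75000]
})

# Column labels of the numeric columns only
num_cols = df.select_dtypes(include='number').columns

doubled = df.copy()  # leave the original frame untouched
doubled[num_cols] = doubled[num_cols] * 2
print(doubled)
```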

apply() Mastery

You now understand apply() and data transformation:

  • apply() basics: Transform rows or columns with functions
  • Lambda functions: Quick anonymous functions for simple operations
  • Custom functions: Define functions for complex logic
  • Multiple columns: Access multiple columns in transformations
  • Alternatives: map() for Series, DataFrame.map() (formerly applymap()) for elements
  • Vectorization: Usually preferred for performance (often 10-100x faster)
  • Performance: Use numba/Cython for critical code

Key takeaway: Master vectorization first, use apply() only when necessary for complex custom logic!
