What is apply()?
The apply() method applies a function along an axis (rows or columns) of a DataFrame. It’s a powerful tool for data transformation when built-in methods aren’t sufficient.
When to use apply():
- Transform data with custom logic that pandas doesn’t provide
- Apply the same operation to every row or column
- Conditional transformations based on multiple columns
- Convert data types or formats
- Create new calculated columns
Key variants:
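- apply(): Apply a function to the rows or columns of a DataFrame (or to each value of a Series)
- applymap(): Apply a function to each element of a DataFrame (deprecated since pandas 2.1; use DataFrame.map())
- map(): Apply a function, dictionary, or Series lookup to each element of a Series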
⚠️ Performance Warning: apply() can be slow on large datasets. Vectorization is almost always faster!
Basic Syntax and Usage
import pandas as pd
import numpy as np
# Create sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 75000, 60000, 55000]
})
print("Original DataFrame:")
print(df)
# Basic syntax:
# df.apply(function, axis=0) # axis=0: columns, axis=1: rows
Understanding axis parameter
# axis=0: Apply function to each column (top to bottom)
# axis=1: Apply function to each row (left to right)
# Example: Get column maximums
df.apply(lambda x: x.max(), axis=0)
# Example: Get row sums (numeric columns only; summing a row that includes 'Name' raises a TypeError)
df[['Age', 'Salary']].apply(lambda x: x.sum(), axis=1)
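To make the two directions concrete, here is a minimal sketch on a numeric-only copy of the sample data. The name `numeric_df` is introduced here for illustration and is not part of the original example.

```python
import pandas as pd

# Numeric-only frame (hypothetical name) so both directions are well defined
numeric_df = pd.DataFrame({
    "Age": [25, 30, 35, 40],
    "Salary": [50000, 75000, 60000, 55000],
})

# axis=0: the function receives each column as a Series
col_max = numeric_df.apply(lambda col: col.max(), axis=0)

# axis=1: the function receives each row as a Series
row_sum = numeric_df.apply(lambda row: row.sum(), axis=1)

print(col_max)  # one value per column, indexed by column name
print(row_sum)  # one value per row, indexed by the row labels
```

The result of `axis=0` is indexed by column name, while the result of `axis=1` keeps the original row index, which is why it can be assigned straight back as a new column.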
Apply to Columns (axis=0)
Simple Column Operation
# Get the maximum value for each column
result = df.apply(lambda x: x.max())
print(result)
Output:
Name      David
Age          40
Salary    75000
dtype: object
Multiple Statistics
# Get the range (max - min) for numeric columns
df.apply(lambda x: x.max() - x.min() if pd.api.types.is_numeric_dtype(x) else 'N/A')
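If the goal is several statistics per column at once, one option is to have the applied function return a Series. The sketch below assumes the same sample data; the labels `min` and `max` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "Salary": [50000, 75000, 60000, 55000],
})

# Restrict to numeric columns first so min/max are meaningful
numeric = df.select_dtypes(include="number")

# Returning a Series from the function yields one result row per label
stats = numeric.apply(lambda col: pd.Series({"min": col.min(), "max": col.max()}))

# The built-in aggregation gives the same numbers without apply()
stats_builtin = numeric.agg(["min", "max"])

print(stats)
```

For standard statistics like these, the built-in `agg()` call is usually the simpler choice; returning a Series from `apply()` is mainly useful when the statistics are custom.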
Column-Specific Functions
# Different function per column
def process_column(series):
    if pd.api.types.is_numeric_dtype(series):  # Numeric column
        return series * 1.1  # Increase by 10%
    else:  # String column
        return series.str.upper()
result = df.apply(process_column)
print(result)
Apply to Rows (axis=1)
Simple Row Operation
# Sum the numeric columns for each row
# (no isinstance() guard: row values may be NumPy scalars, which fail an (int, float) check)
df['Total'] = df.apply(lambda row: row['Age'] + row['Salary'], axis=1)
print(df)
Create New Columns from Multiple Existing Columns
# Create 'Senior' column based on Age
df['Senior'] = df.apply(lambda row: row['Age'] > 35, axis=1)
# Create 'Status' column based on a Salary threshold
df['Status'] = df.apply(
    lambda row: 'High Earner' if row['Salary'] > 60000 else 'Standard',
    axis=1
)
print(df[['Name', 'Age', 'Salary', 'Senior', 'Status']])
Output:
      Name  Age  Salary  Senior       Status
0    Alice   25   50000   False     Standard
1      Bob   30   75000   False  High Earner
2  Charlie   35   60000   False     Standard
3    David   40   55000    True     Standard
Row Maximum Value
# Find maximum value in each row (numeric columns only)
df_numeric = df[['Age', 'Salary']]
max_per_row = df_numeric.apply(lambda row: row.max(), axis=1)
print(max_per_row)
Lambda Functions Explained
A lambda is a concise way to define an anonymous, single-expression function, which makes it a natural fit for short apply() calls.
Lambda Syntax
# Basic lambda
lambda x: x * 2
# Lambda with multiple arguments
lambda x, y: x + y
# Lambda with conditions
lambda x: 'Even' if x % 2 == 0 else 'Odd'
# In apply() context:
df['Age'].apply(lambda x: x * 2) # Double each age
Lambda vs Named Function
# Using lambda (concise)
df['Age'].apply(lambda x: x ** 2)
# Using named function (clearer for complex logic)
def square(x):
    return x ** 2
df['Age'].apply(square) # Same result
Lambda with Conditions
# Convert age to age group
df['Age_Group'] = df['Age'].apply(
    lambda x: 'Young' if x < 30 else ('Middle' if x < 40 else 'Senior')
)
print(df)
Lambda with Multiple Conditions
# Complex categorization
df['Category'] = df.apply(
    lambda row: 'High Pay Senior' if row['Salary'] > 60000 and row['Age'] > 35
    else ('High Pay Junior' if row['Salary'] > 60000 else 'Standard'),
    axis=1
)
💡 Lambda Tips:
- Lambda is great for short, simple operations
- Use def for complex multi-line logic
- Lambda functions have access to variables in scope
- Can’t include statements, only expressions
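The last two tips can be illustrated with a short sketch. The variable `senior_cutoff` and the function `describe` are hypothetical names introduced here; the point is only that a lambda can read a variable from the enclosing scope, while logic that needs statements has to move into a `def`.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40], name="Age")

# Hypothetical threshold defined outside the lambda
senior_cutoff = 35

# The lambda reads senior_cutoff from the enclosing scope
is_senior = ages.apply(lambda age: age > senior_cutoff)

# Assignments and if/else blocks are statements, so they need a named function
def describe(age):
    if age > senior_cutoff:
        label = "Senior"
    else:
        label = "Not senior"
    return label

labels = ages.apply(describe)
print(is_senior.tolist(), labels.tolist())
```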
Custom Functions with apply()
Single Parameter Functions
# Define custom function
def classify_age(age):
    if age < 25:
        return 'Young'
    elif age < 35:
        return 'Mid-Career'
    else:
        return 'Senior'
# Apply to column
df['Age_Category'] = df['Age'].apply(classify_age)
print(df)
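Threshold-based classification like this can usually also be expressed with `pd.cut`, which is a different technique from `apply()` and is shown here only as a possible alternative. The sketch assumes the same cut-offs as `classify_age`; `via_apply` and `via_cut` are illustrative names.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 30, 35, 40], name="Age")

def classify_age(age):
    if age < 25:
        return "Young"
    elif age < 35:
        return "Mid-Career"
    else:
        return "Senior"

via_apply = ages.apply(classify_age)

# right=False makes each bin closed on the left: [-inf, 25), [25, 35), [35, inf)
via_cut = pd.cut(
    ages,
    bins=[-np.inf, 25, 35, np.inf],
    labels=["Young", "Mid-Career", "Senior"],
    right=False,
)

print(via_apply.tolist())
print(via_cut.astype(str).tolist())
```

Note that `pd.cut` returns a categorical Series rather than plain strings, so it is converted with `astype(str)` before comparing.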
Multi-Parameter Functions with args
# Function with additional parameters
def scale_salary(salary, multiplier=1.1):
    return salary * multiplier
# Apply with args parameter
df['Scaled_Salary'] = df['Salary'].apply(scale_salary, args=(1.2,))
# Or with kwargs
df['Scaled_Salary'] = df['Salary'].apply(scale_salary, multiplier=1.15)
Row-Based Custom Function
# Function using entire row
def calculate_bonus(row):
    base_bonus = row['Salary'] * 0.1  # 10% base
    if row['Age'] > 30:
        base_bonus *= 1.2  # 20% extra for experience
    return base_bonus
df['Bonus'] = df.apply(calculate_bonus, axis=1)
print(df[['Name', 'Salary', 'Bonus']])
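A row function can also produce more than one value. As a sketch, assuming the same bonus rules, the function below returns a Series so that `apply()` expands the result into several columns. The column names `Base_Bonus` and `Experience_Bonus` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "Salary": [50000, 75000, 60000, 55000],
})

def bonus_breakdown(row):
    base = row["Salary"] * 0.1                        # 10% base bonus
    extra = base * 0.2 if row["Age"] > 30 else 0.0    # 20% of base for age > 30
    # Returning a Series makes apply() build one output column per label
    return pd.Series({"Base_Bonus": base, "Experience_Bonus": extra})

breakdown = df.apply(bonus_breakdown, axis=1)

# Attach the new columns to the original frame by index
df_with_bonus = df.join(breakdown)
print(df_with_bonus)
```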
Apply with Multiple Columns
Access Multiple Columns in apply()
# Create salary-to-age ratio
df['Salary_Per_Year_Age'] = df.apply(
    lambda row: row['Salary'] / row['Age'],
    axis=1
)
# String concatenation from multiple columns
df['Full_Info'] = df.apply(
    lambda row: f"{row['Name']} ({row['Age']} years old)",
    axis=1
)
print(df)
Conditional Logic Across Columns
# Complex logic using multiple columns
df['Performance'] = df.apply(
    lambda row: 'Excellent' if row['Salary'] > 70000 and row['Age'] > 30
    else ('Good' if row['Salary'] > 50000 else 'Average'),
    axis=1
)
Apply to Subset of Columns
# Apply only to numeric columns
numeric_cols = df.select_dtypes(include=['number'])
result = numeric_cols.apply(lambda x: x.mean())
print(result)
applymap vs apply vs map
| Method | Target | Use Case | Example |
|---|---|---|---|
| apply() | Rows/Columns | Transform along axis | df.apply(lambda x: x.sum()) |
| map() | Series elements | Transform individual values | df['col'].map(lambda x: x*2) |
| applymap() | DataFrame elements | Format all values | df.applymap(lambda x: f'{x:.2f}') |
map() for Series
# Map with a dictionary (values missing from the dictionary, such as 40, become NaN)
status_map = {25: 'Young', 30: 'Mid', 35: 'Senior'}
df['Status'] = df['Age'].map(status_map)
# Map with function
df['Age_Double'] = df['Age'].map(lambda x: x * 2)
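One detail worth seeing in action: when `map()` is given a dictionary, any value without a matching key becomes NaN. A minimal sketch, where the fallback label `'Unknown'` is an assumption made for illustration:

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40], name="Age")
status_map = {25: "Young", 30: "Mid", 35: "Senior"}

# 40 has no key in status_map, so it maps to NaN
mapped = ages.map(status_map)

# Replace the missing results with an explicit (hypothetical) fallback label
filled = mapped.fillna("Unknown")

print(mapped.tolist())
print(filled.tolist())
```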
applymap() for DataFrame (Element-wise)
# Format all numeric values to 2 decimal places
df_formatted = df.applymap(lambda x: f'{x:.2f}' if isinstance(x, (int, float)) else x)
💡 Note: In pandas 2.1+, applymap() is deprecated. Use map() for Series or DataFrame.map() for DataFrames instead.
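For code that has to run on both older and newer pandas versions, one possible approach is to pick the element-wise method at runtime. This is a sketch, not an official migration recipe; `fmt` and `elementwise` are names introduced here.

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 30], "Salary": [50000.0, 75000.5]})

def fmt(x):
    # Format a single numeric value with two decimal places
    return f"{x:.2f}"

# DataFrame.map is available from pandas 2.1; older versions only have applymap
elementwise = df.map if hasattr(df, "map") else df.applymap

formatted = elementwise(fmt)
print(formatted)
```

Every cell in `formatted` is now a string, so this kind of formatting is best done at the very end, just before display or export.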
Vectorization – The Better Alternative
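In most cases, vectorized operations are far faster than apply(), often by one to two orders of magnitude (10-100x) on large datasets, because the work runs in optimized compiled code instead of calling a Python function for every element.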
Simple Arithmetic
# ❌ SLOW - using apply with lambda
df['Salary_Double'] = df['Salary'].apply(lambda x: x * 2)
# ✅ FAST - vectorized operation
df['Salary_Double'] = df['Salary'] * 2
String Operations
# ❌ SLOW - using apply
df['Name_Upper'] = df['Name'].apply(lambda x: x.upper())
# ✅ FAST - vectorized
df['Name_Upper'] = df['Name'].str.upper()
Conditional Logic with np.where() and np.select()
import numpy as np
# ❌ SLOW - using apply
df['Status'] = df['Salary'].apply(
    lambda x: 'High' if x > 60000 else 'Low'
)
# ✅ FAST - vectorized with np.where
df['Status'] = np.where(df['Salary'] > 60000, 'High', 'Low')
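# ✅ Even better for multiple conditions - np.select
# Conditions are checked in order; rows matching none of them get the default.
# A string default is given explicitly because the implicit default is 0, which clashes with string choices.
conditions = [df['Salary'] > 70000, df['Salary'] > 50000]
values = ['Very High', 'High']
df['Status'] = np.select(conditions, values, default='Low')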
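Row-wise logic that combines several columns can often be vectorized the same way. As a sketch, the age-dependent bonus rule used earlier with `apply(axis=1)` can be written with `np.where`; `bonus_apply` and `bonus_vec` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "Salary": [50000, 75000, 60000, 55000],
})

def calculate_bonus(row):
    base_bonus = row["Salary"] * 0.1   # 10% base
    if row["Age"] > 30:
        base_bonus *= 1.2              # 20% extra when Age > 30
    return base_bonus

# Row-by-row version
bonus_apply = df.apply(calculate_bonus, axis=1)

# Vectorized version: pick the multiplier per row, then multiply whole columns
bonus_vec = df["Salary"] * 0.1 * np.where(df["Age"] > 30, 1.2, 1.0)

print(np.allclose(bonus_apply, bonus_vec))
```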
When to Avoid apply()
# ❌ DON'T use apply for these (too slow):
df.apply(lambda x: x * 2) # Use df * 2
df.apply(lambda x: x.sum()) # Use df.sum()
df.apply(lambda x: x > 5) # Use df > 5
df['Col'].apply(lambda x: x.upper()) # Use df['Col'].str.upper()
# ✅ OK to use apply() for:
df.apply(custom_complex_function) # When custom logic can't be vectorized
df.apply(lambda row: row_specific_logic, axis=1) # Row-specific calculations
Performance Optimization
🚀 Speed Up apply()
1. Use Vectorization First
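# Illustrative benchmark, 1M rows
# (actual timings vary with hardware, pandas version, and the function applied)
# apply(): ~50 seconds
# Vectorized: ~0.5 seconds
# Roughly 100x faster with vectorization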
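Since timings depend heavily on the machine and the function, it is worth measuring on your own data. The sketch below uses `timeit` on a synthetic 100,000-row column; the frame name `big` and the row count are arbitrary choices made for illustration, and no particular speedup is implied.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (hypothetical): 100,000 random salaries
rng = np.random.default_rng(0)
big = pd.DataFrame({"Salary": rng.integers(30_000, 120_000, size=100_000)})

# Time three runs of each approach
t_apply = timeit.timeit(lambda: big["Salary"].apply(lambda x: x * 2), number=3)
t_vec = timeit.timeit(lambda: big["Salary"] * 2, number=3)

# Both approaches should produce identical values
same = bool((big["Salary"].apply(lambda x: x * 2) == big["Salary"] * 2).all())

print(f"apply: {t_apply:.4f}s  vectorized: {t_vec:.4f}s  same result: {same}")
```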
2. Use numba for Complex Calculations
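from numba import jit

@jit(nopython=True)
def complex_calc(x):
    # Sum of squares of 0..x-1, compiled to machine code by numba
    result = 0
    for i in range(x):
        result += i ** 2
    return result
# The compiled loop is typically much faster than running the same Python loop through apply()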
3. Use Cython for Critical Code
# For functions called millions of times,
# consider Cython compilation (advanced)
4. Minimize I/O in Functions
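# ❌ SLOWER - function depends on outside state that is looked up on every call
def slow_func(x):
    return x * global_variable
# ✅ BETTER - function is self-contained; shared values are prepared once and passed in
def fast_func(x, factor):
    return x * factor
df['Salary'].apply(fast_func, args=(2,))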
Common Mistakes to Avoid
⚠️ Mistake #1: Using apply() When Vectorization Works
# ❌ WRONG - works, but can be far slower on large data
df['Result'] = df['Salary'].apply(lambda x: x * 1.1)
# ✅ CORRECT - vectorized
df['Result'] = df['Salary'] * 1.1
⚠️ Mistake #2: Forgetting axis Parameter
# ❌ UNCLEAR - relies on the default axis=0, so the function receives columns, not rows
result = df.apply(lambda x: x + 1)  # axis=0 by default
# ✅ CORRECT - specify axis explicitly
result = df.apply(lambda x: x + 1, axis=0) # Columns
result = df.apply(lambda row: row['Col1'] + row['Col2'], axis=1) # Rows
⚠️ Mistake #3: Not Returning Values
# ❌ WRONG - function doesn't return
def update_salary(salary):
    salary * 1.1  # Missing return!
# ✅ CORRECT
def update_salary(salary):
    return salary * 1.1
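The effect of a missing `return` is easy to overlook because no error is raised: Python functions return `None` by default, so the result is a column of missing values. A small sketch, where `update_salary_broken` is a name introduced here to keep both versions side by side:

```python
import pandas as pd

salaries = pd.Series([50000, 75000], name="Salary")

def update_salary_broken(salary):
    salary * 1.1  # result is computed and then discarded

def update_salary(salary):
    return salary * 1.1

broken = salaries.apply(update_salary_broken)  # every element is None
fixed = salaries.apply(update_salary)

print(broken.tolist())
print(fixed.tolist())
```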
⚠️ Mistake #4: Not Handling Different Data Types
# ❌ WRONG - no error is raised, but strings are silently repeated ('Alice' -> 'AliceAlice')
df.apply(lambda x: x * 2)
# ✅ CORRECT - check data type
df.apply(lambda x: x * 2 if pd.api.types.is_numeric_dtype(x) else x)
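Another way to handle mixed data types, sketched below, is to select the numeric columns up front and operate only on those, which avoids `apply()` entirely. The name `doubled` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "Salary": [50000, 75000, 60000, 55000],
})

# Column labels of all numeric columns
numeric_cols = df.select_dtypes(include="number").columns

# Work on a copy so the original frame stays unchanged
doubled = df.copy()
doubled[numeric_cols] = doubled[numeric_cols] * 2

print(doubled)
```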
apply() Mastery
You now understand apply() and data transformation:
- apply() basics: Transform rows or columns with functions
- Lambda functions: Quick anonymous functions for simple operations
- Custom functions: Define functions for complex logic
- Multiple columns: Access multiple columns in transformations
- Alternatives: map() for Series, applymap()/map() for elements
- Vectorization: Preferred for performance whenever it is possible (often 10-100x faster on large data)
- Performance: Use numba/Cython for critical code
Key takeaway: Master vectorization first, use apply() only when necessary for complex custom logic!
