Pandas apply: Transform Data with Functions (Complete Guide)

What is apply()?

The apply() method applies a function along an axis (rows or columns) of a DataFrame. It’s a powerful tool for data transformation when built-in methods aren’t sufficient.

When to use apply():

Transform data with custom logic that pandas doesn’t provide
Apply the same operation to every row or column
Conditional transformations based on multiple columns
Convert data types or formats
Create new calculated columns

Key variants:

apply(): Apply a function to rows or columns of a DataFrame, or to each value of a Series
applymap(): Apply a function to each element of a DataFrame (deprecated since pandas 2.1; use DataFrame.map())
map(): Apply a function, dict, or Series to each element of a Series

⚠️ Performance Warning: apply() can be slow on large datasets because it calls a Python function once per row, column, or element. Vectorized operations are almost always faster.

Basic Syntax and Usage

import pandas as pd
import numpy as np

# Create sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 75000, 60000, 55000]
})

print("Original DataFrame:")
print(df)

# Basic syntax:
# df.apply(function, axis=0) # axis=0: columns, axis=1: rows

Understanding axis parameter

# axis=0: Apply function to each column (top to bottom)
# axis=1: Apply function to each row (left to right)

# Example: Get column maximums
df.apply(lambda x: x.max(), axis=0)

# Example: Get row sums (numeric columns only; a row that includes 'Name' would mix str and int)
df[['Age', 'Salary']].apply(lambda x: x.sum(), axis=1)
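To make the axis semantics concrete, here is a minimal sketch on a small numeric frame. The `scores` frame and its column names are made up for illustration. With axis=0 the function receives one column at a time and the result is indexed by column name; with axis=1 it receives one row at a time and the result is indexed by row label.

```python
import pandas as pd

# Hypothetical numeric data, used only for this illustration
scores = pd.DataFrame({'math': [90, 70, 80], 'art': [60, 85, 75]})

# axis=0: the function sees one column at a time -> one result per column
col_max = scores.apply(lambda col: col.max(), axis=0)

# axis=1: the function sees one row at a time -> one result per row
row_sum = scores.apply(lambda row: row.sum(), axis=1)

print(col_max)  # indexed by column name: math, art
print(row_sum)  # indexed by row label: 0, 1, 2
```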

Apply to Columns (axis=0)

Simple Column Operation

# Get the maximum value for each column
result = df.apply(lambda x: x.max())
print(result)

Output:

Name      David
Age          40
Salary    75000
dtype: object

Multiple Statistics

# Get the range (max - min) of each numeric column; other columns get 'N/A'
df.apply(lambda x: x.max() - x.min() if x.dtype in ['int64', 'float64'] else 'N/A')

Column-Specific Functions

# Different function per column
def process_column(series):
    if pd.api.types.is_numeric_dtype(series): # Numeric column
        return series * 1.1 # Increase by 10%
    else: # String column
        return series.str.upper()

result = df.apply(process_column)
print(result)

Apply to Rows (axis=1)

Simple Row Operation

# Add Age and Salary for each row
df['Total'] = df.apply(lambda row: row['Age'] + row['Salary'], axis=1)
print(df)

Create New Columns from Multiple Existing Columns

# Create 'Senior' column based on Age
df['Senior'] = df.apply(lambda row: row['Age'] > 35, axis=1)

# Create 'Status' based on multiple conditions
df['Status'] = df.apply(
    lambda row: 'High Earner' if row['Salary'] > 60000 else 'Standard',
    axis=1
)
print(df)

Output:

      Name  Age  Salary  Total  Senior       Status
0    Alice   25   50000  50025   False     Standard
1      Bob   30   75000  75030   False  High Earner
2  Charlie   35   60000  60035   False     Standard
3    David   40   55000  55040    True     Standard

Row Maximum Value

# Find maximum value in each row (numeric columns only)
df_numeric = df[['Age', 'Salary']]
max_per_row = df_numeric.apply(lambda row: row.max(), axis=1)
print(max_per_row)

Lambda Functions Explained

A lambda is shorthand for a small anonymous function, which makes it a natural fit for apply().

Lambda Syntax

# Basic lambda
lambda x: x * 2

# Lambda with multiple arguments
lambda x, y: x + y

# Lambda with conditions
lambda x: 'Even' if x % 2 == 0 else 'Odd'

# In apply() context:
df['Age'].apply(lambda x: x * 2) # Double each age

Lambda vs Named Function

# Using lambda (concise)
df['Age'].apply(lambda x: x ** 2)

# Using named function (clearer for complex logic)
def square(x):
    return x ** 2

df['Age'].apply(square) # Same result

Lambda with Conditions

# Convert age to age group
df['Age_Group'] = df['Age'].apply(
    lambda x: 'Young' if x < 30 else ('Middle' if x < 40 else 'Senior')
)
print(df)

Lambda with Multiple Conditions

# Complex categorization
df['Category'] = df.apply(
    lambda row: 'High Pay Senior' if row['Salary'] > 60000 and row['Age'] > 35
    else ('High Pay Junior' if row['Salary'] > 60000 else 'Standard'),
    axis=1
)

💡 Lambda Tips:

Lambda is great for short, simple operations
Use def for complex multi-line logic
Lambda functions can read variables from the enclosing scope
Lambdas can’t contain statements, only expressions
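Two of these tips can be sketched as follows. The names `ages`, `threshold`, and `label_age` are illustrative only: the lambda reads `threshold` from the surrounding scope, while logic that needs assignments and if/elif blocks moves into a def.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40])  # illustrative data

# A lambda can read variables from the enclosing scope
threshold = 30
over_threshold = ages.apply(lambda x: x > threshold)

# Statements (assignments, if/elif blocks) need a def
def label_age(x):
    if x < 30:
        label = 'Young'
    elif x < 40:
        label = 'Middle'
    else:
        label = 'Senior'
    return label

labels = ages.apply(label_age)
print(over_threshold.tolist())
print(labels.tolist())
```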

Custom Functions with apply()

Single Parameter Functions

# Define custom function
def classify_age(age):
    if age < 25:
        return 'Young'
    elif age < 35:
        return 'Mid-Career'
    else:
        return 'Senior'

# Apply to column
df['Age_Category'] = df['Age'].apply(classify_age)
print(df)

Multi-Parameter Functions with args

# Function with additional parameters
def scale_salary(salary, multiplier=1.1):
    return salary * multiplier

# Apply with args parameter
df['Scaled_Salary'] = df['Salary'].apply(scale_salary, args=(1.2,))

# Or with kwargs
df['Scaled_Salary'] = df['Salary'].apply(scale_salary, multiplier=1.15)
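The same mechanism works for row-wise functions on a DataFrame. The sketch below assumes a hypothetical `bonus` rule and a tiny `people` frame; neither comes from the examples above. Positional extras go into `args` as a tuple, and keyword extras are forwarded to the function unchanged.

```python
import pandas as pd

people = pd.DataFrame({'Age': [25, 40], 'Salary': [50000, 55000]})  # illustrative

def bonus(row, rate, min_age=30):
    # Hypothetical rule: pay `rate` of salary only from `min_age` upward
    return row['Salary'] * rate if row['Age'] >= min_age else 0.0

# Positional extras go in args (a tuple); keyword extras pass straight through
by_args = people.apply(bonus, axis=1, args=(0.1,))
by_kwargs = people.apply(bonus, axis=1, rate=0.1, min_age=20)

print(by_args.tolist())
print(by_kwargs.tolist())
```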

Row-Based Custom Function

# Function using entire row
def calculate_bonus(row):
    base_bonus = row['Salary'] * 0.1 # 10% base
    if row['Age'] > 30:
        base_bonus *= 1.2 # 20% extra for experience
    return base_bonus

df['Bonus'] = df.apply(calculate_bonus, axis=1)
print(df[['Name', 'Salary', 'Bonus']])
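Row-wise rules like this one can often be rewritten with column arithmetic. As a sketch, assuming the same sample data, the bonus multiplier can be chosen per row with np.where and the result compared against the apply() version:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 75000, 60000, 55000]
})

# Row-wise version from the text
def calculate_bonus(row):
    base_bonus = row['Salary'] * 0.1
    if row['Age'] > 30:
        base_bonus *= 1.2
    return base_bonus

bonus_apply = df.apply(calculate_bonus, axis=1)

# Vectorized sketch: pick the multiplier per row with np.where
multiplier = np.where(df['Age'] > 30, 1.2, 1.0)
bonus_vec = df['Salary'] * 0.1 * multiplier

print(np.allclose(bonus_apply, bonus_vec))
```

Both versions compute the same values; the vectorized one avoids calling a Python function for every row.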

Apply with Multiple Columns

Access Multiple Columns in apply()

# Create salary-to-age ratio
df['Salary_Per_Year_Age'] = df.apply(
    lambda row: row['Salary'] / row['Age'],
    axis=1
)

# String concatenation from multiple columns
df['Full_Info'] = df.apply(
    lambda row: f"{row['Name']} ({row['Age']} years old)",
    axis=1
)
print(df)

Conditional Logic Across Columns

# Complex logic using multiple columns
df['Performance'] = df.apply(
    lambda row: 'Excellent' if row['Salary'] > 70000 and row['Age'] > 30
    else ('Good' if row['Salary'] > 50000 else 'Average'),
    axis=1
)

Apply to Subset of Columns

# Apply only to numeric columns
numeric_cols = df.select_dtypes(include=['number'])
result = numeric_cols.apply(lambda x: x.mean())
print(result)

applymap vs apply vs map

Method	Target	Use Case	Example
apply()	Rows/Columns	Transform along axis	df.apply(lambda x: x.sum())
map()	Series elements	Transform individual values	df['col'].map(lambda x: x*2)
applymap()	DataFrame elements	Format all values	df.applymap(lambda x: f'{x:.2f}')

map() for Series

# Map with a dictionary (values without a key, such as 40, become NaN)
status_map = {25: 'Young', 30: 'Mid', 35: 'Senior'}
df['Status'] = df['Age'].map(status_map)

# Map with function
df['Age_Double'] = df['Age'].map(lambda x: x * 2)
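One detail worth seeing in isolation is what happens to values that have no key in the dictionary. This small sketch uses a standalone `ages` Series, and the `'Unknown'` label is an arbitrary placeholder chosen for illustration.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40])
status_map = {25: 'Young', 30: 'Mid', 35: 'Senior'}

# 40 has no key in status_map, so it maps to NaN
mapped = ages.map(status_map)

# Fill the gaps with a fallback label ('Unknown' is an arbitrary choice)
filled = ages.map(status_map).fillna('Unknown')

print(mapped.isna().tolist())
print(filled.tolist())
```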

applymap() for DataFrame (Element-wise)

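# Format all numeric values to 2 decimal places (booleans are left as-is)
# On pandas 2.1+, use df.map(...) here instead of the deprecated df.applymap(...)
df_formatted = df.applymap(
    lambda x: f'{x:.2f}' if isinstance(x, (int, float)) and not isinstance(x, bool) else x
)

💡 Note: DataFrame.applymap() is deprecated since pandas 2.1 in favor of DataFrame.map(). Series.map() is unaffected.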
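If the same code has to run on pandas versions both before and after 2.1, one possible approach is to check for the new method at runtime. This is a sketch under that assumption, using a made-up `prices` frame and a hypothetical `fmt` helper.

```python
import pandas as pd

prices = pd.DataFrame({'a': [1.5, 2.25], 'b': [3.0, 4.5]})  # illustrative

def fmt(x):
    # Format a single value with two decimal places
    return f'{x:.2f}'

# DataFrame.map exists from pandas 2.1; older versions only have applymap
if hasattr(prices, 'map'):
    formatted = prices.map(fmt)
else:
    formatted = prices.applymap(fmt)

print(formatted)
```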

Vectorization – The Better Alternative

In most cases, vectorized operations are much faster than apply(), often by a factor of 10-100 depending on the data and the function.

Simple Arithmetic

# ❌ SLOW - using apply with lambda
df['Salary_Double'] = df['Salary'].apply(lambda x: x * 2)

# ✅ FAST - vectorized operation
df['Salary_Double'] = df['Salary'] * 2

String Operations

# ❌ SLOW - using apply
df['Name_Upper'] = df['Name'].apply(lambda x: x.upper())

# ✅ FAST - vectorized
df['Name_Upper'] = df['Name'].str.upper()

Conditional Logic with np.where() and np.select()

import numpy as np

# ❌ SLOW - using apply
df['Status'] = df['Salary'].apply(
    lambda x: 'High' if x > 60000 else 'Low'
)

# ✅ FAST - vectorized with np.where
df['Status'] = np.where(df['Salary'] > 60000, 'High', 'Low')

# ✅ Even better for multiple conditions - np.select (first matching condition wins)
conditions = [df['Salary'] > 70000, df['Salary'] > 50000]
values = ['Very High', 'High']
df['Status'] = np.select(conditions, values, default='Low') # a string default avoids mixing str choices with the numeric default 0
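np.select checks the conditions in order and takes the first match for each row, so the order of the list matters. A minimal sketch on a standalone `salary` Series (same figures as the sample data) shows this, with `default` covering rows that match nothing:

```python
import numpy as np
import pandas as pd

salary = pd.Series([50000, 75000, 60000, 55000])

# Conditions are checked in order; the first True one wins for each row
conditions = [salary > 70000, salary > 50000]
choices = ['Very High', 'High']

# default covers rows where no condition matched
status = np.select(conditions, choices, default='Low')
print(status.tolist())
```

Putting `salary > 50000` first would label 75000 as 'High', because that condition would match before the stricter one is reached.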

When to Avoid apply()

# ❌ DON'T use apply for these (too slow):
df.apply(lambda x: x * 2) # Use df * 2
df.apply(lambda x: x.sum()) # Use df.sum()
df.apply(lambda x: x > 5) # Use df > 5
df['Col'].apply(lambda x: x.upper()) # Use df['Col'].str.upper()

# ✅ OK to use apply() for (the function names below are placeholders):
df.apply(custom_complex_function) # When custom logic can't be vectorized
df.apply(lambda row: row_specific_logic(row), axis=1) # Row-specific calculations

Performance Optimization

🚀 Speed Up apply()

1. Use Vectorization First

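# Rough illustration for 1M rows (actual timings depend on
# hardware, pandas version, and the function applied):
# apply(): ~50 seconds
# Vectorized: ~0.5 seconds

# Roughly 100x faster with vectorization in this illustration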
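Timings like these are best measured on your own machine. The following is a minimal sketch using the standard-library timeit module on synthetic data; the row count, the seed, and the function names `with_apply` and `vectorized` are arbitrary choices for illustration.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data; 100_000 rows keeps the run short
rng = np.random.default_rng(0)
data = pd.DataFrame({'Salary': rng.integers(30_000, 100_000, size=100_000)})

def with_apply():
    # One Python-level call per value
    return data['Salary'].apply(lambda x: x * 1.1)

def vectorized():
    # A single operation over the whole column
    return data['Salary'] * 1.1

t_apply = timeit.timeit(with_apply, number=3)
t_vec = timeit.timeit(vectorized, number=3)

print(f'apply():    {t_apply:.4f} s')
print(f'vectorized: {t_vec:.4f} s')
```

The ratio between the two numbers will vary between machines and pandas versions, so treat any single figure as indicative only.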

2. Use numba for Complex Calculations

from numba import jit

@jit(nopython=True)
def complex_calc(x):
    result = 0
    for i in range(x):
        result += i ** 2
    return result

# Compiling speeds up the loop inside the function;
# apply() still calls it once per value

3. Use Cython for Critical Code

# For functions called millions of times,
# consider Cython compilation (advanced)

4. Minimize I/O in Functions

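# ❌ SLOW - does I/O on every call ('rates.csv' is a hypothetical file)
def slow_func(x):
    rate = pd.read_csv('rates.csv')['rate'].iloc[0]
    return x * rate

# ✅ FAST - read once outside the function, then apply pure computation
rate = pd.read_csv('rates.csv')['rate'].iloc[0]
df['Salary'].apply(lambda x: x * rate)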

Common Mistakes to Avoid

⚠️ Mistake #1: Using apply() When Vectorization Works

# ❌ SLOW - one Python call per value
df['Result'] = df['Salary'].apply(lambda x: x * 1.1)

# ✅ CORRECT - vectorized
df['Result'] = df['Salary'] * 1.1

⚠️ Mistake #2: Forgetting axis Parameter

# ❌ UNCLEAR - no axis given, so the function receives whole columns (axis=0 is the default)
result = df[['Age', 'Salary']].apply(lambda x: x.max() - x.min())

# ✅ CORRECT - specify axis explicitly
result = df[['Age', 'Salary']].apply(lambda col: col.max() - col.min(), axis=0) # One value per column
result = df.apply(lambda row: row['Salary'] / row['Age'], axis=1) # One value per row

⚠️ Mistake #3: Not Returning Values

# ❌ WRONG - function doesn't return
def update_salary(salary):
    salary * 1.1 # Missing return!

# ✅ CORRECT
def update_salary(salary):
    return salary * 1.1
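The effect of the missing return is easy to overlook because no error is raised. In this sketch (with the broken variant renamed `update_salary_broken` so both can coexist), the broken function yields a column of None values:

```python
import pandas as pd

salaries = pd.Series([50000, 75000])

def update_salary_broken(salary):
    salary * 1.1  # computed and discarded; the function returns None

def update_salary(salary):
    return salary * 1.1

broken = salaries.apply(update_salary_broken)
fixed = salaries.apply(update_salary)

print(broken.isna().all())  # every value is missing
print(fixed.tolist())
```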

⚠️ Mistake #4: Not Handling Different Data Types

# ❌ WRONG - assumes every column is numeric
df.apply(lambda x: x * 2) # Object-dtype string columns get repeated text ('AliceAlice'), not an error

# ✅ CORRECT - check data type
df.apply(lambda x: x * 2 if pd.api.types.is_numeric_dtype(x) else x)
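As a self-contained sketch of a dtype check, the example below uses is_numeric_dtype from pandas.api.types on a made-up `mixed` frame. This helper also covers integer and float widths other than int64 and float64.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

mixed = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})  # illustrative

# Double numeric columns, pass everything else through unchanged
doubled = mixed.apply(lambda col: col * 2 if is_numeric_dtype(col) else col)

print(doubled)
```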

apply() Mastery

You now understand apply() and data transformation:

apply() basics: Transform rows or columns with functions
Lambda functions: Quick anonymous functions for simple operations
Custom functions: Define functions for complex logic
Multiple columns: Access multiple columns in transformations
Alternatives: map() for Series, DataFrame.map() (formerly applymap()) for elements
Vectorization: Preferred for performance whenever it applies (often 10-100x faster)
Performance: Use numba/Cython for critical code

Key takeaway: Master vectorization first, use apply() only when necessary for complex custom logic!