What is fillna?
The fillna() method is one of the most widely used pandas functions for data cleaning. It replaces NaN (Not a Number) and other missing values, such as None, with a value you choose: a single scalar, a per-column mapping, or values taken from another Series or DataFrame.
Why is this important?
- Many pandas operations silently skip or propagate missing values, which can distort results
- Many machine learning algorithms cannot handle NaN values
- Data analysis becomes unreliable with incomplete data
- fillna() is the primary solution for data imputation
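To make the points above concrete, here is a minimal sketch of how NaN behaves in a few common operations. The Series and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, np.nan, 30, np.nan])

# Reductions skip NaN by default, so the result describes only the known values
mean_skipping_nan = ages.mean()            # computed from 25 and 30 only

# With skipna=False the missing values propagate into the result
mean_with_nan = ages.mean(skipna=False)    # NaN

# Element-wise arithmetic propagates NaN
ages_plus_one = ages + 1

# Count missing values before deciding how to fill them
missing_count = ages.isna().sum()
print(mean_skipping_nan, mean_with_nan, missing_count)
```

Counting missing values first with isna().sum() is a cheap way to see how much of a column a fill strategy would actually invent.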
Common use cases:
- Fill missing ages with mean age
- Fill missing values with previous observation (forward fill)
- Fill missing values with next observation (backward fill)
- Fill missing values with interpolated values (for time series)
- Fill different columns with different values
Basic Syntax & Examples
Simple fillna with Scalar Value
The simplest way to fill missing values is with a single value:
import pandas as pd
import numpy as np

# Create sample data with missing values
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 30, np.nan],
    'Salary': [50000, 60000, np.nan, 70000]
})

print("Original DataFrame:")
print(df)

# Fill missing values with 0
df_filled = df.fillna(0)
print("\nAfter fillna(0):")
print(df_filled)
Output:
Original DataFrame:
      Name   Age   Salary
0    Alice  25.0  50000.0
1      Bob   NaN  60000.0
2  Charlie  30.0      NaN
3    David   NaN  70000.0

After fillna(0):
      Name   Age   Salary
0    Alice  25.0  50000.0
1      Bob   0.0  60000.0
2  Charlie  30.0      0.0
3    David   0.0  70000.0
Note: fillna() returns a new DataFrame by default. Pass inplace=True to modify the original.
Filling with Scalar Values
Fill All NaN with Same Value
# Fill with the mean value
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)

# Fill with string value
df['Name'] = df['Name'].fillna('Unknown')
Fill with Different Values per Column
fill_values = {
    'Age': df['Age'].mean(),          # Mean age
    'Salary': df['Salary'].median(),  # Median salary
    'Name': 'Unknown'
}

df_filled = df.fillna(fill_values)
print(df_filled)
Output:
      Name   Age   Salary
0    Alice  25.0  50000.0
1      Bob  27.5  60000.0
2  Charlie  30.0  60000.0
3    David  27.5  70000.0
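One detail worth seeing in code: a dictionary fill touches only the columns it names. The sketch below uses a small made-up DataFrame and leaves one column out of the mapping on purpose.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', None, 'Charlie'],
    'Age': [25, np.nan, 30],
    'Salary': [50000, np.nan, 70000],
})

# Only 'Name' and 'Age' are listed, so 'Salary' keeps its NaN
partial = df.fillna({'Name': 'Unknown', 'Age': df['Age'].mean()})

# Per-column count of values that are still missing
remaining = partial.isna().sum()
print(remaining)
```

This makes a dictionary fill a reasonable way to impute some columns while deliberately leaving others for a later step.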
Filling with Methods (ffill & bfill)
Forward Fill (ffill) – Propagate Last Value
Forward fill takes the last valid observation and propagates it forward:
df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04'],
    'Status': ['Active', np.nan, np.nan, 'Inactive']
})

# Forward fill
df_ffill = df.ffill()
print(df_ffill)

# Older spelling, deprecated since pandas 2.1:
# df_ffill = df.fillna(method='ffill')
Output:
         Date    Status
0  2024-01-01    Active
1  2024-01-02    Active  # Filled from previous
2  2024-01-03    Active  # Filled from previous
3  2024-01-04  Inactive
Backward Fill (bfill) – Propagate Next Value
Backward fill takes the next valid observation and propagates it backward:
# Backward fill
df_bfill = df.bfill()
print(df_bfill)

# Older spelling, deprecated since pandas 2.1:
# df_bfill = df.fillna(method='bfill')
Output:
         Date    Status
0  2024-01-01    Active
1  2024-01-02  Inactive  # Filled from next
2  2024-01-03  Inactive  # Filled from next
3  2024-01-04  Inactive
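Neither direction can fill a gap at the edge it starts from: forward fill leaves leading NaN in place, and backward fill leaves trailing NaN. A small sketch with a made-up Series shows this, plus one common workaround of chaining the two.

```python
import numpy as np
import pandas as pd

status = pd.Series([np.nan, 'Active', np.nan, 'Inactive', np.nan])

forward = status.ffill()   # first value stays NaN: nothing before it to copy
backward = status.bfill()  # last value stays NaN: nothing after it to copy

# Chaining fills the interior forward, then patches the leading gap backward
both = status.ffill().bfill()
print(both.tolist())
```

Whether chaining is appropriate depends on the data, since the backward step copies a later observation into an earlier row.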
Filling Column-Specific Values with Dictionary
Use a dictionary to fill different columns with different values:
df = pd.DataFrame({
    'Product': ['A', np.nan, 'C', np.nan],
    'Price': [100, np.nan, 300, 400],
    'Quantity': [5, 10, np.nan, 20]
})

# Fill with specific values per column
fill_dict = {
    'Product': 'Unknown Product',
    'Price': df['Price'].mean(),
    'Quantity': 0
}

df_filled = df.fillna(fill_dict)
print(df_filled)
Output:
           Product       Price  Quantity
0                A  100.000000       5.0
1  Unknown Product  266.666667      10.0
2                C  300.000000       0.0
3  Unknown Product  400.000000      20.0
Advanced: Limit Fill with limit Parameter
The limit parameter controls how many consecutive NaN values to fill:
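df = pd.DataFrame({
    'Value': [1, np.nan, np.nan, np.nan, 5, np.nan, np.nan]
})

# Fill at most 2 consecutive NaN in the forward direction
# (older spelling, deprecated since pandas 2.1: df.fillna(method='ffill', limit=2))
df_limited = df.ffill(limit=2)
print(df_limited)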
Output:
   Value
0    1.0
1    1.0  # Filled (limit count: 1)
2    1.0  # Filled (limit count: 2)
3    NaN  # Not filled (limit exceeded)
4    5.0
5    5.0  # Filled (limit count: 1)
6    5.0  # Filled (limit count: 2)
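The limit is counted differently depending on how the fill is done. The sketch below, on a made-up Series, contrasts a forward fill (limit applies to each run of consecutive NaN) with a scalar fill (limit applies to the total number of NaN filled along the axis).

```python
import numpy as np
import pandas as pd

values = pd.Series([1, np.nan, np.nan, np.nan, 5, np.nan, np.nan])

# Forward fill: the count restarts after every valid value,
# so each gap gets up to 2 filled entries
per_gap = values.ffill(limit=2)

# Scalar fill: only the first 2 NaN in the whole Series are replaced
scalar_limited = values.fillna(0, limit=2)

print(per_gap.tolist())
print(scalar_limited.tolist())
```

If the goal is "fill short gaps but leave long ones visible", the forward-fill form is usually the one that matches that intent.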
Interpolation for Time Series Data
For numeric data with a logical progression, interpolation fills missing values based on a pattern:
df = pd.DataFrame({
    'Day': [1, 2, 3, 4, 5],
    'Temperature': [20, np.nan, np.nan, 35, 40]
})

# Linear interpolation
df['Temperature'] = df['Temperature'].interpolate(method='linear')
print(df)
Output:
   Day  Temperature
0    1         20.0
1    2         25.0  # Interpolated
2    3         30.0  # Interpolated
3    4         35.0
4    5         40.0
Available interpolation methods:
| Method | Description | Use Case |
|---|---|---|
| linear | Straight line between points | Most common, good default |
| polynomial | Polynomial of a given degree (requires order= and SciPy) | Non-linear relationships |
| nearest | Use nearest value (requires SciPy) | Categorical-like data |
| quadratic | Second-order spline (requires SciPy) | Smooth curves |
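For time series with unevenly spaced timestamps, one more method is worth knowing: method='time', which weights the interpolation by the actual time gaps and needs a DatetimeIndex. The sketch below uses made-up readings to contrast it with method='linear', which treats rows as equally spaced.

```python
import numpy as np
import pandas as pd

# Readings on Jan 1, Jan 2 and Jan 4: the second gap is twice as long
idx = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-04'])
temps = pd.Series([10.0, np.nan, 40.0], index=idx)

# 'linear' ignores the index and places the value halfway between neighbors
by_position = temps.interpolate(method='linear')

# 'time' uses the timestamps: Jan 2 is one third of the way from Jan 1 to Jan 4
by_time = temps.interpolate(method='time')

print(by_position.iloc[1], by_time.iloc[1])
```

On evenly spaced data the two methods agree, so the choice only matters when observations are irregular.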
Inplace vs Copy
By default, fillna() returns a new DataFrame:
# Default: returns a new DataFrame
df_filled = df.fillna(0)

# Original unchanged
print(df)  # Still has NaN values

# Inplace: modifies original DataFrame
df.fillna(0, inplace=True)
print(df)  # NaN values are now 0
✅ Best Practices
- Use inplace=False (default) – Safer, allows comparison before/after
- Use inplace=True – Only when you are sure you no longer need the original (memory savings are not guaranteed)
- Always assign the result – With inplace=False, the filled data exists only in the returned object
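A related pitfall is calling fillna(..., inplace=True) on a single selected column, such as df['Age'].fillna(x, inplace=True). Recent pandas 2.x releases warn about this pattern, and with Copy-on-Write enabled it does not update the parent DataFrame. The sketch below shows two assignment-based alternatives on made-up data.

```python
import numpy as np
import pandas as pd

# Option 1: assign the filled column back to the DataFrame
df = pd.DataFrame({'Age': [25, np.nan, 30, np.nan]})
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Option 2: fill through the DataFrame with a per-column mapping
df2 = pd.DataFrame({'Age': [25, np.nan, 30, np.nan]})
df2 = df2.fillna({'Age': df2['Age'].mean()})

print(df['Age'].tolist())
```

Both forms make the update explicit, so they behave the same regardless of the Copy-on-Write setting.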
Real-World Examples
Example 1: Customer Age and Income Data
df = pd.DataFrame({
    'Customer': ['John', 'Jane', 'Bob', 'Alice'],
    'Age': [25, np.nan, 35, np.nan],
    'Income': [50000, 60000, np.nan, 80000]
})

# Strategy: Use mean for age, median for income
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Income'] = df['Income'].fillna(df['Income'].median())

print(df)
Example 2: Stock Price Time Series
df = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=7),
    'Price': [100, np.nan, np.nan, 110, np.nan, 115, 120]
})

# Forward fill for stock prices (assume price stays same until new data)
df['Price'] = df['Price'].ffill()

print(df)
Example 3: Sensor Data with Interpolation
df = pd.DataFrame({
    'Hour': range(6),
    'Humidity': [60, np.nan, np.nan, 75, np.nan, 85]
})

# Interpolate humidity values
df['Humidity'] = df['Humidity'].interpolate(method='linear')

print(df)
Performance Tips & Best Practices
🚀 Performance Optimization
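1. Use vectorized calls instead of loops
# ❌ SLOW – loop through numeric columns
for col in df.select_dtypes('number').columns:
    df[col] = df[col].fillna(df[col].mean())

# ✅ FAST – vectorized operation
# numeric_only=True skips text columns, which have no mean
df.fillna(df.mean(numeric_only=True), inplace=True)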
2. Use the appropriate fill method
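# ✅ For time series (fast)
df.ffill()

# ✅ For specific values (efficient)
df.fillna({'col1': 0, 'col2': 'N/A'})

# ❌ For complex logic (slower, use apply as last resort)
# custom_logic is a placeholder for your own function returning one fill value per column
df.fillna(df.apply(custom_logic), inplace=True)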
3. Fill in the right order
# Fill numeric columns first
df.fillna(df.mean(numeric_only=True), inplace=True)

# Then fill categorical columns
df.fillna('Unknown', inplace=True)
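The ordering tip can be sketched end to end on a small mixed-type DataFrame. The column names are made up, and select_dtypes is used here as one possible way to restrict the text fill to non-numeric columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, np.nan, 30],
    'City': ['Paris', None, 'Rome'],
})

# Step 1: numeric columns get their own mean; text columns are untouched
numeric_means = df.mean(numeric_only=True)
df = df.fillna(numeric_means)

# Step 2: remaining gaps in non-numeric columns get a placeholder label
text_cols = df.select_dtypes(exclude='number').columns
df[text_cols] = df[text_cols].fillna('Unknown')

print(df)
```

Filling numeric columns first matters because a blanket fillna('Unknown') applied earlier would put a string into a numeric column and change its dtype.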
Common Mistakes to Avoid
⚠️ Mistake #1: Forgetting to Assign Result
# ❌ WRONG – result is discarded
df.fillna(0)

# ✅ CORRECT – assign the result
df = df.fillna(0)

# OR use inplace
df.fillna(0, inplace=True)
⚠️ Mistake #2: Filling with Inappropriate Values
# ❌ WRONG – an age of 0 is not a plausible value and skews statistics
df['Age'] = df['Age'].fillna(0)

# ✅ CORRECT – use mean or median
df['Age'] = df['Age'].fillna(df['Age'].mean())
⚠️ Mistake #3: Not Checking Fill Results
# Check remaining NaN after filling
print(df.isnull().sum())

# Or use inplace=False to compare
df_filled = df.fillna(0)
print(f"Original NaN count: {df.isnull().sum().sum()}")
print(f"Filled NaN count: {df_filled.isnull().sum().sum()}")
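The same check can be wrapped in a small helper that reports per-column counts side by side. fill_report is a hypothetical name introduced for this sketch, not a pandas function.

```python
import numpy as np
import pandas as pd

def fill_report(before, after):
    """Summarize per-column NaN counts before and after a fill (hypothetical helper)."""
    return pd.DataFrame({
        'nan_before': before.isna().sum(),
        'nan_after': after.isna().sum(),
    })

df = pd.DataFrame({
    'Age': [25, np.nan, 30, np.nan],
    'Salary': [50000, 60000, np.nan, 70000],
})

# 'Salary' is intentionally left out, so the report should still show its gap
df_filled = df.fillna({'Age': df['Age'].mean()})

report = fill_report(df, df_filled)
print(report)
```

A per-column view like this makes it easier to spot a column that a fill silently skipped.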
Key Takeaways
fillna() is essential for data cleaning. Here’s what you now know:
- Scalar Fill: Replace all NaN with a single value
- Dictionary Fill: Fill different columns with different values
- Forward/Backward Fill: Propagate values for time series
- Interpolation: Fill based on mathematical patterns
- Limit Parameter: Control how many consecutive values to fill
- Inplace: Modify original DataFrame directly
- Performance: Use vectorized operations, not loops
Next steps: Practice with your own datasets and choose the fill method that makes sense for your data type and analysis goals.
