What is Filtering?
Filtering in pandas means selecting rows that meet specific conditions. It’s one of the most fundamental operations in data analysis.
Common filtering scenarios:
- Select customers with purchases over $1,000
- Find data from a specific date range
- Get rows where a column equals a specific value
- Filter multiple conditions simultaneously (AND, OR logic)
- Find text matching a pattern (substring, regex)
Why filtering matters:
- Focus analysis on relevant data
- Handle large datasets efficiently
- Build data pipelines and workflows
- Prepare data for machine learning
- Generate reports by category or condition
Basic Filtering by Column Value
import pandas as pd
# Create sample data
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 28],
'Department': ['Sales', 'IT', 'HR', 'Sales', 'IT'],
'Salary': [50000, 75000, 60000, 55000, 70000]
})
print("Original DataFrame:")
print(df)
Simple Equality Filter
# Filter where Department equals 'IT'
filtered = df[df['Department'] == 'IT']
print(filtered)
Output:
Name Age Department Salary
1 Bob 30 IT 75000
4 Eve 28 IT 70000
💡 How it works:
df['Department'] == 'IT' creates a boolean Series (True/False values), and df[boolean_series] selects rows where True.Boolean Indexing Explained
Step-by-Step: How Boolean Indexing Works
# Step 1: Create boolean mask
mask = df['Age'] > 30
print("Boolean mask:")
print(mask)
# Output:
# 0 False
# 1 False
# 2 True
# 3 True
# 4 False
# Step 2: Use mask to filter DataFrame
filtered = df[mask]
print("\nFiltered result:")
print(filtered)
Combined (one-liner):
# Same result in one line
filtered = df[df['Age'] > 30]
print(filtered)
💡 Key concept: The boolean mask must have the same length as the DataFrame. Pandas uses True to include rows and False to exclude them.
Single Condition Filtering
Numeric Comparisons
# Greater than
df[df['Age'] > 30]
# Greater than or equal
df[df['Salary'] >= 60000]
# Less than
df[df['Age'] < 30]
# Not equal
df[df['Department'] != 'HR']
String Matching
# Exact match
df[df['Name'] == 'Alice']
# Not equal
df[df['Department'] != 'Sales']
# Using isin() for multiple values
df[df['Department'].isin(['Sales', 'IT'])]
Filtering NaN/Null Values
# Keep rows with non-null values in a column
df[df['Age'].notna()]
# Keep rows with null values
df[df['Age'].isna()]
Multiple Condition Filtering
AND Condition (&)
# Age > 28 AND Salary > 60000
filtered = df[(df['Age'] > 28) & (df['Salary'] > 60000)]
print(filtered)
Output:
Name Age Department Salary
1 Bob 30 IT 75000
2 Charlie 35 HR 60000
OR Condition (|)
# Department == 'Sales' OR Age > 35
filtered = df[(df['Department'] == 'Sales') | (df['Age'] > 35)]
print(filtered)
Combining AND and OR
# (Department == 'Sales' OR Department == 'IT') AND Salary > 60000
filtered = df[(df['Department'].isin(['Sales', 'IT'])) & (df['Salary'] > 60000)]
print(filtered)
NOT Condition (~)
# NOT in Sales department
filtered = df[~(df['Department'] == 'Sales')]
# Equivalent to:
filtered = df[df['Department'] != 'Sales']
⚠️ Critical: Use
& (AND), | (OR), ~ (NOT) – NOT the Python keywords and, or, not!String Filtering with .str Accessor
Substring Search
# Find names containing 'ar'
filtered = df[df['Name'].str.contains('ar')]
print(filtered) # Charlie and David
Case-Insensitive Search
# Find names containing 'a' (case-insensitive)
filtered = df[df['Name'].str.contains('a', case=False)]
print(filtered) # Alice, Charlie, David
String Methods
# Starts with
df[df['Name'].str.startswith('A')]
# Ends with
df[df['Name'].str.endswith('ie')] # Charlie
# String length
df[df['Name'].str.len() > 5] # Charlie, David
# Uppercase
df[df['Name'].str.upper() == 'BOB']
Regular Expressions
# Match pattern (names with 2+ vowels)
filtered = df[df['Name'].str.contains(r'[aeiou]{2,}', regex=True)]
# Match names ending with 'e'
filtered = df[df['Name'].str.match(r'.*e$', regex=True)]
Numeric Filtering Techniques
Range Filtering
# Age between 28 and 35 (inclusive)
filtered = df[(df['Age'] >= 28) & (df['Age'] <= 35)] # Salary between 50k and 70k filtered = df[(df['Salary'] >= 50000) & (df['Salary'] <= 70000)]
Between Method (Cleaner)
# More readable
filtered = df[df['Age'].between(28, 35)]
print(filtered)
Top N Filtering
# Top 3 salaries (using nlargest)
top_earners = df.nlargest(3, 'Salary')
# Bottom 2 salaries (using nsmallest)
lowest_earners = df.nsmallest(2, 'Salary')
Statistical Filtering
# Salaries above mean
mean_salary = df['Salary'].mean()
above_average = df[df['Salary'] > mean_salary]
# Values within 1 standard deviation
std_dev = df['Age'].std()
mean_age = df['Age'].mean()
normal_range = df[(df['Age'] >= mean_age - std_dev) &
(df['Age'] <= mean_age + std_dev)]
Date Range Filtering
Filter by Date Range
import pandas as pd
df = pd.DataFrame({
'Date': pd.date_range('2024-01-01', periods=10),
'Sales': [100, 120, 150, 110, 90, 200, 180, 160, 140, 130]
})
# Filter between dates
start_date = '2024-01-05'
end_date = '2024-01-08'
filtered = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]
print(filtered)
Recent Data Only
# Last 5 days
filtered = df[df['Date'] >= '2024-01-06']
# Specific month
filtered = df[df['Date'].dt.month == 1]
# Specific year
filtered = df[df['Date'].dt.year == 2024]
Time-Based Comparisons
# Last 7 days from today
from datetime import timedelta
recent = df[df['Date'] >= (pd.Timestamp.now() - timedelta(days=7))]
# Data from last month
last_month_start = pd.Timestamp.now() - timedelta(days=30)
filtered = df[df['Date'] >= last_month_start]
Using isin() for Multiple Values
Filter by Multiple Values
# Department is Sales OR IT OR HR
departments = ['Sales', 'IT', 'HR']
filtered = df[df['Department'].isin(departments)]
# More concise than:
# df[(df['Department'] == 'Sales') | (df['Department'] == 'IT') | (df['Department'] == 'HR')]
isin() with Numeric Values
# Ages 25, 30, or 35
filtered = df[df['Age'].isin([25, 30, 35])]
# Salary in specific amounts
filtered = df[df['Salary'].isin([50000, 60000, 70000])]
NOT isin() (Exclude Values)
# NOT in these departments
filtered = df[~df['Department'].isin(['HR', 'IT'])]
# Same as:
# df[(df['Department'] != 'HR') & (df['Department'] != 'IT')]
💡 Pro tip: Use
isin() when filtering by multiple values. It’s cleaner and more efficient than multiple OR conditions.Alternative Filter Methods
Using query() Method
# String-based filtering (more readable for complex conditions)
filtered = df.query('Age > 30 and Salary > 60000')
# Equivalent to:
# df[(df['Age'] > 30) & (df['Salary'] > 60000)]
Using filter() for Column Selection
# Filter columns, not rows (different use)
df.filter(['Name', 'Salary']) # Select columns
# Filter with regex
df.filter(regex='^S') # Columns starting with S
Using loc[] for Complex Filtering
# loc with conditions
filtered = df.loc[df['Age'] > 30]
filtered = df.loc[(df['Age'] > 30) & (df['Salary'] > 60000), ['Name', 'Salary']]
Using mask() for Inverted Logic
# mask() returns rows where condition is False (opposite of filtering)
result = df.mask(df['Age'] > 30) # Returns rows where Age <= 30
Performance Optimization
🚀 Filter Performance Tips
1. Use Simple Conditions First
# ✅ FAST - simple numeric comparison
df[df['Age'] > 30]
# ❌ SLOWER - string operations
df[df['Name'].str.contains('A')]
2. Combine Conditions in Single Filter
# ✅ FAST - single operation
filtered = df[(df['Age'] > 30) & (df['Salary'] > 60000)]
# ❌ SLOWER - multiple filters
filtered = df[df['Age'] > 30]
filtered = filtered[filtered['Salary'] > 60000]
3. Use query() for Complex Logic
# ✅ FAST and readable
df.query('Age > 30 and Salary > 60000 and Department != "HR"')
# ❌ SLOWER and harder to read
df[(df['Age'] > 30) & (df['Salary'] > 60000) & (df['Department'] != 'HR')]
4. Use isin() for Multiple Values
# ✅ FAST
df[df['Department'].isin(['Sales', 'IT', 'HR'])]
# ❌ SLOWER
df[(df['Department'] == 'Sales') | (df['Department'] == 'IT') | (df['Department'] == 'HR')]
Common Mistakes to Avoid
⚠️ Mistake #1: Using ‘and’ Instead of ‘&’
# ❌ WRONG - doesn't work with pandas
df[df['Age'] > 30 and df['Salary'] > 60000]
# ✅ CORRECT
df[(df['Age'] > 30) & (df['Salary'] > 60000)]
⚠️ Mistake #2: Forgetting Parentheses
# ❌ WRONG - operator precedence issue
df[df['Age'] > 30 & df['Salary'] > 60000]
# ✅ CORRECT
df[(df['Age'] > 30) & (df['Salary'] > 60000)]
⚠️ Mistake #3: Not Assigning Result
# ❌ WRONG - doesn't modify df
df[df['Age'] > 30]
# ✅ CORRECT
filtered = df[df['Age'] > 30]
# OR
df = df[df['Age'] > 30]
⚠️ Mistake #4: String Comparison Case Sensitivity
# ❌ WRONG - case mismatch
df[df['Department'] == 'it'] # Returns nothing if stored as 'IT'
# ✅ CORRECT
df[df['Department'].str.lower() == 'it']
# OR
df[df['Department'].str.upper() == 'IT']
Filtering Mastery
You now understand pandas filtering comprehensively:
- Boolean indexing: Create and apply True/False masks
- Single conditions: Numeric, string, and null checks
- Multiple conditions: AND (&), OR (|), NOT (~) logic
- String filtering: Contains, startswith, regex patterns
- Numeric filtering: Ranges, comparisons, statistical
- Date filtering: Time-based selection
- isin(): Multiple value matching
- Alternative methods: query(), loc[], mask()
Next step: Practice with your own datasets to build intuition for creating complex filters!
