Pandas filter: Data Selection and Conditional Filtering Complete Guide

What is Filtering?

Filtering in pandas means selecting rows that meet specific conditions. It’s one of the most fundamental operations in data analysis.

Common filtering scenarios:

  • Select customers with purchases over $1,000
  • Find data from a specific date range
  • Get rows where a column equals a specific value
  • Filter multiple conditions simultaneously (AND, OR logic)
  • Find text matching a pattern (substring, regex)

Why filtering matters:

  • Focus analysis on relevant data
  • Handle large datasets efficiently
  • Build data pipelines and workflows
  • Prepare data for machine learning
  • Generate reports by category or condition

Basic Filtering by Column Value

import pandas as pd

# Create sample data
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 28],
    'Department': ['Sales', 'IT', 'HR', 'Sales', 'IT'],
    'Salary': [50000, 75000, 60000, 55000, 70000]
})

print("Original DataFrame:")
print(df)

Simple Equality Filter

# Filter where Department equals 'IT'
filtered = df[df['Department'] == 'IT']
print(filtered)

Output:

  Name  Age Department  Salary
1  Bob   30         IT   75000
4  Eve   28         IT   70000
💡 How it works: df['Department'] == 'IT' creates a boolean Series (True/False values), and df[boolean_series] selects rows where True.

Boolean Indexing Explained

Step-by-Step: How Boolean Indexing Works

# Step 1: Create boolean mask
mask = df['Age'] > 30
print("Boolean mask:")
print(mask)
# Output:
# 0    False
# 1    False
# 2     True
# 3     True
# 4    False

# Step 2: Use mask to filter DataFrame
filtered = df[mask]
print("\nFiltered result:")
print(filtered)

Combined (one-liner):

# Same result in one line
filtered = df[df['Age'] > 30]
print(filtered)
💡 Key concept: The boolean mask must have the same length as the DataFrame. Pandas uses True to include rows and False to exclude them.

Single Condition Filtering

Numeric Comparisons

# Greater than
df[df['Age'] > 30]

# Greater than or equal
df[df['Salary'] >= 60000]

# Less than
df[df['Age'] < 30]

# Not equal
df[df['Department'] != 'HR']

String Matching

# Exact match
df[df['Name'] == 'Alice']

# Not equal
df[df['Department'] != 'Sales']

# Using isin() for multiple values
df[df['Department'].isin(['Sales', 'IT'])]

Filtering NaN/Null Values

# Keep rows with non-null values in a column
df[df['Age'].notna()]

# Keep rows with null values
df[df['Age'].isna()]

Multiple Condition Filtering

AND Condition (&)

# Age > 28 AND Salary > 60000
filtered = df[(df['Age'] > 28) & (df['Salary'] > 60000)]
print(filtered)

Output:

  Name  Age Department  Salary
1  Bob   30         IT   75000
2 Charlie 35  HR   60000

OR Condition (|)

# Department == 'Sales' OR Age > 35
filtered = df[(df['Department'] == 'Sales') | (df['Age'] > 35)]
print(filtered)

Combining AND and OR

# (Department == 'Sales' OR Department == 'IT') AND Salary > 60000
filtered = df[(df['Department'].isin(['Sales', 'IT'])) & (df['Salary'] > 60000)]
print(filtered)

NOT Condition (~)

# NOT in Sales department
filtered = df[~(df['Department'] == 'Sales')]

# Equivalent to:
filtered = df[df['Department'] != 'Sales']
⚠️ Critical: Use & (AND), | (OR), ~ (NOT) – NOT the Python keywords and, or, not!

String Filtering with .str Accessor

Substring Search

# Find names containing 'ar'
filtered = df[df['Name'].str.contains('ar')]
print(filtered) # Charlie and David

Case-Insensitive Search

# Find names containing 'a' (case-insensitive)
filtered = df[df['Name'].str.contains('a', case=False)]
print(filtered) # Alice, Charlie, David

String Methods

# Starts with
df[df['Name'].str.startswith('A')]

# Ends with
df[df['Name'].str.endswith('ie')] # Charlie

# String length
df[df['Name'].str.len() > 5] # Charlie, David

# Uppercase
df[df['Name'].str.upper() == 'BOB']

Regular Expressions

# Match pattern (names with 2+ vowels)
filtered = df[df['Name'].str.contains(r'[aeiou]{2,}', regex=True)]

# Match names ending with 'e'
filtered = df[df['Name'].str.match(r'.*e$', regex=True)]

Numeric Filtering Techniques

Range Filtering

# Age between 28 and 35 (inclusive)
filtered = df[(df['Age'] >= 28) & (df['Age'] <= 35)] # Salary between 50k and 70k filtered = df[(df['Salary'] >= 50000) & (df['Salary'] <= 70000)]

Between Method (Cleaner)

# More readable
filtered = df[df['Age'].between(28, 35)]
print(filtered)

Top N Filtering

# Top 3 salaries (using nlargest)
top_earners = df.nlargest(3, 'Salary')

# Bottom 2 salaries (using nsmallest)
lowest_earners = df.nsmallest(2, 'Salary')

Statistical Filtering

# Salaries above mean
mean_salary = df['Salary'].mean()
above_average = df[df['Salary'] > mean_salary]

# Values within 1 standard deviation
std_dev = df['Age'].std()
mean_age = df['Age'].mean()
normal_range = df[(df['Age'] >= mean_age - std_dev) &
                  (df['Age'] <= mean_age + std_dev)]

Date Range Filtering

Filter by Date Range

import pandas as pd

df = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=10),
    'Sales': [100, 120, 150, 110, 90, 200, 180, 160, 140, 130]
})

# Filter between dates
start_date = '2024-01-05'
end_date = '2024-01-08'
filtered = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]
print(filtered)

Recent Data Only

# Last 5 days
filtered = df[df['Date'] >= '2024-01-06']

# Specific month
filtered = df[df['Date'].dt.month == 1]

# Specific year
filtered = df[df['Date'].dt.year == 2024]

Time-Based Comparisons

# Last 7 days from today
from datetime import timedelta

recent = df[df['Date'] >= (pd.Timestamp.now() - timedelta(days=7))]

# Data from last month
last_month_start = pd.Timestamp.now() - timedelta(days=30)
filtered = df[df['Date'] >= last_month_start]

Using isin() for Multiple Values

Filter by Multiple Values

# Department is Sales OR IT OR HR
departments = ['Sales', 'IT', 'HR']
filtered = df[df['Department'].isin(departments)]

# More concise than:
# df[(df['Department'] == 'Sales') | (df['Department'] == 'IT') | (df['Department'] == 'HR')]

isin() with Numeric Values

# Ages 25, 30, or 35
filtered = df[df['Age'].isin([25, 30, 35])]

# Salary in specific amounts
filtered = df[df['Salary'].isin([50000, 60000, 70000])]

NOT isin() (Exclude Values)

# NOT in these departments
filtered = df[~df['Department'].isin(['HR', 'IT'])]

# Same as:
# df[(df['Department'] != 'HR') & (df['Department'] != 'IT')]
💡 Pro tip: Use isin() when filtering by multiple values. It’s cleaner and more efficient than multiple OR conditions.

Alternative Filter Methods

Using query() Method

# String-based filtering (more readable for complex conditions)
filtered = df.query('Age > 30 and Salary > 60000')

# Equivalent to:
# df[(df['Age'] > 30) & (df['Salary'] > 60000)]

Using filter() for Column Selection

# Filter columns, not rows (different use)
df.filter(['Name', 'Salary']) # Select columns

# Filter with regex
df.filter(regex='^S') # Columns starting with S

Using loc[] for Complex Filtering

# loc with conditions
filtered = df.loc[df['Age'] > 30]
filtered = df.loc[(df['Age'] > 30) & (df['Salary'] > 60000), ['Name', 'Salary']]

Using mask() for Inverted Logic

# mask() returns rows where condition is False (opposite of filtering)
result = df.mask(df['Age'] > 30) # Returns rows where Age <= 30

Performance Optimization

🚀 Filter Performance Tips

1. Use Simple Conditions First

# ✅ FAST - simple numeric comparison
df[df['Age'] > 30]

# ❌ SLOWER - string operations
df[df['Name'].str.contains('A')]

2. Combine Conditions in Single Filter

# ✅ FAST - single operation
filtered = df[(df['Age'] > 30) & (df['Salary'] > 60000)]

# ❌ SLOWER - multiple filters
filtered = df[df['Age'] > 30]
filtered = filtered[filtered['Salary'] > 60000]

3. Use query() for Complex Logic

# ✅ FAST and readable
df.query('Age > 30 and Salary > 60000 and Department != "HR"')

# ❌ SLOWER and harder to read
df[(df['Age'] > 30) & (df['Salary'] > 60000) & (df['Department'] != 'HR')]

4. Use isin() for Multiple Values

# ✅ FAST
df[df['Department'].isin(['Sales', 'IT', 'HR'])]

# ❌ SLOWER
df[(df['Department'] == 'Sales') | (df['Department'] == 'IT') | (df['Department'] == 'HR')]

Common Mistakes to Avoid

⚠️ Mistake #1: Using ‘and’ Instead of ‘&’

# ❌ WRONG - doesn't work with pandas
df[df['Age'] > 30 and df['Salary'] > 60000]

# ✅ CORRECT
df[(df['Age'] > 30) & (df['Salary'] > 60000)]

⚠️ Mistake #2: Forgetting Parentheses

# ❌ WRONG - operator precedence issue
df[df['Age'] > 30 & df['Salary'] > 60000]

# ✅ CORRECT
df[(df['Age'] > 30) & (df['Salary'] > 60000)]

⚠️ Mistake #3: Not Assigning Result

# ❌ WRONG - doesn't modify df
df[df['Age'] > 30]

# ✅ CORRECT
filtered = df[df['Age'] > 30]
# OR
df = df[df['Age'] > 30]

⚠️ Mistake #4: String Comparison Case Sensitivity

# ❌ WRONG - case mismatch
df[df['Department'] == 'it'] # Returns nothing if stored as 'IT'

# ✅ CORRECT
df[df['Department'].str.lower() == 'it']
# OR
df[df['Department'].str.upper() == 'IT']

Filtering Mastery

You now understand pandas filtering comprehensively:

  • Boolean indexing: Create and apply True/False masks
  • Single conditions: Numeric, string, and null checks
  • Multiple conditions: AND (&), OR (|), NOT (~) logic
  • String filtering: Contains, startswith, regex patterns
  • Numeric filtering: Ranges, comparisons, statistical
  • Date filtering: Time-based selection
  • isin(): Multiple value matching
  • Alternative methods: query(), loc[], mask()

Next step: Practice with your own datasets to build intuition for creating complex filters!

📚 Learn more pandas tutorials at Pandas How-To – Your complete guide to data analysis in Python

Related articles: loc, iloc, Boolean Indexing, Data Selection, Query Method

Leave a Reply