Pandas query(): Efficient DataFrame Filtering

If you filter DataFrames with chained comparisons or boolean masks, you’ll love df.query(). It offers clean syntax, improved performance on complex filters, and supports inline variables.

Why Use query()?

  • Cleaner syntax: Write expressions like SQL where clauses
  • Fewer errors: No need for bracket-heavy filters
  • Performance boosts: Evaluated with numexpr by default
  • Variable support: Easily insert Python variables using “@”

Basic Syntax Example

import pandas as pd

df = pd.DataFrame({
  'age': [25, 32, 47, 51],
  'city': ['NY', 'LA', 'NY', 'SF'],
  'score': [88, 92, 75, 85]
})

# Filter age > 30 and score >= 80
df.query('age > 30 & score >= 80')

Result: Rows where age is over 30 and score at least 80.

Using Variables in Queries

min_score = 80
target_cities = ['NY', 'SF']
df.query('score >= @min_score and city in @target_cities')

Speed Advantage Over Boolean Masks

Under the hood, query() uses numexpr, making it faster than:

df[(df['age'] > 30) & (df['score'] >= 80)]

Complex Logic Example

df.query('(city == "NY" or city == "SF") and score != 85')

When query() May Not Work

  • Column names with spaces or special chars (non-identifier)
  • Reserved Python keywords — use backticks, e.g. `class`
  • MultiIndex columns — need different handling

Comparison Table: query() vs loc / boolean indexing

Feature query() df.loc / boolean filters
Readability High (SQL-style) Medium–Low (verbose)
Performance Better with numexpr Standard Python evaluation
Variable injection Yes (@var) No direct support
Complex logic Yes Yes
Edge cases Columns must be valid identifiers Works for any column name

When to Use query()

  • SQL-style filtering improves readability
  • Using variables inside filters
  • Working with large DataFrames and needing speed

query() offers a neat, fast, SQL-like way to filter DataFrames, with support for variables and improved speed. Use it for cleaner code and more efficient filtering in Pandas!

Leave a Reply