How to use where in Pandas

When working with datasets in Pandas, you often need to perform actions based on conditions. Perhaps you want to replace certain values if they meet a specific criteria, or maybe you want to isolate portions of your data for deeper analysis. That’s where the where method in Pandas becomes incredibly valuable.

Think of where as a conditional mask for your Series or DataFrame. It allows you to apply a condition and selectively keep values where that condition is true, and replace values where it’s false. This isn’t just about filtering rows or columns; it’s about manipulating the values themselves based on a logical test.

How where Works

The core of where is a condition. This condition can be:

  • A boolean Series or DataFrame of the same shape as the data you’re applying it to.
  • A boolean array-like object (like a NumPy array).
  • A callable function that evaluates to a boolean.

When you apply .where(condition) to a Pandas object, it checks the condition at each element.

  • If the condition is TRUE: The original value at that position is kept.
  • If the condition is FALSE: The value is replaced with something else. By default, it’s replaced with NaN (Not a Number), signifying a missing value. You can customize this replacement value using the `other` parameter.

Code in Action

Let’s see where in a practical context. Imagine you have sales data and you want to highlight sales figures that are below a certain threshold.

import pandas as pd

data = {'Product': ['A', 'B', 'C', 'D', 'E'],
'Sales': [150, 220, 80, 300, 190]}
df = pd.DataFrame(data)

print("Original DataFrame:\n", df)

# Condition: Sales greater than or equal to 100
condition = df['Sales'] >= 100

# Apply where - keep sales above 100, replace others with NaN
df['Sales_Adjusted'] = df['Sales'].where(condition)

print("\nDataFrame after applying 'where':\n", df)

Output:

Original DataFrame:
Product Sales
0 A 150
1 B 220
2 C 80
3 D 300
4 E 190

DataFrame after applying 'where':
Product Sales Sales_Adjusted
0 A 150 150.0
1 B 220 220.0
2 C 80 NaN
3 D 300 300.0
4 E 190 190.0

In this example, where(condition) on the Sales column kept the original sales values where Sales >= 100 was true. For sales below 100, they were replaced with NaN.

Customizing the Replacement with other

Instead of NaN, you can specify a different replacement value using the other parameter. Let’s replace sales below 100 with 0:

df['Sales_Adjusted_Zero'] = df['Sales'].where(condition, other=0)
print("\nDataFrame with 'other' parameter (0):\n", df)

Output:

DataFrame with 'other' parameter (0):
Product Sales Sales_Adjusted Sales_Adjusted_Zero
0 A 150 150.0 150
1 B 220 220.0 220
2 C 80 NaN 0
3 D 300 300.0 300
4 E 190 190.0 190

Now, sales below 100 are explicitly shown as 0 in the Sales_Adjusted_Zero column.

Working with DataFrames

where isn’t limited to Series; you can apply it to entire DataFrames. The condition should then be a DataFrame of the same shape or a Series that can be broadcast.

df_original = df[['Sales']].copy() # Working on a copy to show DataFrame where
print("\nOriginal Sales DataFrame:\n", df_original)

df_masked = df_original.where(df_original > 200, other="Low Sales")
print("\nDataFrame 'where' applied:\n", df_masked)

Output:

Original Sales DataFrame:
Sales
0 150
1 220
2 80
3 300
4 190

DataFrame 'where' applied:
Sales
0 Low Sales
1 220
2 Low Sales
3 300
4 Low Sales

Here, we applied where to the entire df_original DataFrame. Where df_original > 200 was false (meaning Sales were 200 or less), the value was replaced with the string “Low Sales”.

Key Takeaways

  • pandas.where is for conditional value replacement based on a condition.
  • It’s different from filtering rows or columns; it manipulates values at specific locations.
  • The other parameter gives you control over the replacement value, beyond the default NaN.
  • It works seamlessly with both Series and DataFrames, offering flexibility in your data manipulations.

Leave a Reply