How to use isin in Pandas

When you’re working with datasets, you often need to find rows or elements where the values belong to a predefined group. Maybe you want to select customers from certain cities, or identify products from a specific category. That’s exactly where isin() in pandas shines.

What does isin() do?

The isin() method in pandas is used to check if each element in a Series or DataFrame column is contained within a set of given values. This set of values could be a list, a tuple, a Series, or even another DataFrame column.

isin() returns a boolean Series or DataFrame of the same shape as the original object. Each element in the resulting boolean object is True if the corresponding element in the original object is found within the provided set of values, and False otherwise.

Basic Usage with a Series

Let’s start with a Series example to understand the basics:

import pandas as pd

colors = pd.Series(['Red', 'Green', 'Blue', 'Yellow', 'Green', 'Purple'])
print("Original Series:\n", colors)

# Check if each color is in the list ['Red', 'Blue']
check_colors = colors.isin(['Red', 'Blue'])
print("\nIs in ['Red', 'Blue']:\n", check_colors)

This will produce:

Original Series:
0 Red
1 Green
2 Blue
3 Yellow
4 Green
5 Purple
dtype: object

Is in ['Red', 'Blue']::
0 True
1 False
2 True
3 False
4 False
5 False
dtype: bool

As you can see, isin([‘Red’, ‘Blue’]) checked each color in the colors Series. It returned True for ‘Red’ and ‘Blue’, and False for ‘Green’, ‘Yellow’, and ‘Purple’ because they are not in the list [‘Red’, ‘Blue’].

Using isin() for Filtering DataFrames

The real power of isin() comes out when you use it to filter DataFrames. Let’s create a DataFrame and see how it works:

data = {'Product': ['Apple', 'Banana', 'Orange', 'Grapes', 'Kiwi', 'Strawberry'],
'Category': ['Fruit', 'Fruit', 'Fruit', 'Fruit', 'Fruit', 'Fruit'],
'Region': ['North', 'South', 'East', 'West', 'North', 'South'],
'Price': [1.2, 0.8, 0.9, 1.5, 1.8, 2.0]}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

Now, let’s say we want to find all products that are from the ‘North’ or ‘West’ regions. We can use isin() like this:

regions_to_select = ['North', 'West']
region_filter = df['Region'].isin(regions_to_select)
print("\nBoolean Series for regions in ['North', 'West']:\n", region_filter)

filtered_df = df[region_filter] # Use boolean Series to filter the DataFrame
print("\nFiltered DataFrame (Regions 'North' or 'West'):\n", filtered_df)

Output:

Original DataFrame:
Product Category Region Price
0 Apple Fruit North 1.2
1 Banana Fruit South 0.8
2 Orange Fruit East 0.9
3 Grapes Fruit West 1.5
4 Kiwi Fruit North 1.8
5 Strawberry Fruit South 2.0

Boolean Series for regions in ['North', 'West']:
0 True
1 False
2 False
3 True
4 True
5 False
Name: Region, dtype: bool

Filtered DataFrame (Regions 'North' or 'West'):
Product Category Region Price
0 Apple Fruit North 1.2
3 Grapes Fruit West 1.5
4 Kiwi Fruit North 1.8

Here, df[‘Region’].isin(regions_to_select) created a boolean Series indicating which rows have ‘Region’ values in [‘North’, ‘West’]. Then, we used this boolean Series region_filter to index the DataFrame df, effectively selecting only the rows where the condition is True.

Using isin() with Multiple Columns (for DataFrame)

For DataFrames, you can also apply isin() to multiple columns at once, but you need to provide a DataFrame as the values argument, where column names should match. Let’s say we want to find products that are either ‘Apple’ and from ‘North’, or ‘Banana’ and from ‘South’.

search_values = pd.DataFrame({'Product': ['Apple', 'Banana'],
'Region': ['North', 'South']})

multi_column_filter = df[['Product', 'Region']].isin(search_values)
print("\nBoolean DataFrame for multi-column isin:\n", multi_column_filter)

# Note: This doesn't filter exactly as intended in the example description.
# It checks if 'Apple' OR 'North' is present in each row of 'Product' and 'Region' columns.
# It's NOT checking for the paired condition ('Apple' AND 'North') OR ('Banana' AND 'South').

# To achieve the paired condition filtering, you need a different approach (e.g., using loc with conditions)

Output:

Boolean DataFrame for multi-column isin::
Product Region
0 True True
1 True True
2 False False
3 False False
4 False True
5 True True

As illustrated above, using a DataFrame with isin() for multiple columns checks for membership in either column. It doesn’t directly check for combined conditions across columns in a row-wise fashion. For row-wise combined conditions, you’d typically use boolean operators (&, |) with direct comparisons.

The invert=True Parameter: Finding Values NOT In the Set

Sometimes, you need to do the opposite – find values that are not in a specific set. That’s where the invert=True parameter comes in.

print("\nColors NOT in ['Red', 'Blue']:\n", colors.isin(['Red', 'Blue'], invert=True))

Output:

Colors NOT in ['Red', 'Blue']::
0 False
1 True
2 False
3 True
4 True
5 True
dtype: bool

With invert=True, isin() now returns True for values not in [‘Red’, ‘Blue’] and False for values that are in the set.

Key Takeaways about isin()

  • pandas.Series.isin() and pandas.DataFrame.isin() are used to check if elements are present in a given set of values.
  • It returns boolean Series or DataFrames, which are perfect for filtering data.
  • It’s efficient for selecting data based on value membership in lists, tuples, Series, or even other DataFrames (with column matching considerations).
  • Use invert=True to find elements that are not in the specified set.

Leave a Reply