How to remove outliers in Pandas

There are several ways to remove outliers in Pandas. Here are a few commonly used techniques:

Z-Score Method

Calculate the z-score of each data point and remove the points whose absolute z-score reaches or exceeds a chosen threshold (3 is a common choice). The z-score measures how many standard deviations a data point lies from the mean.

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [10, 20, 30, 40, 50, 60, 70, 80, 90]})

# Calculate the absolute z-score for each column
# (df.std() uses the sample standard deviation, ddof=1)
z_scores = np.abs((df - df.mean()) / df.std())

# Set the threshold for outlier removal
threshold = 3

# Remove rows with z-scores greater than the threshold
df = df[(z_scores < threshold).all(axis=1)]
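
The sample frame above contains no extreme values, and with only nine rows no point can reach a z-score of 3 (with the sample standard deviation the largest possible value is (n - 1) / sqrt(n), about 2.67 for n = 9), so the filter keeps every row. As a minimal sketch of the same technique actually removing something, the example below uses hypothetical data: twenty ordinary values plus one injected extreme value. The names df_z and cleaned_z are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the integers 40..59 plus one injected extreme value (500)
df_z = pd.DataFrame({"A": list(range(40, 60)) + [500]})

# Absolute z-score per column, using the sample standard deviation (ddof=1)
z_scores = np.abs((df_z - df_z.mean()) / df_z.std())

# Keep only rows whose z-score is below the threshold in every column
threshold = 3
cleaned_z = df_z[(z_scores < threshold).all(axis=1)]

print(len(df_z), len(cleaned_z))  # 21 rows before, 20 rows after
```

Because the mean and standard deviation are themselves pulled toward the extreme value, this approach tends to work best when the sample is reasonably large and outliers are few.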

IQR Method

Calculate the interquartile range (IQR) of the data and remove the data points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the 25th and 75th percentiles. The IQR measures the spread of the middle 50% of the data.

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [10, 20, 30, 40, 50, 60, 70, 80, 90]})

# Calculate Q1, Q3, and IQR for each column
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

# Set the IQR multiplier that defines the outlier bounds
threshold = 1.5

# Remove rows with values outside of the threshold
df = df[~((df < (Q1 - threshold * IQR)) | (df > (Q3 + threshold * IQR))).any(axis=1)]
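
As with the z-score example, the sample frame here has no values outside the bounds, so nothing is dropped. The sketch below wraps the same logic in a small helper and runs it on hypothetical data with one extreme value in column A. The function name remove_iqr_outliers and its parameter k are illustrative, not part of pandas.

```python
import pandas as pd

def remove_iqr_outliers(frame, k=1.5):
    """Drop rows where any numeric column lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # True wherever a value falls below the lower bound or above the upper bound
    outside = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return frame[~outside.any(axis=1)]

# Hypothetical data: column A ends with an extreme value (100)
df_iqr = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6, 7, 8, 100],
                       "B": [10, 20, 30, 40, 50, 60, 70, 80, 90]})

cleaned_iqr = remove_iqr_outliers(df_iqr)
print(cleaned_iqr["A"].tolist())  # [1, 2, 3, 4, 5, 6, 7, 8]
```

For column A, Q1 is 3 and Q3 is 7, so the bounds are -3 and 13 and only the row containing 100 is removed.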

Percentile Method

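This method calculates two percentiles of your data and removes the values that fall outside that range. Unlike the previous two methods, it trims roughly the same share of the data from each tail whether or not those values are extreme. Here’s an example that keeps only the values between the 5th and 95th percentiles: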

import pandas as pd

# create a DataFrame with some data
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# calculate the 5th and 95th percentile of the data
p5 = df['col1'].quantile(0.05)
p95 = df['col1'].quantile(0.95)

# remove outliers that fall outside the 5th and 95th percentile
df = df[(df['col1'] >= p5) & (df['col1'] <= p95)]
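
The same filter can be written a little more compactly with Series.between, which is inclusive on both ends by default. The sketch below reuses the data from the example above under illustrative names (df_pct, trimmed) and shows that the smallest and largest values are dropped even though neither is unusual.

```python
import pandas as pd

df_pct = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# 5th and 95th percentiles (linear interpolation by default): 1.45 and 9.55
lower, upper = df_pct["col1"].quantile([0.05, 0.95]).tolist()

# Keep rows whose value lies within [lower, upper]
trimmed = df_pct[df_pct["col1"].between(lower, upper)]

print(trimmed["col1"].tolist())  # [2, 3, 4, 5, 6, 7, 8, 9]
```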

Note that each of these methods has its own strengths and weaknesses, and the method you choose will depend on your specific data and goals. Additionally, it’s important to exercise caution when removing outliers, as they may contain valuable information or indicate errors in your data.
