There are several ways to remove outliers in pandas. Here are a few commonly used techniques:
Z-Score Method
Calculate the z-score of each data point and remove the rows whose absolute z-score exceeds a chosen threshold (3 is a common choice). The z-score measures how many standard deviations a data point lies from the mean of its column.
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [10, 20, 30, 40, 50, 60, 70, 80, 90]})

# Calculate the absolute z-score for each column
z_scores = np.abs((df - df.mean()) / df.std())

# Set the threshold for outlier removal
threshold = 3

# Keep only the rows whose z-scores are below the threshold in every column
df = df[(z_scores < threshold).all(axis=1)]
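The sample data above contains no outliers, so the filter leaves it unchanged. In fact, with the sample standard deviation that `df.std()` uses by default, the largest possible absolute z-score among n points is (n - 1) / sqrt(n), which is about 2.67 for 9 rows, so a threshold of 3 can never remove anything from a sample that small. Here is a minimal sketch with a larger sample and one obvious outlier. The helper name `drop_zscore_outliers` and the data are made up for illustration, and treating constant columns as outlier-free is an assumption of this sketch.

```python
import pandas as pd


def drop_zscore_outliers(df, threshold=3.0):
    """Hypothetical helper: keep rows whose |z-score| is below threshold in every column."""
    z = (df - df.mean()) / df.std()
    # A constant column has std 0 and yields NaN z-scores (0/0).
    # Assumption: treat those (and missing values) as non-outliers.
    z = z.abs().fillna(0)
    return df[(z < threshold).all(axis=1)]


# Illustrative data: 19 values between 10 and 12, plus one extreme value
values = [10, 11, 12, 11] * 5
values[-1] = 100
df = pd.DataFrame({'A': values})

cleaned = drop_zscore_outliers(df)
print(len(df), '->', len(cleaned))  # 20 -> 19
```

Because the mean and standard deviation are themselves pulled toward extreme values, this method tends to work best when the sample is reasonably large and outliers are rare.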
IQR Method
Calculate the interquartile range (IQR) of the data and remove the data points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the 25th and 75th percentiles and IQR = Q3 - Q1. The IQR measures the spread of the middle 50% of the data.
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [10, 20, 30, 40, 50, 60, 70, 80, 90]})

# Calculate Q1, Q3, and IQR for each column
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

# Set the multiplier that defines the outlier fences
threshold = 1.5

# Flag values outside the fences, then drop every row that contains one
outliers = (df < (Q1 - threshold * IQR)) | (df > (Q3 + threshold * IQR))
df = df[~outliers.any(axis=1)]
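As with the z-score example, the sample data here has no values outside the fences, so nothing is removed. The sketch below wraps the same logic in a function and runs it on data with one clear outlier. The name `drop_iqr_outliers`, the parameter `k`, and the data are hypothetical and chosen only for illustration.

```python
import pandas as pd


def drop_iqr_outliers(df, k=1.5):
    """Hypothetical helper: keep rows where every column lies within the IQR fences."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons align the per-column bounds with the DataFrame's columns.
    # Note: rows containing NaN fail both comparisons and are dropped as well.
    inside = (df >= lower) & (df <= upper)
    return df[inside.all(axis=1)]


# Illustrative data: column A has one extreme value, column B has none
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
                   'B': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

cleaned = drop_iqr_outliers(df)
print(len(df), '->', len(cleaned))  # 10 -> 9
```

For column A the fences work out to -3.5 and 14.5, so only the row containing 100 is dropped. Since quartiles are barely affected by a single extreme value, this method still works on small samples.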
Percentile Method
This method involves calculating percentiles of your data and removing the values that fall outside a specified range, such as below the 5th or above the 95th percentile. Here’s an example:
import pandas as pd

# Create a DataFrame with some data
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# Calculate the 5th and 95th percentiles of the data
p5 = df['col1'].quantile(0.05)
p95 = df['col1'].quantile(0.95)

# Keep only the rows between the 5th and 95th percentiles (inclusive)
df = df[(df['col1'] >= p5) & (df['col1'] <= p95)]
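One thing to keep in mind is that percentile trimming always removes the tails of the data, whether or not those values are really outliers. The sketch below expresses the same filter with `Series.between`, which is inclusive on both ends by default, and shows what it does to the evenly spaced sample data. The function name `trim_percentiles` and its parameters are hypothetical.

```python
import pandas as pd


def trim_percentiles(df, column, lower=0.05, upper=0.95):
    """Hypothetical helper: keep rows whose value in `column` lies between two percentiles."""
    lo = df[column].quantile(lower)
    hi = df[column].quantile(upper)
    # between() includes both endpoints by default
    return df[df[column].between(lo, hi)]


df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

trimmed = trim_percentiles(df, 'col1')
print(trimmed['col1'].tolist())  # [2, 3, 4, 5, 6, 7, 8, 9]
```

Here the 5th and 95th percentiles are 1.45 and 9.55, so the values 1 and 10 are dropped even though nothing about them is unusual.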
Note that each of these methods has its own strengths and weaknesses, and the method you choose will depend on your specific data and goals. Additionally, it’s important to exercise caution when removing outliers, as they may contain valuable information or indicate errors in your data.