How to Drop Duplicates in Pandas

In this article, you will learn how to drop duplicates in Pandas using the drop_duplicates() function.

The drop_duplicates() function

The drop_duplicates() function is a DataFrame method in Pandas that removes duplicate rows. It takes several arguments, but the most important ones are:

  • subset: This argument specifies the columns on which the duplicates should be checked. If this argument is not specified, then all columns will be checked for duplicates.
  • keep: This argument specifies which rows should be kept when duplicates are found. The default value is 'first', which means that the first occurrence of each duplicate row will be kept. The value 'last' keeps the last occurrence instead, and False drops every row that has a duplicate, keeping none of them.
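  • inplace: This argument specifies whether the changes should be made to the original DataFrame. The default value is False, which means that the original DataFrame is left unchanged and a new DataFrame with the duplicates removed is returned. With inplace=True, the original DataFrame is modified directly, so there is no need to assign the result.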
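The keep and inplace arguments described above can be sketched as follows. This is a minimal illustration that assumes the same small DataFrame used in the rest of this article; the variable name no_repeats is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': [4, 5, 5, 6, 7]})

# keep=False drops every row that has a duplicate,
# including the first occurrence (rows 1 and 2 here)
no_repeats = df.drop_duplicates(keep=False)
print(no_repeats.index.tolist())  # [0, 3, 4]

# inplace=True modifies df itself, so the result is not reassigned
df.drop_duplicates(inplace=True)
print(df.index.tolist())  # [0, 1, 3, 4]
```

Reassigning the result (df = df.drop_duplicates()), as the examples in this article do, is usually the easier form to read and to chain with other calls.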

Example

The following code shows how to use the drop_duplicates() function to remove duplicate rows from a DataFrame:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': [4, 5, 5, 6, 7]})

# Drop duplicates
df = df.drop_duplicates()

print(df)

The row at index 2 is an exact copy of the row at index 1, so it is removed. The remaining rows keep their original index labels (0, 1, 3 and 4).
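If you want to check which rows count as duplicates before removing them, one option is the related duplicated() method, which returns a boolean mask. The sketch below assumes the same DataFrame as the example above; the names mask and deduped are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': [4, 5, 5, 6, 7]})

# duplicated() marks each row that repeats an earlier row
mask = df.duplicated()
print(mask.tolist())  # [False, False, True, False, False]

# drop_duplicates() keeps exactly the rows that are not flagged
deduped = df.drop_duplicates()
print(deduped)
#    A  B
# 0  1  4
# 1  2  5
# 3  3  6
# 4  3  7
```

Only the row at index 2 is flagged, which matches the single row that drop_duplicates() removes.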

How to Drop Duplicates in Specific Columns in Pandas

By default, two rows count as duplicates only if they match in every column. You can also compare rows on specific columns only. To do this, use the subset argument of the drop_duplicates() function. The subset argument takes a column name or a list of column names, and only these columns are compared when looking for duplicates. The rows that are dropped are still removed whole, including the values in the other columns.

For example, the following code treats rows as duplicates when they match in both the A and B columns:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': [4, 5, 5, 6, 7]})

# Drop duplicates in columns A and B
df = df.drop_duplicates(subset=['A', 'B'])

print(df)

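Because this DataFrame contains only the A and B columns, the result is the same as calling drop_duplicates() with no arguments: the row at index 2 is removed. The subset argument only changes the result when it names fewer columns than the DataFrame has.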
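To see subset change the result, compare on column A alone. This is a small sketch using the same data; the names by_a and by_all are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': [4, 5, 5, 6, 7]})

# Compare rows on column A only: the rows at index 3 and 4 both have A == 3,
# so the row at index 4 is dropped even though its B value differs
by_a = df.drop_duplicates(subset=['A'])
print(by_a.index.tolist())  # [0, 1, 3]

# Comparing on all columns keeps the row at index 4, because (3, 7) is unique
by_all = df.drop_duplicates()
print(by_all.index.tolist())  # [0, 1, 3, 4]
```

Note that comparing on fewer columns can discard rows that differ elsewhere, so it is worth checking that the dropped rows really are redundant for your purpose.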

You can also combine subset with the keep argument to specify which rows should be kept when duplicates are found. The default value is 'first', which keeps the first occurrence of each duplicate row. The value 'last' keeps the last occurrence, and False drops all of the duplicated rows.

For example, the following code keeps the last occurrence of each duplicate row in the A and B columns:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': [4, 5, 5, 6, 7]})

# Keep last occurrence of duplicates in columns A and B
df = df.drop_duplicates(subset=['A', 'B'], keep='last')

print(df)

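This time the row at index 1 is removed and the row at index 2 is kept, so the remaining index labels are 0, 2, 3 and 4. Because the two rows are identical in both columns, the values in the result are the same as with keep='first'; only the index labels differ.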
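The choice of keep matters more when subset covers only some of the columns, because the duplicate rows can then hold different values in the remaining columns. The sketch below assumes the same data and compares on column A only; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': [4, 5, 5, 6, 7]})

# keep='first' keeps the row with B == 6 for A == 3
first = df.drop_duplicates(subset=['A'], keep='first')
print(first['B'].tolist())  # [4, 5, 6]

# keep='last' keeps the row with B == 7 for A == 3
last = df.drop_duplicates(subset=['A'], keep='last')
print(last['B'].tolist())  # [4, 5, 7]

# keep=False keeps only the rows whose A value appears exactly once
unique_only = df.drop_duplicates(subset=['A'], keep=False)
print(unique_only['A'].tolist())  # [1]

# The original index labels are retained; reset_index renumbers them
print(last.index.tolist())                         # [0, 2, 4]
print(last.reset_index(drop=True).index.tolist())  # [0, 1, 2]
```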

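In this article, you learned how to drop duplicates in Pandas using the drop_duplicates() function, including how to limit the comparison to specific columns with subset and how to choose which occurrence to retain with keep. This is a useful function for cleaning up your data and removing redundant rows.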

Documentation of the drop_duplicates function: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html 
