As a data scientist, I can tell you that the first thing to do when working with data is to clean your dataset. Often that means making the rows unique, so you need to know how to remove duplicates in Pandas.
Pandas offers a dedicated drop_duplicates method, which drops duplicate rows from a DataFrame.
Pandas drop_duplicates
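import pandas as pd
df = pd.DataFrame({"A": [1, 2, 2], "B": [4, 5, 5]})
# keep=False drops every row that has a duplicate; inplace=False (the default) returns a new DataFrame
df = df.drop_duplicates(keep=False, inplace=False)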
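Remember these two parameters:
- keep controls which of the duplicated rows survive: 'first' (the default) keeps the first occurrence, 'last' keeps the last occurrence, and False drops every row that has a duplicate
- inplace controls where the change is saved: with inplace=False (the default) the method returns a new DataFrame, so you have to assign the result back; with inplace=True the original DataFrame is modified and the method returns None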
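To see what inplace actually changes, here is a minimal sketch built on the same toy data. The variable names (deduped, no_repeats, df_copy, result) are only illustrative and not part of Pandas.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2], "B": [4, 5, 5]})

# Default (inplace=False): df itself is untouched, a new DataFrame is returned.
deduped = df.drop_duplicates()               # keep="first" is the default
no_repeats = df.drop_duplicates(keep=False)  # drop every row that has a duplicate

# inplace=True: the DataFrame is modified directly and the call returns None.
df_copy = df.copy()
result = df_copy.drop_duplicates(inplace=True)

print(len(df), len(deduped), len(no_repeats), len(df_copy), result)
# 3 2 1 2 None
```

Because inplace=True returns None, writing df = df.drop_duplicates(inplace=True) would replace your DataFrame with None. Use either the assignment or inplace=True, not both.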
By default, the first occurrence of each duplicated row is kept and subsequent duplicates are dropped. To keep the last occurrence of each duplicated row, you can specify keep='last':
df = df.drop_duplicates(keep='last')
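Since duplicated rows hold identical values, the difference between the first and the last occurrence only shows up in the index labels that remain. A small sketch on the same toy data (the names first and last are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2], "B": [4, 5, 5]})

# Rows with index 1 and 2 are identical; keep decides which one stays.
first = df.drop_duplicates(keep="first")
last = df.drop_duplicates(keep="last")

print(first.index.tolist())  # [0, 1]
print(last.index.tolist())   # [0, 2]
```

This matters mostly when the index carries meaning of its own, for example when the rows are sorted by time and you want to keep the most recent entry.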
Documentation of the drop_duplicates function: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html