Duplicate records are common in large volumes of data. In this post you will learn how to count the number of duplicates in Pandas.
When you start working with a new data table, you should familiarize yourself with its contents: the index, the data types, the table size, and the number of rows and columns.
The number of duplicates also matters. When there are a lot of duplicates, you have to deal with them. If there are only a few in a huge table, you can usually ignore them.
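A quick first look at a table can be sketched like this; the small dataframe below is illustrative, and the inspection calls are standard Pandas methods:

```python
import pandas as pd

# Hypothetical example table; column names are made up for illustration.
my_df = pd.DataFrame({
    "name": ["Anna", "Ben", "Anna"],
    "age": [34, 29, 34],
})

print(my_df.shape)   # (number of rows, number of columns)
print(my_df.dtypes)  # data type of each column
print(my_df.index)   # the index of the table
my_df.info()         # summary: columns, non-null counts, memory usage
```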
How to count number of duplicates in Pandas
I wrote code that counts the number of duplicates in a Pandas data table, and I would like to share it with you.
Let’s start by creating an example dataframe.
import pandas as pd
import seaborn as sns

my_df = sns.load_dataset('titanic')
I loaded a dataframe containing the passengers of the Titanic disaster, which is available through the Seaborn module.
As you probably know, the drop_duplicates method is used to remove duplicates. Here, I use it to count the number of duplicate rows instead.
import pandas as pd
import seaborn as sns

my_df = sns.load_dataset('titanic')
print(f"Number of duplicates: {len(my_df) - len(my_df.drop_duplicates())}")
The number of duplicates is the difference between the total number of rows and the number of rows after removing duplicates.
See also:
Documentation of drop_duplicates method
How to calculate cumulative sum in Pandas
How to count specific value in column
How to drop duplicates
How to find duplicates in Excel
How to resolve ValueError: Index has duplicate keys error?