Pandas Data Validation

Data validation is an essential step in any data analysis or machine learning project. It checks data quality, consistency, and correctness so that the data is reliable and suitable for the intended analysis or modeling. Pandas provides tools for the most common checks, including missing values, duplicates, data types, and value ranges. Here are some common data validation tasks in Pandas:

Checking for missing values

Use the isna method to get a boolean mask of the missing values in a dataframe or series, and chain the sum method to count the missing values in each column.

import pandas as pd

# create a dataframe with missing values
df = pd.DataFrame({'A': [1, 2, pd.NA, 4], 'B': [5, pd.NA, pd.NA, 8]})

# check for missing values
print(df.isna())

# count missing values by column
print(df.isna().sum())
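
Counts are useful for inspection, but a validation step usually needs a pass/fail rule. The following is a minimal sketch of one possible rule, assuming a cutoff named max_missing_fraction; that name and its value are hypothetical and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, pd.NA, 4], 'B': [5, pd.NA, pd.NA, 8]})

# fraction of missing values per column (mean of the boolean mask)
missing_fraction = df.isna().mean()

# hypothetical threshold, chosen only for illustration
max_missing_fraction = 0.3

# columns whose share of missing values exceeds the threshold
too_sparse = missing_fraction[missing_fraction > max_missing_fraction].index.tolist()
print(too_sparse)
```

Here column A has one missing value out of four and column B has two out of four, so only B exceeds the assumed cutoff.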

Checking for duplicates

Use the duplicated method to flag duplicate rows in a dataframe (by default the first occurrence is not flagged, only its later repeats), and use the drop_duplicates method to remove duplicate rows.

# create a dataframe with duplicate rows
df = pd.DataFrame({'A': [1, 2, 2, 3], 'B': [4, 5, 5, 6]})

# check for duplicate rows
print(df.duplicated())

# remove duplicate rows
df.drop_duplicates(inplace=True)
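
In practice, duplicates often matter only for a key column rather than for whole rows. The sketch below assumes a dataframe with hypothetical columns id and value, and uses the subset and keep parameters of duplicated to show every row that shares a key.

```python
import pandas as pd

# hypothetical data: 'id' is meant to be a unique key
df = pd.DataFrame({'id': [1, 2, 2, 3], 'value': [10, 20, 25, 30]})

# keep=False marks every member of a duplicate group, not just the later ones
dupe_ids = df[df.duplicated(subset=['id'], keep=False)]
print(dupe_ids)

# quick check: True only when the key column has no repeated values
ids_are_unique = df['id'].is_unique
print(ids_are_unique)
```

Showing all conflicting rows side by side makes it easier to decide which one to keep before calling drop_duplicates.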

Checking data types

Use the dtypes attribute to check the data types of columns in a dataframe, and use the astype method to convert columns to a specified data type. Note that astype raises an error if any value cannot be converted.

# create a dataframe with mixed data types
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['4', '5', '6']})

# check data types of columns
print(df.dtypes)

# convert column B to integer data type
df['B'] = df['B'].astype(int)
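
When a column may contain values that do not parse as numbers, it can help to locate them before converting. A minimal sketch, assuming a column with one deliberately invalid entry ('six' is made up for illustration), using pd.to_numeric with errors='coerce':

```python
import pandas as pd

# hypothetical data with one value that is not a valid number
df = pd.DataFrame({'B': ['4', '5', 'six']})

# errors='coerce' turns unparseable entries into NaN instead of raising
converted = pd.to_numeric(df['B'], errors='coerce')

# rows that had a value originally but failed the conversion
bad_rows = df[converted.isna() & df['B'].notna()]
print(bad_rows)
```

Once bad_rows is empty, the column can be converted without risk of an exception.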

Checking data range

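Use the describe method to check the summary statistics of a dataframe or series, such as mean, min, max, and quartiles. Comparing the reported min and max against the values you expect is a quick way to spot out-of-range data.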

# create a dataframe with numeric data
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [100, 200, 300, 400, 500]})

# check summary statistics
print(df.describe())
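
Summary statistics show that a problem exists, but not which rows cause it. One way to find those rows is the between method on a series. The sketch below assumes an allowed range of 100 to 400 for column B; the bounds are hypothetical and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [100, 200, 300, 400, 500]})

# hypothetical allowed range for column B
lower, upper = 100, 400

# between is inclusive on both ends by default
in_range = df['B'].between(lower, upper)

# rows whose value in B falls outside the allowed range
out_of_range = df[~in_range]
print(out_of_range)
```

With these assumed bounds, only the last row (B equal to 500) is reported.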

These are just a few examples of data validation tasks in Pandas. There are many more functions and tools available in Pandas and other Python libraries for various data validation tasks. Check out the pandas documentation and more resources on pandashowto.com for more information.
