Grouping in pandas, done with the groupby method, splits the data into groups based on some criteria and applies a function to each group independently. It’s a powerful operation for data analysis and data aggregation.
Here’s how you can group data in pandas:
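To make the split and apply steps concrete, here is a minimal sketch that does the same work by hand and then compares it with the built-in call. The DataFrame and its column names (key, value) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Split: iterating over a groupby yields (group key, sub-DataFrame) pairs
# Apply: sum the 'value' column of each sub-DataFrame
manual = {}
for key, group in df.groupby("key"):
    manual[key] = group["value"].sum()

# Combine: collect the per-group results into one Series
combined = pd.Series(manual, name="value")

# The built-in aggregation performs all three steps in one call
builtin = df.groupby("key")["value"].sum()

print(combined)
print(builtin)
```

Both printed Series should hold one total per key.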
Load data into a pandas DataFrame
import pandas as pd

# load the data into a pandas DataFrame
df = pd.read_csv('data.csv')
Group the data by a single column
# group the data by column 'col1'
grouped = df.groupby('col1')
Apply a function to each group
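# apply the mean function to each group (numeric columns only;
# recent pandas versions raise an error on non-numeric columns otherwise)
mean_by_group = grouped.mean(numeric_only=True)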
In the above example, the data is grouped by the values in column ‘col1’ and the mean function is applied to each group. The resulting DataFrame mean_by_group has one row per unique value in column ‘col1’ (rows where ‘col1’ is missing are dropped by default), and each row holds the mean of every numeric column within that group. The group keys become the index of the result.
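The following is a small, self-contained sketch of the same steps. It uses a hand-built DataFrame in place of data.csv, so the values are hypothetical and chosen only to make the result easy to check.

```python
import pandas as pd

# Hypothetical sample data standing in for data.csv
df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "b"],
    "col3": [1.0, 3.0, 2.0, 4.0, 6.0],
    "col4": [10, 20, 30, 40, 50],
})

# Group by 'col1', then take the mean of each numeric column per group
grouped = df.groupby("col1")
mean_by_group = grouped.mean(numeric_only=True)

# Two unique values in 'col1', so the result has two rows
print(mean_by_group)
```

Here group ‘a’ averages to 2.0 for col3 and 15.0 for col4, and group ‘b’ to 4.0 and 40.0.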
Group the data by multiple columns
# group the data by multiple columns 'col1' and 'col2'
grouped = df.groupby(['col1', 'col2'])
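In this example, the data is grouped by the combination of values in columns ‘col1’ and ‘col2’. Note that grouped is a GroupBy object, not a DataFrame. Any aggregation applied to it returns one row per unique combination of values that actually occurs in columns ‘col1’ and ‘col2’, indexed by a MultiIndex built from the two keys.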
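As a rough illustration of grouping on two keys, the sketch below counts the groups and aggregates one column. The data is hypothetical, and the combination (‘b’, ‘y’) is deliberately left out to show that only combinations present in the data form a group.

```python
import pandas as pd

# Hypothetical sample data; ('b', 'y') never occurs
df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b"],
    "col2": ["x", "x", "y", "x", "x"],
    "col3": [1, 2, 3, 4, 5],
})

grouped = df.groupby(["col1", "col2"])

# Number of distinct (col1, col2) combinations found in the data
n_groups = grouped.ngroups

# Number of rows in each group
sizes = grouped.size()

# Sum of 'col3' per group, indexed by a (col1, col2) MultiIndex
sum_by_group = grouped["col3"].sum()

print(n_groups)
print(sizes)
print(sum_by_group)
```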
Apply multiple functions to each group
# apply multiple functions to each group
agg_by_group = grouped.agg({
    'col3': ['mean', 'sum'],
    'col4': ['max', 'min'],
})
In this example, mean and sum are applied to ‘col3’, and max and min are applied to ‘col4’, within each group. The resulting DataFrame agg_by_group has one row per unique combination of values in columns ‘col1’ and ‘col2’, and one column per (column, function) pair. Those columns form a two-level MultiIndex, for example (‘col3’, ‘mean’).
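The two-level column labels can be awkward to work with. The sketch below, on hypothetical data, shows the shape of the result and two common ways to get flat column names: joining the levels by hand, or using named aggregation. The output names col3_mean and col4_max are chosen here for illustration.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "col1": ["a", "a", "b", "b"],
    "col2": ["x", "x", "y", "y"],
    "col3": [1.0, 3.0, 5.0, 7.0],
    "col4": [10, 20, 30, 40],
})

grouped = df.groupby(["col1", "col2"])

# Dict-of-lists aggregation: columns come back as a two-level MultiIndex
agg_by_group = grouped.agg({"col3": ["mean", "sum"], "col4": ["max", "min"]})

# Option 1: flatten the column labels by joining the two levels
flat = agg_by_group.copy()
flat.columns = ["_".join(pair) for pair in flat.columns]

# Option 2: named aggregation, output_name=(input_column, function)
named = grouped.agg(
    col3_mean=("col3", "mean"),
    col4_max=("col4", "max"),
)

print(agg_by_group)
print(flat)
print(named)
```

Named aggregation tends to be the tidier choice when the result feeds into further processing, since the output columns are plain strings from the start.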
Custom function for grouping
# define a custom function for grouping
def custom_func(x):
    return x.sum()

# apply the custom function to each group
custom_by_group = grouped['col3'].apply(custom_func)
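In this example, a custom function custom_func is defined and applied to each group. The function takes a pandas Series (the ‘col3’ values of one group) as input and returns the sum of its values. Because a single column was selected and the function returns a scalar, the result custom_by_group is a Series rather than a DataFrame, with one entry per unique combination of values in columns ‘col1’ and ‘col2’. For a plain sum, grouped['col3'].sum() gives the same result and is faster; custom functions are most useful when no built-in method fits.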
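As a sketch of a custom function that has no direct built-in equivalent, the example below computes the spread (maximum minus minimum) of each group. The function name value_range and the data are hypothetical.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "col1": ["a", "a", "b", "b"],
    "col2": ["x", "x", "y", "y"],
    "col3": [1.0, 3.0, 5.0, 9.0],
})


def value_range(x: pd.Series) -> float:
    """Return the spread (max minus min) of one group's values."""
    return x.max() - x.min()


grouped = df.groupby(["col1", "col2"])

# apply and agg both accept a function that reduces a Series to a scalar
range_apply = grouped["col3"].apply(value_range)
range_agg = grouped["col3"].agg(value_range)

print(type(range_apply).__name__)
print(range_agg)
```

When the function reduces each group to a single value, agg states that intent more directly than apply.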
These are the basic steps for grouping data in pandas, and they can be combined for more complex analysis and aggregation tasks. GroupBy objects also provide many built-in aggregation methods, such as count, sum, min, max, mean, median, first, and last. You can call these directly on the GroupBy object without going through the agg method.
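A short sketch of calling several of these methods directly, again on hypothetical data. The results are gathered into one DataFrame only to make them easy to compare side by side.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "b"],
    "col3": [1, 3, 2, 4, 6],
})

# Select one column from the grouped data
grouped = df.groupby("col1")["col3"]

# Each method is called directly on the GroupBy object
summary = pd.DataFrame({
    "count": grouped.count(),
    "median": grouped.median(),
    "first": grouped.first(),
    "last": grouped.last(),
})

# Calling the method directly matches passing its name to agg
same_as_agg = grouped.sum().equals(grouped.agg("sum"))

print(summary)
print(same_as_agg)
```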