Pandas groupby(): Complete Guide with Examples

The groupby() function is one of the most powerful and frequently used methods in Pandas. It allows you to split a DataFrame into groups based on one or more columns, apply operations to each group independently, and combine the results back together. This split-apply-combine workflow is essential for data analysis, aggregation, and summarization tasks.

Understanding groupby()

The groupby() function splits your DataFrame into logical groups based on specified column(s). Each group contains all rows that share the same value(s) in the grouping column(s). You can then apply aggregation functions to compute statistics or transformations for each group.

The Split-Apply-Combine Workflow

  1. Split: Divide the DataFrame into groups based on one or more columns
  2. Apply: Apply a function to each group independently
  3. Combine: Merge the results back together into a single output
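
The three steps above can be spelled out by hand; the loop below is only for exposition, since a single groupby call performs all three at once:

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 120, 180],
})

# 1. Split: iterate over the sub-DataFrames, one per unique Product
# 2. Apply: compute each sub-DataFrame's mean Sales
# 3. Combine: gather the per-group results into a single Series
manual = pd.Series(
    {name: group['Sales'].mean() for name, group in df.groupby('Product')}
)

# One groupby call does the same three steps internally
auto = df.groupby('Product')['Sales'].mean()
print(manual.to_dict())  # {'A': 160.0, 'B': 135.0}
print(auto.to_dict())    # same values
```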

Basic Syntax and Simple Example

The fundamental syntax for using groupby is straightforward:

df.groupby('column_name').aggregation_function()

Simple Example

import pandas as pd

# Create sample data
df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 120, 180],
    'Region': ['North', 'South', 'North', 'South', 'West']
})

# Group by Product and calculate mean sales
result = df.groupby('Product')['Sales'].mean()
print(result)
# Output:
# Product
# A    160.0
# B    135.0
# Name: Sales, dtype: float64

Grouping by Single Column

Common Aggregation Functions

Pandas provides several built-in aggregation functions that work directly with groupby:

# Sum of sales by product
df.groupby('Product')['Sales'].sum()

# Count of records by product
df.groupby('Product')['Sales'].count()

# Minimum sales by product
df.groupby('Product')['Sales'].min()

# Maximum sales by product
df.groupby('Product')['Sales'].max()

# Mean sales by product
df.groupby('Product')['Sales'].mean()

# Standard deviation by product
df.groupby('Product')['Sales'].std()

Comprehensive Statistics with describe()

The describe() method provides a complete statistical summary for each group:

df.groupby('Product')['Sales'].describe()

This returns count, mean, std, min, 25%, 50%, 75%, and max for each group—all in one operation.
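
On the sample df from above, the summary comes back as a DataFrame with one row per group and one column per statistic (a quick sketch):

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 120, 180],
})

# One row per group, one column per summary statistic
summary = df.groupby('Product')['Sales'].describe()
print(list(summary.columns))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary.loc['A', 'mean'])  # 160.0
```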

Grouping by Multiple Columns

To group by multiple columns, pass a list of column names to groupby. This creates groups for each unique combination of values:

# Group by Product and Region
result = df.groupby(['Product', 'Region'])['Sales'].sum()

# Get group size
result = df.groupby(['Product', 'Region']).size()
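
The result of a multi-column groupby carries a MultiIndex, one level per grouping key. `reset_index()` flattens it back to ordinary columns, and `unstack()` pivots one level into columns (sketched on the sample df):

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 120, 180],
    'Region': ['North', 'South', 'North', 'South', 'West'],
})

totals = df.groupby(['Product', 'Region'])['Sales'].sum()
print(totals.index.nlevels)  # 2: a (Product, Region) MultiIndex

# Flatten back to ordinary columns
flat = totals.reset_index()
print(flat.columns.tolist())  # ['Product', 'Region', 'Sales']

# Or pivot the Region level into columns
wide = totals.unstack('Region')
```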

Advanced Aggregation Techniques

The agg() Method for Multiple Aggregations

The agg() method provides flexibility to apply multiple aggregation functions simultaneously:

# Multiple aggregations on single column
df.groupby('Product')['Sales'].agg(['sum', 'mean', 'count'])

# Multiple aggregations with custom names (named aggregation;
# note: applied to the DataFrameGroupBy, without selecting ['Sales'] first)
df.groupby('Product').agg(
    total_sales=('Sales', 'sum'),
    avg_sales=('Sales', 'mean'),
    num_records=('Sales', 'count')
)

# Different aggregations on different columns
df.groupby('Product').agg({
    'Sales': ['sum', 'mean'],
    'Region': 'count'
})

Custom Aggregation with Lambda Functions

For more complex calculations, use lambda functions within agg():

# Calculate custom metric (range)
df.groupby('Product')['Sales'].agg(lambda x: x.max() - x.min())

# Multiple custom functions
df.groupby('Product')['Sales'].agg([
    ('range', lambda x: x.max() - x.min()),
    ('cv', lambda x: x.std() / x.mean())  # Coefficient of variation
])

Accessing and Inspecting GroupBy Objects

Get Number of Groups

grouped = df.groupby('Product')
print(grouped.ngroups)  # Returns: 2

Iterate Through Groups

grouped = df.groupby('Product')
for name, group in grouped:
    print(f"Group: {name}")
    print(group)  # DataFrame for this group

Get Specific Group

grouped = df.groupby('Product')
group_a = grouped.get_group('A')

View Group Information

grouped = df.groupby('Product')
print(grouped.groups)   # Dictionary of group names and indices
print(grouped.size())   # Size of each group
print(grouped.indices)  # Positional indices of rows in each group

Filtering Groups

Filter groups based on their characteristics using the filter() method:

# Keep only groups with more than 2 records
df.groupby('Product').filter(lambda x: len(x) > 2)

# Keep only groups with average sales greater than 150
df.groupby('Product').filter(lambda x: x['Sales'].mean() > 150)

# Keep groups where product appears in specific regions
df.groupby('Product').filter(lambda x: 'North' in x['Region'].values)

Transform Operations

The transform() method applies a function to each group while maintaining the original DataFrame structure:

# Add a column with group mean
df['Sales_Mean'] = df.groupby('Product')['Sales'].transform('mean')

# Standardize sales within each product group (z-score normalization)
df['Sales_Std'] = df.groupby('Product')['Sales'].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Calculate percentage of group total
df['Pct_of_Group'] = (df.groupby('Product')['Sales']
                       .transform(lambda x: (x / x.sum()) * 100))

Handling Missing Data in groupby()

By default, groupby() drops rows whose value in the grouping column is NaN, so those rows appear in no group. To keep them as their own group:

# Include NaN as a separate group
df.groupby('Product', dropna=False)['Sales'].sum()
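
A small sketch with an actual NaN in the grouping column shows the difference:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Product': ['A', 'B', np.nan, 'A'],
    'Sales': [100, 150, 50, 200],
})

# Default: the row with a NaN key is silently dropped from the result
print(df.groupby('Product')['Sales'].sum().to_dict())
# {'A': 300, 'B': 150}

# dropna=False keeps NaN as its own group
with_nan = df.groupby('Product', dropna=False)['Sales'].sum()
print(len(with_nan))  # 3 groups: A, B, and NaN
```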

Sorting Groups

Control whether groups are sorted by using the sort parameter:

# Sort groups by key (default: True)
df.groupby('Product', sort=True)['Sales'].mean()

# Don't sort groups (faster for large datasets)
df.groupby('Product', sort=False)['Sales'].mean()

Real-World Examples

Example 1: Sales Analysis by Category

Calculate multiple statistics for sales by product category:

sales_summary = df.groupby('Product').agg({
    'Sales': ['sum', 'mean', 'count', 'std'],
    'Region': 'nunique'
}).round(2)

print(sales_summary)

Example 2: Time Series Analysis

Group daily sales data and calculate statistics by date (this assumes a DataFrame with Date, Quantity, and Price columns, unlike the small sample above):

df['Date'] = pd.to_datetime(df['Date'])

# Group by date and calculate daily statistics
daily_stats = df.groupby(df['Date'].dt.date).agg({
    'Sales': ['sum', 'mean', 'count'],
    'Quantity': 'sum',
    'Price': 'mean'
})

Example 3: Multi-Level Grouping with Filtering

Group by multiple columns, then filter high-value groups:

result = (df.groupby(['Product', 'Region'])['Sales']
          .sum()
          .reset_index(name='Total_Sales')
          .query('Total_Sales > 300'))

print(result)

Performance Best Practices

1. Use Appropriate Data Types

Convert columns to categorical if they represent fixed categories to reduce memory usage:

df['Product'] = df['Product'].astype('category')
result = df.groupby('Product')['Sales'].sum()

2. Use as_index Parameter

Keep grouping columns as regular columns instead of the index for better readability:

df.groupby('Product', as_index=False)['Sales'].sum()

3. Select Required Columns Early

Select only the column(s) you need rather than aggregating every column in the DataFrame:

# Good practice: aggregate only the column you need
df.groupby('Product')['Sales'].sum()

# Less efficient on wide DataFrames: aggregates every numeric column
df.groupby('Product').sum(numeric_only=True)

4. Use observed=True with Categorical Data

Exclude unused categories from the result:

df['Product'] = df['Product'].astype('category')
df.groupby('Product', observed=True)['Sales'].sum()
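
A quick sketch of why this matters: with a category that never occurs in the data, the historical default (observed=False) still emits an empty group for it. Note the default has been changing across recent pandas versions, so passing the parameter explicitly is safest:

```python
import pandas as pd

df = pd.DataFrame({
    'Product': pd.Categorical(['A', 'B', 'A'], categories=['A', 'B', 'C']),
    'Sales': [100, 150, 200],
})

# observed=False emits a row for the unused category 'C' (sum of nothing is 0)
all_cats = df.groupby('Product', observed=False)['Sales'].sum()
print(len(all_cats))  # 3 rows, including C

# observed=True keeps only categories actually present in the data
seen = df.groupby('Product', observed=True)['Sales'].sum()
print(len(seen))  # 2 rows
```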

Common Pitfalls and Solutions

Pitfall 1: NaN Values Excluded from Groups

Problem: By default, NaN values are excluded from groupby operations.
Solution: Use dropna=False to include NaN as a group.

Pitfall 2: Unexpected Result Index

Problem: The grouped column becomes the index, which may cause issues in subsequent operations.
Solution: Use reset_index() to convert the index back to a regular column.
result = df.groupby('Product')['Sales'].mean().reset_index()

Pitfall 3: Loss of Column Information

Problem: When grouping on specific columns, other columns are dropped from the result.
Solution: Explicitly select all columns you need in the aggregation.
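
One way to retain the other columns is to aggregate them explicitly, for example with named aggregation (sketched on the sample df; the column names total_sales and regions are just illustrative choices):

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 120, 180],
    'Region': ['North', 'South', 'North', 'South', 'West'],
})

# df.groupby('Product')['Sales'].sum() would drop Region entirely;
# aggregating Region alongside Sales keeps it in the output
result = df.groupby('Product').agg(
    total_sales=('Sales', 'sum'),
    regions=('Region', lambda r: sorted(r.unique())),
)
print(result.loc['A', 'total_sales'])  # 480
print(result.loc['A', 'regions'])     # ['North', 'West']
```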

Comparison with Similar Functions

| Method        | Best Use Case                              | Output Format       |
| ------------- | ------------------------------------------ | ------------------- |
| groupby()     | Group data and apply aggregations          | Series or DataFrame |
| pivot_table() | Create cross-tabulations with aggregations | DataFrame (pivoted) |
| resample()    | Group time-series data by time intervals   | Series or DataFrame |
| cut()         | Group continuous data into bins            | Categorical bins    |
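
For instance, a two-key groupby-sum followed by unstack() and a pivot_table with the same keys produce the same table (sketched on the sample df):

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 120, 180],
    'Region': ['North', 'South', 'North', 'South', 'West'],
})

# groupby + unstack: aggregate, then pivot Region into columns
via_groupby = df.groupby(['Product', 'Region'])['Sales'].sum().unstack('Region')

# pivot_table does both steps in one call
via_pivot = df.pivot_table(index='Product', columns='Region',
                           values='Sales', aggfunc='sum')

print(via_groupby.equals(via_pivot))  # True
```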

Key Takeaways

Master the fundamentals:

  • Use groupby() to split data into logical groups
  • Apply aggregation functions to compute statistics for each group
  • Combine with agg(), transform(), and filter() for advanced operations
  • Always consider data types and missing values when grouping
  • Use proper indexing and memory optimization techniques for large datasets

Start applying these techniques to your data projects and unlock powerful insights through group-based analysis!
