Grouping data is one of the most fundamental operations in data analysis. Whether you need total sales by region and product, average salary by department and job level, or order counts by customer and month — grouping by multiple columns is how you answer those questions.
Pandas groupby() is the tool that makes this possible, and once you understand how it works, it becomes one of the most reached-for functions in your entire data science toolkit.
In this guide, we will walk through everything you need to know about grouping data by multiple columns in pandas, from basic syntax to advanced aggregation patterns, with real examples.
Setting Up the Example Dataset
python
import pandas as pd
import numpy as np
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve',
             'Frank', 'Grace', 'Henry', 'Iris', 'Jack'],
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR',
                   'Sales', 'Marketing', 'Engineering', 'HR',
                   'Sales', 'Engineering'],
    'region': ['North', 'South', 'North', 'East',
               'West', 'South', 'North', 'East',
               'West', 'South'],
    'job_level': ['Senior', 'Junior', 'Mid', 'Senior',
                  'Junior', 'Mid', 'Senior', 'Junior',
                  'Mid', 'Senior'],
    'salary': [95000, 62000, 78000, 88000, 55000,
               71000, 92000, 58000, 67000, 89000],
    'bonus': [12000, 5000, 8000, 10000, 4000,
              7000, 11000, 5500, 6500, 9000],
    'years_exp': [8, 3, 5, 7, 2, 4, 9, 3, 5, 7]
}
df = pd.DataFrame(data)
print(df.head())
Output:
| name | department | region | job_level | salary | bonus | years_exp |
|---|---|---|---|---|---|---|
| Alice | Engineering | North | Senior | 95000 | 12000 | 8 |
| Bob | Marketing | South | Junior | 62000 | 5000 | 3 |
| Charlie | Engineering | North | Mid | 78000 | 8000 | 5 |
| Diana | HR | East | Senior | 88000 | 10000 | 7 |
| Eve | Sales | West | Junior | 55000 | 4000 | 2 |
Basic groupby() With a Single Column
Before grouping by multiple columns, it helps to understand the single-column version.
python
# Group by one column and sum salaries
result = df.groupby('department')['salary'].sum()
print(result)
Output:
department
Engineering 354000
HR 146000
Marketing 133000
Sales 122000
Name: salary, dtype: int64
This gives total salary per department. Now let us extend this to multiple columns.
groupby() With Multiple Columns — The Core Pattern
Pass a list of column names to groupby() to group by multiple columns simultaneously.
python
# Group by two columns
result = df.groupby(['department', 'region'])['salary'].sum()
print(result)
Output:
department region
Engineering North 265000
South 89000
HR East 146000
Marketing South 133000
Sales West 122000
Name: salary, dtype: int64
The result has a MultiIndex — a hierarchical index with both department and region as levels. Each unique combination of department + region becomes one group.
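The MultiIndex is not just cosmetic: it supports tuple-based selection. A minimal sketch with a toy frame (a cut-down version of the example dataset above) showing the common access patterns:

```python
import pandas as pd

# Minimal frame mirroring the example dataset
df = pd.DataFrame({
    'department': ['Engineering', 'Engineering', 'Engineering', 'HR'],
    'region': ['North', 'North', 'South', 'East'],
    'salary': [95000, 78000, 89000, 88000],
})

result = df.groupby(['department', 'region'])['salary'].sum()

# Select one group with a tuple of level values
print(result.loc[('Engineering', 'North')])   # 173000

# Select every group under one outer level
print(result.loc['Engineering'])

# .xs() slices on an inner level by name
print(result.xs('North', level='region'))
```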
Resetting the Index — Flat DataFrame Output
The MultiIndex output can be difficult to work with. Use reset_index() to convert back to a flat DataFrame.
python
result = df.groupby(['department', 'region'])['salary'].sum().reset_index()
print(result)
Output:
| department | region | salary |
|---|---|---|
| Engineering | North | 265000 |
| Engineering | South | 89000 |
| HR | East | 146000 |
| Marketing | South | 133000 |
| Sales | West | 122000 |
Now you have a clean, flat DataFrame where department and region are regular columns — easy to sort, filter, and export.
Best practice: Almost always chain .reset_index() after groupby() for cleaner output.
Grouping by Three or More Columns
The same pattern extends to any number of columns.
python
result = df.groupby(['department', 'region', 'job_level'])['salary'].mean().reset_index()
result['salary'] = result['salary'].round(0)
print(result)
Output:
| department | region | job_level | salary |
|---|---|---|---|
| Engineering | North | Mid | 78000 |
| Engineering | North | Senior | 93500 |
| Engineering | South | Senior | 89000 |
| HR | East | Junior | 58000 |
| HR | East | Senior | 88000 |
| Marketing | South | Junior | 62000 |
| Marketing | South | Mid | 71000 |
| Sales | West | Junior | 55000 |
| Sales | West | Mid | 67000 |
Three-level grouping gives you the average salary for every unique combination of department, region, and job level.
Common Aggregation Functions
sum() — Total Values
python
dept_region_salary = df.groupby(['department', 'region'])['salary'].sum().reset_index()
dept_region_salary.columns = ['department', 'region', 'total_salary']
print(dept_region_salary)
mean() — Average Values
python
avg_salary = df.groupby(['department', 'job_level'])['salary'].mean().reset_index()
avg_salary['salary'] = avg_salary['salary'].round(2)
print(avg_salary)
count() — Count Rows Per Group
python
headcount = df.groupby(['department', 'region'])['name'].count().reset_index()
headcount.columns = ['department', 'region', 'headcount']
print(headcount)
min() and max() — Range Analysis
python
salary_range = df.groupby(['department', 'job_level']).agg(
min_salary=('salary', 'min'),
max_salary=('salary', 'max')
).reset_index()
print(salary_range)
Aggregating Multiple Columns at Once
Select a list of columns (double square brackets) before calling the aggregation function to aggregate multiple columns simultaneously.
python
# Sum both salary and bonus grouped by department and region
result = df.groupby(['department', 'region'])[['salary', 'bonus']].sum().reset_index()
print(result)
Output:
| department | region | salary | bonus |
|---|---|---|---|
| Engineering | North | 265000 | 31000 |
| Engineering | South | 89000 | 9000 |
| HR | East | 146000 | 15500 |
| Marketing | South | 133000 | 12000 |
| Sales | West | 122000 | 10500 |
Named Aggregation With agg() — Most Flexible Pattern
The agg() method with named aggregation is the most powerful and readable way to apply multiple different aggregations to different columns in one step.
Basic Named Aggregation
python
result = df.groupby(['department', 'region']).agg(
total_salary=('salary', 'sum'),
avg_salary=('salary', 'mean'),
max_salary=('salary', 'max'),
total_bonus=('bonus', 'sum'),
headcount=('name', 'count'),
avg_experience=('years_exp', 'mean')
).reset_index()
result['avg_salary'] = result['avg_salary'].round(0)
result['avg_experience'] = result['avg_experience'].round(1)
print(result)
Output:
| department | region | total_salary | avg_salary | max_salary | total_bonus | headcount | avg_experience |
|---|---|---|---|---|---|---|---|
| Engineering | North | 265000 | 88333 | 95000 | 31000 | 3 | 7.3 |
| Engineering | South | 89000 | 89000 | 89000 | 9000 | 1 | 7.0 |
| HR | East | 146000 | 73000 | 88000 | 15500 | 2 | 5.0 |
| Marketing | South | 133000 | 66500 | 71000 | 12000 | 2 | 3.5 |
| Sales | West | 122000 | 61000 | 67000 | 10500 | 2 | 3.5 |
The syntax is:
python
new_column_name=('source_column', 'aggregation_function')
This is clean, readable, and produces exactly the column names you want without any renaming step.
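The tuple shorthand is itself sugar for pd.NamedAgg, which spells out the same thing explicitly. A small sketch with a toy frame (the column names here are illustrative, not from the dataset above):

```python
import pandas as pd

df = pd.DataFrame({
    'dept': ['A', 'A', 'B'],
    'salary': [100, 200, 300],
})

# pd.NamedAgg(column=..., aggfunc=...) is equivalent to the tuple form
result = df.groupby('dept').agg(
    total_salary=pd.NamedAgg(column='salary', aggfunc='sum')
).reset_index()
print(result)   # dept A -> 300, dept B -> 300
```

The tuple form is shorter and more common; pd.NamedAgg can read more clearly when the aggregation spec is built programmatically.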
Using Custom Functions in agg()
python
result = df.groupby(['department', 'job_level']).agg(
avg_salary=('salary', 'mean'),
salary_range=('salary', lambda x: x.max() - x.min()),
headcount=('name', 'count'),
total_comp=('salary', lambda x: x.sum() + df.loc[x.index, 'bonus'].sum())
).reset_index()
print(result)
Lambda functions inside agg() give you complete flexibility for custom calculations. Note that the total_comp lambda reaches back into the outer df through the group's index (x.index); this works, but it couples the lambda to the original DataFrame, so use that pattern with care.
Sorting Results After Groupby
Chain sort_values() after reset_index() to sort the output.
python
# Top departments by total salary, then by region
result = (
df.groupby(['department', 'region'])['salary']
.sum()
.reset_index()
.sort_values('salary', ascending=False)
.reset_index(drop=True)
)
print(result)
Output:
| department | region | salary |
|---|---|---|
| Engineering | North | 265000 |
| HR | East | 146000 |
| Marketing | South | 133000 |
| Sales | West | 122000 |
| Engineering | South | 89000 |
Multi-Column Sort
python
result = (
df.groupby(['department', 'job_level'])['salary']
.mean()
.reset_index()
.sort_values(['department', 'salary'], ascending=[True, False])
)
print(result)
Filtering Groups With filter()
The filter() method keeps or removes entire groups based on a condition applied to the group.
python
# Keep only departments where average salary exceeds 75000
high_salary_depts = df.groupby('department').filter(
lambda x: x['salary'].mean() > 75000
)
print(high_salary_depts[['name', 'department', 'salary']])
python
# Keep only department-region combinations with more than 1 employee
multi_employee_groups = df.groupby(['department', 'region']).filter(
lambda x: len(x) > 1
)
print(multi_employee_groups[['name', 'department', 'region', 'salary']])
Adding Group Aggregations Back to the Original DataFrame
Use transform() to calculate group-level statistics and add them as new columns to the original DataFrame — keeping all rows intact.
python
# Add department average salary and department headcount to each row
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean').round(0)
df['dept_headcount'] = df.groupby('department')['name'].transform('count')
# Add multi-column group stats
df['dept_region_avg_salary'] = df.groupby(
['department', 'region']
)['salary'].transform('mean').round(0)
print(df[['name', 'department', 'region', 'salary',
'dept_avg_salary', 'dept_headcount',
'dept_region_avg_salary']])
Output:
| name | department | region | salary | dept_avg_salary | dept_headcount | dept_region_avg_salary |
|---|---|---|---|---|---|---|
| Alice | Engineering | North | 95000 | 88500 | 4 | 88333 |
| Bob | Marketing | South | 62000 | 66500 | 2 | 66500 |
| Charlie | Engineering | North | 78000 | 88500 | 4 | 88333 |
| Diana | HR | East | 88000 | 73000 | 2 | 73000 |
transform() is extremely useful for calculating things like “salary relative to department average” or “percentage of department total.”
python
# Salary as percentage of department total
df['pct_of_dept_salary'] = (
df['salary'] / df.groupby('department')['salary'].transform('sum') * 100
).round(1)
print(df[['name', 'department', 'salary', 'pct_of_dept_salary']])
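The "salary relative to department average" calculation mentioned above follows the same transform pattern. A minimal self-contained sketch with a cut-down frame:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Charlie', 'Bob', 'Frank'],
    'department': ['Engineering', 'Engineering', 'Marketing', 'Marketing'],
    'salary': [95000, 78000, 62000, 71000],
})

# Difference between each salary and its department mean
df['vs_dept_avg'] = (
    df['salary'] - df.groupby('department')['salary'].transform('mean')
)
print(df)   # Alice +8500, Charlie -8500, Bob -4500, Frank +4500
```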
Groupby With as_index=False — Skip reset_index()
Setting as_index=False prevents the group columns from becoming an index — giving you a flat DataFrame directly without needing reset_index().
python
# as_index=False is equivalent to chaining .reset_index()
result = df.groupby(
['department', 'region'],
as_index=False
)['salary'].sum()
print(result)
Both approaches produce the same result. as_index=False is slightly more concise. reset_index() is more explicit and widely used.
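The equivalence is easy to check directly with DataFrame.equals(). A quick sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Eng', 'Eng', 'HR'],
    'region': ['North', 'North', 'East'],
    'salary': [100, 200, 300],
})

flat = df.groupby(['department', 'region'], as_index=False)['salary'].sum()
reset = df.groupby(['department', 'region'])['salary'].sum().reset_index()

# Same columns, same rows, same dtypes
print(flat.equals(reset))   # True
```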
Real-World Use Cases
HR Analytics — Department and Level Summary
python
hr_summary = df.groupby(['department', 'job_level']).agg(
headcount=('name', 'count'),
avg_salary=('salary', 'mean'),
avg_bonus=('bonus', 'mean'),
avg_experience=('years_exp', 'mean')
).reset_index()
hr_summary['avg_salary'] = hr_summary['avg_salary'].round(0)
hr_summary['avg_bonus'] = hr_summary['avg_bonus'].round(0)
hr_summary['avg_experience'] = hr_summary['avg_experience'].round(1)
print(hr_summary.sort_values(['department', 'avg_salary'], ascending=[True, False]))
Sales Analysis — Region and Category Performance
python
# Simulated sales data
np.random.seed(42)  # fix the seed so the simulated numbers are reproducible
sales_data = df.copy()
sales_data['revenue'] = np.random.randint(50000, 200000, len(df))
sales_data['units'] = np.random.randint(100, 500, len(df))
sales_summary = sales_data.groupby(['region', 'department']).agg(
total_revenue=('revenue', 'sum'),
avg_revenue=('revenue', 'mean'),
total_units=('units', 'sum'),
transaction_count=('revenue', 'count')
).reset_index()
sales_summary['revenue_per_unit'] = (
sales_summary['total_revenue'] / sales_summary['total_units']
).round(2)
print(sales_summary.sort_values('total_revenue', ascending=False))
Compensation Benchmarking
python
compensation = df.groupby(['department', 'job_level']).agg(
headcount=('salary', 'count'),
min_salary=('salary', 'min'),
p25_salary=('salary', lambda x: x.quantile(0.25)),
median_salary=('salary', 'median'),
p75_salary=('salary', lambda x: x.quantile(0.75)),
max_salary=('salary', 'max')
).reset_index()
print(compensation)
Comparison Table: Groupby Output Options
| Method | Returns | Index Type | Best For |
|---|---|---|---|
| groupby().sum() | Series or DataFrame | MultiIndex | Quick single-metric summaries |
| groupby().sum().reset_index() | Flat DataFrame | Integer | Most reporting use cases |
| groupby(as_index=False).sum() | Flat DataFrame | Integer | Same as above, slightly shorter |
| groupby().agg(name=(...)) | DataFrame | MultiIndex | Multiple metrics, clean names |
| groupby().agg().reset_index() | Flat DataFrame | Integer | Production reporting |
| groupby().transform() | Same shape as input | Original | Adding group stats to each row |
| groupby().filter() | Subset of original | Original | Removing entire groups |
Common Mistakes to Avoid
- Forgetting reset_index() and being surprised by MultiIndex — The MultiIndex output of groupby is useful in some contexts but confusing for most reporting. Get in the habit of chaining `.reset_index()` unless you specifically need the MultiIndex
- Passing a string instead of a list for multiple columns — `df.groupby('department', 'region')` raises an error, because the second positional argument is not interpreted as another grouping key. Always pass multiple columns as a list: `df.groupby(['department', 'region'])`
- Using agg() with a dictionary instead of named aggregation — The old-style `agg({'salary': 'sum'})` dictionary syntax produces columns named after the source column, not custom names. Use the named aggregation syntax `agg(total_salary=('salary', 'sum'))` for cleaner, more readable output
- Confusing groupby().agg() with groupby().transform() — `agg()` collapses groups into one row per group. `transform()` returns a result with the same shape as the input, one value per original row. Use `agg()` for summary tables and `transform()` for adding group statistics back to the original DataFrame
- Applying count() to a column with NaN values — `count()` excludes NaN values. If you want to count all rows in a group including NaN, use `size()` instead: `df.groupby(['dept', 'region']).size().reset_index(name='count')`
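The count() versus size() difference is easy to demonstrate with a tiny frame containing a missing value (a minimal sketch, not tied to the employee dataset above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'dept': ['A', 'A', 'B'],
    'val': [1.0, np.nan, 3.0],
})

# count() skips the NaN; size() counts every row in the group
counts = df.groupby('dept')['val'].count()
sizes = df.groupby('dept').size()
print(counts)   # A: 1, B: 1
print(sizes)    # A: 2, B: 1
```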
Quick Reference Cheat Sheet
| Task | Code |
|---|---|
| Group by two columns | df.groupby(['col1', 'col2'])['val'].sum().reset_index() |
| Multiple aggregations | df.groupby(['col1', 'col2']).agg(total=('val','sum'), avg=('val','mean')).reset_index() |
| Sort results | .sort_values('col', ascending=False) |
| Count rows per group | df.groupby(['col1', 'col2']).size().reset_index(name='count') |
| Add group stat to rows | df['col'] = df.groupby(['col1', 'col2'])['val'].transform('mean') |
| Filter groups | df.groupby('col').filter(lambda x: x['val'].mean() > 75000) |
| Skip reset_index | df.groupby(['col1', 'col2'], as_index=False)['val'].sum() |
| Custom function | df.groupby(['col1']).agg(range=('val', lambda x: x.max()-x.min())) |
Grouping data by multiple columns in pandas is one of the most frequently used operations in data analysis — and once you master the patterns, it becomes one of the fastest ways to extract meaningful insights from raw data.
Here is the simplest summary:
- Pass a list of columns to `groupby()` to group by multiple dimensions
- Always chain `.reset_index()` for clean, flat DataFrame output
- Use named aggregation with `agg()` for multiple metrics and readable column names
- Use `transform()` when you need group statistics added back to the original rows
- Use `filter()` to remove entire groups that do not meet a condition
- Use `size()` instead of `count()` when you need to count rows including NaN values
Start with simple two-column groupby and single aggregation. Then build up to multi-column groupby with named aggregation using agg(). Once those patterns feel natural, add transform() for the most powerful group-level analyses.
FAQs
How do I group by multiple columns in pandas?
Pass a list of column names to groupby(): df.groupby(['col1', 'col2'])['value'].sum().reset_index(). The list can contain any number of columns.
What is the difference between agg() and transform() in pandas groupby?
agg() collapses each group into a single summary row — producing a smaller DataFrame with one row per group. transform() returns a result with the same shape as the original DataFrame — adding group-level statistics back to every row. Use agg() for summary tables and transform() for adding group stats to individual rows.
Why does my groupby return a MultiIndex?
By default, grouped columns become the index levels of the result — creating a MultiIndex. Chain .reset_index() after the aggregation to convert back to a flat DataFrame with regular integer index: df.groupby(['col1', 'col2'])['val'].sum().reset_index().
How do I apply different aggregations to different columns in groupby?
Use named aggregation in agg(): df.groupby(['col1', 'col2']).agg(total=('val1', 'sum'), avg=('val2', 'mean')).reset_index(). Each tuple specifies the source column and aggregation function, and the keyword argument sets the output column name.
What is the difference between count() and size() in pandas groupby?
count() excludes NaN values — it counts non-null values in each column. size() counts all rows in each group regardless of NaN values. Use size() when you want a true row count: df.groupby(['col1', 'col2']).size().reset_index(name='count').
Can I use a custom function with groupby in pandas?
Yes. Pass a lambda function inside agg(): df.groupby(['col1']).agg(salary_range=('salary', lambda x: x.max() - x.min())).reset_index(). Custom functions receive each group as a Series and can return any scalar value.