Grouping data is one of the most fundamental operations in data analysis. Whether you need total sales by region and product, average salary by department and job level, or order counts by customer and month — grouping by multiple columns is how you answer those questions.
Pandas groupby() is the tool that makes this possible, and once you understand how it works, it becomes one of the most reached-for functions in your entire data science toolkit.
In this guide, we will walk through everything you need to know about grouping data by multiple columns in pandas, from basic syntax to advanced aggregation patterns, with real examples.
Setting Up the Example Dataset
python
import pandas as pd
import numpy as np
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve',
             'Frank', 'Grace', 'Henry', 'Iris', 'Jack'],
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR',
                   'Sales', 'Marketing', 'Engineering', 'HR',
                   'Sales', 'Engineering'],
    'region': ['North', 'South', 'North', 'East',
               'West', 'South', 'North', 'East',
               'West', 'South'],
    'job_level': ['Senior', 'Junior', 'Mid', 'Senior',
                  'Junior', 'Mid', 'Senior', 'Junior',
                  'Mid', 'Senior'],
    'salary': [95000, 62000, 78000, 88000, 55000,
               71000, 92000, 58000, 67000, 89000],
    'bonus': [12000, 5000, 8000, 10000, 4000,
              7000, 11000, 5500, 6500, 9000],
    'years_exp': [8, 3, 5, 7, 2, 4, 9, 3, 5, 7]
}
df = pd.DataFrame(data)
print(df.head())
Output:
| name | department | region | job_level | salary | bonus | years_exp |
|---|---|---|---|---|---|---|
| Alice | Engineering | North | Senior | 95000 | 12000 | 8 |
| Bob | Marketing | South | Junior | 62000 | 5000 | 3 |
| Charlie | Engineering | North | Mid | 78000 | 8000 | 5 |
| Diana | HR | East | Senior | 88000 | 10000 | 7 |
| Eve | Sales | West | Junior | 55000 | 4000 | 2 |
Basic groupby() With a Single Column
Before grouping by multiple columns, it helps to understand the single-column version.
python
# Group by one column and sum salaries
result = df.groupby('department')['salary'].sum()
print(result)
Output:
department
Engineering 354000
HR 146000
Marketing 133000
Sales 122000
Name: salary, dtype: int64
This gives total salary per department. Now let us extend this to multiple columns.
groupby() With Multiple Columns — The Core Pattern
Pass a list of column names to groupby() to group by multiple columns simultaneously.
python
# Group by two columns
result = df.groupby(['department', 'region'])['salary'].sum()
print(result)
Output:
department region
Engineering North 265000
South 89000
HR East 146000
Marketing South 133000
Sales West 122000
Name: salary, dtype: int64
The result has a MultiIndex — a hierarchical index with both department and region as levels. Each unique combination of department + region becomes one group.
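The MultiIndex is not just cosmetic: it supports tuple-based selection. A minimal sketch with a toy frame (a cut-down version of the example dataset above) showing the common access patterns:

```python
import pandas as pd

# Minimal frame mirroring the example dataset
df = pd.DataFrame({
    'department': ['Engineering', 'Engineering', 'Engineering', 'HR'],
    'region': ['North', 'North', 'South', 'East'],
    'salary': [95000, 78000, 89000, 88000],
})

result = df.groupby(['department', 'region'])['salary'].sum()

# Select one group with a tuple of level values
print(result.loc[('Engineering', 'North')])   # 173000

# Select every group under one outer level
print(result.loc['Engineering'])

# .xs() slices on an inner level by name
print(result.xs('North', level='region'))
```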
Resetting the Index — Flat DataFrame Output
The MultiIndex output can be difficult to work with. Use reset_index() to convert back to a flat DataFrame.
python
result = df.groupby(['department', 'region'])['salary'].sum().reset_index()
print(result)
Output:
| department | region | salary |
|---|---|---|
| Engineering | North | 265000 |
| Engineering | South | 89000 |
| HR | East | 146000 |
| Marketing | South | 133000 |
| Sales | West | 122000 |
Now you have a clean, flat DataFrame where department and region are regular columns — easy to sort, filter, and export.
Best practice: Almost always chain .reset_index() after groupby() for cleaner output.
Grouping by Three or More Columns
The same pattern extends to any number of columns.
python
result = df.groupby(['department', 'region', 'job_level'])['salary'].mean().reset_index()
result['salary'] = result['salary'].round(0)
print(result)
Output:
| department | region | job_level | salary |
|---|---|---|---|
| Engineering | North | Mid | 78000 |
| Engineering | North | Senior | 93500 |
| Engineering | South | Senior | 89000 |
| HR | East | Junior | 58000 |
| HR | East | Senior | 88000 |
| Marketing | South | Junior | 62000 |
| Marketing | South | Mid | 71000 |
| Sales | West | Junior | 55000 |
| Sales | West | Mid | 67000 |
Three-level grouping gives you the average salary for every unique combination of department, region, and job level.
Common Aggregation Functions
sum() — Total Values
python
dept_region_salary = df.groupby(['department', 'region'])['salary'].sum().reset_index()
dept_region_salary.columns = ['department', 'region', 'total_salary']
print(dept_region_salary)
mean() — Average Values
python
avg_salary = df.groupby(['department', 'job_level'])['salary'].mean().reset_index()
avg_salary['salary'] = avg_salary['salary'].round(2)
print(avg_salary)
count() — Count Rows Per Group
python
headcount = df.groupby(['department', 'region'])['name'].count().reset_index()
headcount.columns = ['department', 'region', 'headcount']
print(headcount)
min() and max() — Range Analysis
python
salary_range = df.groupby(['department', 'job_level']).agg(
min_salary=('salary', 'min'),
max_salary=('salary', 'max')
).reset_index()
print(salary_range)
Aggregating Multiple Columns at Once
Select a list of columns (double square brackets) before calling the aggregation function to aggregate multiple columns simultaneously.
python
# Sum both salary and bonus grouped by department and region
result = df.groupby(['department', 'region'])[['salary', 'bonus']].sum().reset_index()
print(result)
Output:
| department | region | salary | bonus |
|---|---|---|---|
| Engineering | North | 265000 | 31000 |
| Engineering | South | 89000 | 9000 |
| HR | East | 146000 | 15500 |
| Marketing | South | 133000 | 12000 |
| Sales | West | 122000 | 10500 |
Named Aggregation With agg() — Most Flexible Pattern
The agg() method with named aggregation is the most powerful and readable way to apply multiple different aggregations to different columns in one step.
Basic Named Aggregation
python
result = df.groupby(['department', 'region']).agg(
total_salary=('salary', 'sum'),
avg_salary=('salary', 'mean'),
max_salary=('salary', 'max'),
total_bonus=('bonus', 'sum'),
headcount=('name', 'count'),
avg_experience=('years_exp', 'mean')
).reset_index()
result['avg_salary'] = result['avg_salary'].round(0)
result['avg_experience'] = result['avg_experience'].round(1)
print(result)
Output:
| department | region | total_salary | avg_salary | max_salary | total_bonus | headcount | avg_experience |
|---|---|---|---|---|---|---|---|
| Engineering | North | 265000 | 88333 | 95000 | 31000 | 3 | 7.3 |
| Engineering | South | 89000 | 89000 | 89000 | 9000 | 1 | 7.0 |
| HR | East | 146000 | 73000 | 88000 | 15500 | 2 | 5.0 |
| Marketing | South | 133000 | 66500 | 71000 | 12000 | 2 | 3.5 |
| Sales | West | 122000 | 61000 | 67000 | 10500 | 2 | 3.5 |
The syntax is:
python
new_column_name=('source_column', 'aggregation_function')
This is clean, readable, and produces exactly the column names you want without any renaming step.
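The tuple shorthand is itself sugar for pd.NamedAgg, which spells out the same thing explicitly. A small sketch with a toy frame (the column names here are illustrative, not from the dataset above):

```python
import pandas as pd

df = pd.DataFrame({
    'dept': ['A', 'A', 'B'],
    'salary': [100, 200, 300],
})

# pd.NamedAgg(column=..., aggfunc=...) is equivalent to the tuple form
result = df.groupby('dept').agg(
    total_salary=pd.NamedAgg(column='salary', aggfunc='sum')
).reset_index()
print(result)   # dept A -> 300, dept B -> 300
```

The tuple form is shorter and more common; pd.NamedAgg can read more clearly when the aggregation spec is built programmatically.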
Using Custom Functions in agg()
python
result = df.groupby(['department', 'job_level']).agg(
avg_salary=('salary', 'mean'),
salary_range=('salary', lambda x: x.max() - x.min()),
headcount=('name', 'count'),
total_comp=('salary', lambda x: x.sum() + df.loc[x.index, 'bonus'].sum())
).reset_index()
print(result)
Lambda functions inside agg() give you complete flexibility for custom calculations. Note that the total_comp lambda reaches back into the outer df through the group's index (x.index); this works, but it couples the lambda to the original DataFrame, so use that pattern with care.
Sorting Results After Groupby
Chain sort_values() after reset_index() to sort the output.
python
# Top departments by total salary, then by region
result = (
df.groupby(['department', 'region'])['salary']
.sum()
.reset_index()
.sort_values('salary', ascending=False)
.reset_index(drop=True)
)
print(result)
Output:
| department | region | salary |
|---|---|---|
| Engineering | North | 265000 |
| HR | East | 146000 |
| Marketing | South | 133000 |
| Sales | West | 122000 |
| Engineering | South | 89000 |
Multi-Column Sort
python
result = (
df.groupby(['department', 'job_level'])['salary']
.mean()
.reset_index()
.sort_values(['department', 'salary'], ascending=[True, False])
)
print(result)
Filtering Groups With filter()
The filter() method keeps or removes entire groups based on a condition applied to the group.
python
# Keep only departments where average salary exceeds 75000
high_salary_depts = df.groupby('department').filter(
lambda x: x['salary'].mean() > 75000
)
print(high_salary_depts[['name', 'department', 'salary']])
python
# Keep only department-region combinations with more than 1 employee
multi_employee_groups = df.groupby(['department', 'region']).filter(
lambda x: len(x) > 1
)
print(multi_employee_groups[['name', 'department', 'region', 'salary']])
Adding Group Aggregations Back to the Original DataFrame
Use transform() to calculate group-level statistics and add them as new columns to the original DataFrame — keeping all rows intact.
python
# Add department average salary and department headcount to each row
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean').round(0)
df['dept_headcount'] = df.groupby('department')['name'].transform('count')
# Add multi-column group stats
df['dept_region_avg_salary'] = df.groupby(
['department', 'region']
)['salary'].transform('mean').round(0)
print(df[['name', 'department', 'region', 'salary',
'dept_avg_salary', 'dept_headcount',
'dept_region_avg_salary']])
Output:
| name | department | region | salary | dept_avg_salary | dept_headcount | dept_region_avg_salary |
|---|---|---|---|---|---|---|
| Alice | Engineering | North | 95000 | 88500 | 4 | 88333 |
| Bob | Marketing | South | 62000 | 66500 | 2 | 66500 |
| Charlie | Engineering | North | 78000 | 88500 | 4 | 88333 |
| Diana | HR | East | 88000 | 73000 | 2 | 73000 |
transform() is extremely useful for calculating things like “salary relative to department average” or “percentage of department total.”
python
# Salary as percentage of department total
df['pct_of_dept_salary'] = (
df['salary'] / df.groupby('department')['salary'].transform('sum') * 100
).round(1)
print(df[['name', 'department', 'salary', 'pct_of_dept_salary']])
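The "salary relative to department average" calculation mentioned above follows the same transform pattern. A minimal self-contained sketch with a cut-down frame:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Charlie', 'Bob', 'Frank'],
    'department': ['Engineering', 'Engineering', 'Marketing', 'Marketing'],
    'salary': [95000, 78000, 62000, 71000],
})

# Difference between each salary and its department mean
df['vs_dept_avg'] = (
    df['salary'] - df.groupby('department')['salary'].transform('mean')
)
print(df)   # Alice +8500, Charlie -8500, Bob -4500, Frank +4500
```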
Groupby With as_index=False — Skip reset_index()
Setting as_index=False prevents the group columns from becoming an index — giving you a flat DataFrame directly without needing reset_index().
python
# as_index=False is equivalent to chaining .reset_index()
result = df.groupby(
['department', 'region'],
as_index=False
)['salary'].sum()
print(result)
Both approaches produce the same result. as_index=False is slightly more concise. reset_index() is more explicit and widely used.
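The equivalence is easy to check directly with DataFrame.equals(). A quick sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Eng', 'Eng', 'HR'],
    'region': ['North', 'North', 'East'],
    'salary': [100, 200, 300],
})

flat = df.groupby(['department', 'region'], as_index=False)['salary'].sum()
reset = df.groupby(['department', 'region'])['salary'].sum().reset_index()

# Same columns, same rows, same dtypes
print(flat.equals(reset))   # True
```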
Real-World Use Cases
HR Analytics — Department and Level Summary
python
hr_summary = df.groupby(['department', 'job_level']).agg(
headcount=('name', 'count'),
avg_salary=('salary', 'mean'),
avg_bonus=('bonus', 'mean'),
avg_experience=('years_exp', 'mean')
).reset_index()
hr_summary['avg_salary'] = hr_summary['avg_salary'].round(0)
hr_summary['avg_bonus'] = hr_summary['avg_bonus'].round(0)
hr_summary['avg_experience'] = hr_summary['avg_experience'].round(1)
print(hr_summary.sort_values(['department', 'avg_salary'], ascending=[True, False]))
Sales Analysis — Region and Category Performance
python
# Simulated sales data
np.random.seed(42)  # fix the seed so the simulated numbers are reproducible
sales_data = df.copy()
sales_data['revenue'] = np.random.randint(50000, 200000, len(df))
sales_data['units'] = np.random.randint(100, 500, len(df))
sales_summary = sales_data.groupby(['region', 'department']).agg(
total_revenue=('revenue', 'sum'),
avg_revenue=('revenue', 'mean'),
total_units=('units', 'sum'),
transaction_count=('revenue', 'count')
).reset_index()
sales_summary['revenue_per_unit'] = (
sales_summary['total_revenue'] / sales_summary['total_units']
).round(2)
print(sales_summary.sort_values('total_revenue', ascending=False))
Compensation Benchmarking
python
compensation = df.groupby(['department', 'job_level']).agg(
headcount=('salary', 'count'),
min_salary=('salary', 'min'),
p25_salary=('salary', lambda x: x.quantile(0.25)),
median_salary=('salary', 'median'),
p75_salary=('salary', lambda x: x.quantile(0.75)),
max_salary=('salary', 'max')
).reset_index()
print(compensation)
Comparison Table: Groupby Output Options
| Method | Returns | Index Type | Best For |
|---|---|---|---|
| groupby().sum() | Series or DataFrame | MultiIndex | Quick single-metric summaries |
| groupby().sum().reset_index() | Flat DataFrame | Integer | Most reporting use cases |
| groupby(as_index=False).sum() | Flat DataFrame | Integer | Same as above, slightly shorter |
| groupby().agg(name=(...)) | DataFrame | MultiIndex | Multiple metrics, clean names |
| groupby().agg().reset_index() | Flat DataFrame | Integer | Production reporting |
| groupby().transform() | Same shape as input | Original | Adding group stats to each row |
| groupby().filter() | Subset of original | Original | Removing entire groups |
Common Mistakes to Avoid
- Forgetting reset_index() and being surprised by MultiIndex — The MultiIndex output of groupby is useful in some contexts but confusing for most reporting. Get in the habit of chaining `.reset_index()` unless you specifically need the MultiIndex
- Passing a string instead of a list for multiple columns — `df.groupby('department', 'region')` raises an error, because the second positional argument is not interpreted as another grouping key. Always pass multiple columns as a list: `df.groupby(['department', 'region'])`
- Using agg() with a dictionary instead of named aggregation — The old-style `agg({'salary': 'sum'})` dictionary syntax produces columns named after the source column, not custom names. Use the named aggregation syntax `agg(total_salary=('salary', 'sum'))` for cleaner, more readable output
- Confusing groupby().agg() with groupby().transform() — `agg()` collapses groups into one row per group. `transform()` returns a result with the same shape as the input, one value per original row. Use `agg()` for summary tables and `transform()` for adding group statistics back to the original DataFrame
- Applying count() to a column with NaN values — `count()` excludes NaN values. If you want to count all rows in a group including NaN, use `size()` instead: `df.groupby(['dept', 'region']).size().reset_index(name='count')`
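The count() versus size() difference is easy to demonstrate with a tiny frame containing a missing value (a minimal sketch, not tied to the employee dataset above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'dept': ['A', 'A', 'B'],
    'val': [1.0, np.nan, 3.0],
})

# count() skips the NaN; size() counts every row in the group
counts = df.groupby('dept')['val'].count()
sizes = df.groupby('dept').size()
print(counts)   # A: 1, B: 1
print(sizes)    # A: 2, B: 1
```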
Quick Reference Cheat Sheet
| Task | Code |
|---|---|
| Group by two columns | df.groupby(['col1', 'col2'])['val'].sum().reset_index() |
| Multiple aggregations | df.groupby(['col1', 'col2']).agg(total=('val','sum'), avg=('val','mean')).reset_index() |
| Sort results | .sort_values('col', ascending=False) |
| Count rows per group | df.groupby(['col1', 'col2']).size().reset_index(name='count') |
| Add group stat to rows | df['col'] = df.groupby(['col1', 'col2'])['val'].transform('mean') |
| Filter groups | df.groupby('col').filter(lambda x: x['val'].mean() > 75000) |
| Skip reset_index | df.groupby(['col1', 'col2'], as_index=False)['val'].sum() |
| Custom function | df.groupby(['col1']).agg(range=('val', lambda x: x.max()-x.min())) |
Grouping data by multiple columns in pandas is one of the most frequently used operations in data analysis — and once you master the patterns, it becomes one of the fastest ways to extract meaningful insights from raw data.
Here is the simplest summary:
- Pass a list of columns to `groupby()` to group by multiple dimensions
- Always chain `.reset_index()` for clean, flat DataFrame output
- Use named aggregation with `agg()` for multiple metrics and readable column names
- Use `transform()` when you need group statistics added back to the original rows
- Use `filter()` to remove entire groups that do not meet a condition
- Use `size()` instead of `count()` when you need to count rows including NaN values
Start with simple two-column groupby and single aggregation. Then build up to multi-column groupby with named aggregation using agg(). Once those patterns feel natural, add transform() for the most powerful group-level analyses.
FAQs
How do I group by multiple columns in pandas?
Pass a list of column names to groupby(): df.groupby(['col1', 'col2'])['value'].sum().reset_index(). The list can contain any number of columns.
What is the difference between agg() and transform() in pandas groupby?
agg() collapses each group into a single summary row — producing a smaller DataFrame with one row per group. transform() returns a result with the same shape as the original DataFrame — adding group-level statistics back to every row. Use agg() for summary tables and transform() for adding group stats to individual rows.
Why does my groupby return a MultiIndex?
By default, grouped columns become the index levels of the result — creating a MultiIndex. Chain .reset_index() after the aggregation to convert back to a flat DataFrame with regular integer index: df.groupby(['col1', 'col2'])['val'].sum().reset_index().
How do I apply different aggregations to different columns in groupby?
Use named aggregation in agg(): df.groupby(['col1', 'col2']).agg(total=('val1', 'sum'), avg=('val2', 'mean')).reset_index(). Each tuple specifies the source column and aggregation function, and the keyword argument sets the output column name.
What is the difference between count() and size() in pandas groupby?
count() excludes NaN values — it counts non-null values in each column. size() counts all rows in each group regardless of NaN values. Use size() when you want a true row count: df.groupby(['col1', 'col2']).size().reset_index(name='count').
Can I use a custom function with groupby in pandas?
Yes. Pass a lambda function inside agg(): df.groupby(['col1']).agg(salary_range=('salary', lambda x: x.max() - x.min())).reset_index(). Custom functions receive each group as a Series and can return any scalar value.