How to Group Data by Multiple Columns in Pandas

Grouping data is one of the most fundamental operations in data analysis. Whether you need total sales by region and product, average salary by department and job level, or order counts by customer and month — grouping by multiple columns is how you answer those questions.

Pandas groupby() is the tool that makes this possible, and once you understand how it works, it becomes one of the most reached-for functions in your entire data science toolkit.

In this guide, we will walk through everything you need to know about grouping data by multiple columns in pandas, from basic syntax to advanced aggregation patterns, with real examples.

Setting Up the Example Dataset

python

import pandas as pd
import numpy as np

data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve',
             'Frank', 'Grace', 'Henry', 'Iris', 'Jack'],
    'department': ['Engineering', 'Marketing', 'Engineering', 'HR',
                   'Sales', 'Marketing', 'Engineering', 'HR',
                   'Sales', 'Engineering'],
    'region': ['North', 'South', 'North', 'East',
               'West', 'South', 'North', 'East',
               'West', 'South'],
    'job_level': ['Senior', 'Junior', 'Mid', 'Senior',
                  'Junior', 'Mid', 'Senior', 'Junior',
                  'Mid', 'Senior'],
    'salary': [95000, 62000, 78000, 88000, 55000,
               71000, 92000, 58000, 67000, 89000],
    'bonus': [12000, 5000, 8000, 10000, 4000,
              7000, 11000, 5500, 6500, 9000],
    'years_exp': [8, 3, 5, 7, 2, 4, 9, 3, 5, 7]
}

df = pd.DataFrame(data)
print(df.head())

Output:

      name   department region job_level  salary  bonus  years_exp
0    Alice  Engineering  North    Senior   95000  12000          8
1      Bob    Marketing  South    Junior   62000   5000          3
2  Charlie  Engineering  North       Mid   78000   8000          5
3    Diana           HR   East    Senior   88000  10000          7
4      Eve        Sales   West    Junior   55000   4000          2

Basic groupby() With a Single Column

Before grouping by multiple columns, it helps to understand the single-column version.

python

# Group by one column and sum salaries
result = df.groupby('department')['salary'].sum()
print(result)

Output:

department
Engineering    354000
HR             146000
Marketing      133000
Sales          122000
Name: salary, dtype: int64

This gives total salary per department. Now let us extend this to multiple columns.

groupby() With Multiple Columns — The Core Pattern

Pass a list of column names to groupby() to group by multiple columns simultaneously.

python

# Group by two columns
result = df.groupby(['department', 'region'])['salary'].sum()
print(result)

Output:

department   region
Engineering  North     265000
             South      89000
HR           East      146000
Marketing    South     133000
Sales        West      122000
Name: salary, dtype: int64

The result has a MultiIndex — a hierarchical index with both department and region as levels. Each unique combination of department + region becomes one group.
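Before resetting the index, it is worth knowing that a MultiIndex can be selected by label directly. A small sketch on a toy frame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Engineering', 'Engineering', 'Engineering', 'HR'],
    'region': ['North', 'North', 'South', 'East'],
    'salary': [95000, 78000, 89000, 88000],
})

grouped = df.groupby(['department', 'region'])['salary'].sum()

# Select one group by its full (department, region) label
print(grouped.loc[('Engineering', 'North')])  # 173000

# Select every region for one department with .xs() on the first level
print(grouped.xs('Engineering', level='department'))
```

This is the main reason to sometimes keep the MultiIndex: label-based lookups across levels are very convenient for interactive exploration.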

Resetting the Index — Flat DataFrame Output

The MultiIndex output can be difficult to work with. Use reset_index() to convert back to a flat DataFrame.

python

result = df.groupby(['department', 'region'])['salary'].sum().reset_index()
print(result)

Output:

    department region  salary
0  Engineering  North  265000
1  Engineering  South   89000
2           HR   East  146000
3    Marketing  South  133000
4        Sales   West  122000

Now you have a clean, flat DataFrame where department and region are regular columns — easy to sort, filter, and export.

Best practice: Almost always chain .reset_index() after groupby() for cleaner output.

Grouping by Three or More Columns

The same pattern extends to any number of columns.

python

result = df.groupby(['department', 'region', 'job_level'])['salary'].mean().reset_index()
result['salary'] = result['salary'].round(0)
print(result)

Output:

    department region job_level   salary
0  Engineering  North       Mid  78000.0
1  Engineering  North    Senior  93500.0
2  Engineering  South    Senior  89000.0
3           HR   East    Junior  58000.0
4           HR   East    Senior  88000.0
5    Marketing  South    Junior  62000.0
6    Marketing  South       Mid  71000.0
7        Sales   West    Junior  55000.0
8        Sales   West       Mid  67000.0

Three-level grouping gives you the average salary for every unique combination of department, region, and job level.

Common Aggregation Functions

sum() — Total Values

python

dept_region_salary = df.groupby(['department', 'region'])['salary'].sum().reset_index()
dept_region_salary.columns = ['department', 'region', 'total_salary']
print(dept_region_salary)

mean() — Average Values

python

avg_salary = df.groupby(['department', 'job_level'])['salary'].mean().reset_index()
avg_salary['salary'] = avg_salary['salary'].round(2)
print(avg_salary)

count() — Count Rows Per Group

python

headcount = df.groupby(['department', 'region'])['name'].count().reset_index()
headcount.columns = ['department', 'region', 'headcount']
print(headcount)

min() and max() — Range Analysis

python

salary_range = df.groupby(['department', 'job_level']).agg(
    min_salary=('salary', 'min'),
    max_salary=('salary', 'max')
).reset_index()
print(salary_range)

Aggregating Multiple Columns at Once

Pass a list of columns before calling the aggregation function to aggregate multiple columns simultaneously.

python

# Sum both salary and bonus grouped by department and region
result = df.groupby(['department', 'region'])[['salary', 'bonus']].sum().reset_index()
print(result)

Output:

    department region  salary  bonus
0  Engineering  North  265000  31000
1  Engineering  South   89000   9000
2           HR   East  146000  15500
3    Marketing  South  133000  12000
4        Sales   West  122000  10500

Named Aggregation With agg() — Most Flexible Pattern

The agg() method with named aggregation is the most powerful and readable way to apply multiple different aggregations to different columns in one step.

Basic Named Aggregation

python

result = df.groupby(['department', 'region']).agg(
    total_salary=('salary', 'sum'),
    avg_salary=('salary', 'mean'),
    max_salary=('salary', 'max'),
    total_bonus=('bonus', 'sum'),
    headcount=('name', 'count'),
    avg_experience=('years_exp', 'mean')
).reset_index()

result['avg_salary'] = result['avg_salary'].round(0)
result['avg_experience'] = result['avg_experience'].round(1)

print(result)

Output:

    department region  total_salary  avg_salary  max_salary  total_bonus  headcount  avg_experience
0  Engineering  North        265000     88333.0       95000        31000          3             7.3
1  Engineering  South         89000     89000.0       89000         9000          1             7.0
2           HR   East        146000     73000.0       88000        15500          2             5.0
3    Marketing  South        133000     66500.0       71000        12000          2             3.5
4        Sales   West        122000     61000.0       67000        10500          2             3.5

The syntax is:

python

new_column_name=('source_column', 'aggregation_function')

This is clean, readable, and produces exactly the column names you want without any renaming step.
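To see why named aggregation is worth adopting, compare it with the older dictionary style on a toy frame (names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Eng', 'Eng', 'HR'],
    'salary': [90000, 80000, 70000],
})

# Dict style: passing a list of functions produces MultiIndex columns
# named after the source column, which usually need flattening/renaming
dict_style = df.groupby('department').agg({'salary': ['sum', 'mean']})
print(dict_style.columns.tolist())  # [('salary', 'sum'), ('salary', 'mean')]

# Named aggregation: you choose the output column names up front
named = df.groupby('department').agg(
    total_salary=('salary', 'sum'),
    avg_salary=('salary', 'mean'),
)
print(named.columns.tolist())  # ['total_salary', 'avg_salary']
```

The named version skips the renaming step entirely, which matters once summaries feed into reports or downstream code.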

Using Custom Functions in agg()

python

result = df.groupby(['department', 'job_level']).agg(
    avg_salary=('salary', 'mean'),
    salary_range=('salary', lambda x: x.max() - x.min()),
    headcount=('name', 'count'),
    total_comp=('salary', lambda x: x.sum() + df.loc[x.index, 'bonus'].sum())
).reset_index()

print(result)

Lambda functions inside agg() give you complete flexibility for custom calculations.

Sorting Results After Groupby

Chain sort_values() after reset_index() to sort the output.

python

# Top departments by total salary, then by region
result = (
    df.groupby(['department', 'region'])['salary']
    .sum()
    .reset_index()
    .sort_values('salary', ascending=False)
    .reset_index(drop=True)
)

print(result)

Output:

    department region  salary
0  Engineering  North  265000
1           HR   East  146000
2    Marketing  South  133000
3        Sales   West  122000
4  Engineering  South   89000

Multi-Column Sort

python

result = (
    df.groupby(['department', 'job_level'])['salary']
    .mean()
    .reset_index()
    .sort_values(['department', 'salary'], ascending=[True, False])
)

print(result)

Filtering Groups With filter()

The filter() method keeps or removes entire groups based on a condition applied to the group.

python

# Keep only departments where average salary exceeds 75000
high_salary_depts = df.groupby('department').filter(
    lambda x: x['salary'].mean() > 75000
)

print(high_salary_depts[['name', 'department', 'salary']])

python

# Keep only department-region combinations with more than 1 employee
multi_employee_groups = df.groupby(['department', 'region']).filter(
    lambda x: len(x) > 1
)

print(multi_employee_groups[['name', 'department', 'region', 'salary']])

Adding Group Aggregations Back to the Original DataFrame

Use transform() to calculate group-level statistics and add them as new columns to the original DataFrame — keeping all rows intact.

python

# Add department average salary and department headcount to each row
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean').round(0)
df['dept_headcount'] = df.groupby('department')['name'].transform('count')

# Add multi-column group stats
df['dept_region_avg_salary'] = df.groupby(
    ['department', 'region']
)['salary'].transform('mean').round(0)

print(df[['name', 'department', 'region', 'salary',
          'dept_avg_salary', 'dept_headcount',
          'dept_region_avg_salary']])

Output:

      name   department region  salary  dept_avg_salary  dept_headcount  dept_region_avg_salary
0    Alice  Engineering  North   95000          88500.0               4                 88333.0
1      Bob    Marketing  South   62000          66500.0               2                 66500.0
2  Charlie  Engineering  North   78000          88500.0               4                 88333.0
3    Diana           HR   East   88000          73000.0               2                 73000.0

transform() is extremely useful for calculating things like “salary relative to department average” or “percentage of department total.”

python

# Salary as percentage of department total
df['pct_of_dept_salary'] = (
    df['salary'] / df.groupby('department')['salary'].transform('sum') * 100
).round(1)

print(df[['name', 'department', 'salary', 'pct_of_dept_salary']])
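The other pattern mentioned above — salary relative to the department average — follows the same shape; a minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Eng', 'Eng', 'HR'],
    'salary': [90000, 80000, 70000],
})

# Difference between each row's salary and its department mean,
# computed with transform so the result aligns row-for-row
df['diff_from_dept_avg'] = (
    df['salary'] - df.groupby('department')['salary'].transform('mean')
)
print(df)  # Eng rows: +5000 and -5000; HR row: 0
```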

Groupby With as_index=False — Skip reset_index()

Setting as_index=False prevents the group columns from becoming an index — giving you a flat DataFrame directly without needing reset_index().

python

# as_index=False is equivalent to chaining .reset_index()
result = df.groupby(
    ['department', 'region'],
    as_index=False
)['salary'].sum()

print(result)

Both approaches produce the same result. as_index=False is slightly more concise. reset_index() is more explicit and widely used.
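The equivalence is easy to verify directly (toy frame, illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'dept': ['Eng', 'Eng', 'HR'],
                   'salary': [90000, 80000, 70000]})

a = df.groupby('dept', as_index=False)['salary'].sum()
b = df.groupby('dept')['salary'].sum().reset_index()

print(a.equals(b))  # True
```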

Real-World Use Cases

HR Analytics — Department and Level Summary

python

hr_summary = df.groupby(['department', 'job_level']).agg(
    headcount=('name', 'count'),
    avg_salary=('salary', 'mean'),
    avg_bonus=('bonus', 'mean'),
    avg_experience=('years_exp', 'mean')
).reset_index()

hr_summary['avg_salary'] = hr_summary['avg_salary'].round(0)
hr_summary['avg_bonus'] = hr_summary['avg_bonus'].round(0)
hr_summary['avg_experience'] = hr_summary['avg_experience'].round(1)

print(hr_summary.sort_values(['department', 'avg_salary'], ascending=[True, False]))

Sales Analysis — Region and Category Performance

python

# Simulated sales data
sales_data = df.copy()
sales_data['revenue'] = np.random.randint(50000, 200000, len(df))
sales_data['units'] = np.random.randint(100, 500, len(df))

sales_summary = sales_data.groupby(['region', 'department']).agg(
    total_revenue=('revenue', 'sum'),
    avg_revenue=('revenue', 'mean'),
    total_units=('units', 'sum'),
    transaction_count=('revenue', 'count')
).reset_index()

sales_summary['revenue_per_unit'] = (
    sales_summary['total_revenue'] / sales_summary['total_units']
).round(2)

print(sales_summary.sort_values('total_revenue', ascending=False))

Compensation Benchmarking

python

compensation = df.groupby(['department', 'job_level']).agg(
    headcount=('salary', 'count'),
    min_salary=('salary', 'min'),
    p25_salary=('salary', lambda x: x.quantile(0.25)),
    median_salary=('salary', 'median'),
    p75_salary=('salary', lambda x: x.quantile(0.75)),
    max_salary=('salary', 'max')
).reset_index()

print(compensation)

Comparison Table: Groupby Output Options

| Method | Returns | Index Type | Best For |
|---|---|---|---|
| groupby().sum() | Series or DataFrame | MultiIndex | Quick single-metric summaries |
| groupby().sum().reset_index() | Flat DataFrame | Integer | Most reporting use cases |
| groupby(as_index=False).sum() | Flat DataFrame | Integer | Same as above, slightly shorter |
| groupby().agg(name=(...)) | DataFrame | MultiIndex | Multiple metrics, clean names |
| groupby().agg().reset_index() | Flat DataFrame | Integer | Production reporting |
| groupby().transform() | Same shape as input | Original | Adding group stats to each row |
| groupby().filter() | Subset of original | Original | Removing entire groups |

Common Mistakes to Avoid

  • Forgetting reset_index() and being surprised by a MultiIndex — The MultiIndex output of groupby() is useful in some contexts but confusing for most reporting. Get in the habit of chaining .reset_index() unless you specifically need the MultiIndex.
  • Passing a string instead of a list for multiple columns — df.groupby('department', 'region') raises an error, because pandas does not treat the second positional argument as another grouping column. Always pass multiple columns as a list: df.groupby(['department', 'region']).
  • Using agg() with a dictionary instead of named aggregation — The old-style agg({'salary': 'sum'}) dictionary syntax produces columns named after the source column, not custom names. Use the named aggregation syntax agg(total_salary=('salary', 'sum')) for cleaner, more readable output.
  • Confusing groupby().agg() with groupby().transform() — agg() collapses groups into one row per group, while transform() returns a result with the same shape as the input: one value per original row. Use agg() for summary tables and transform() for adding group statistics back to the original DataFrame.
  • Applying count() to a column with NaN values — count() excludes NaN values. If you want to count all rows in a group, including NaN, use size() instead: df.groupby(['dept', 'region']).size().reset_index(name='count').
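The count() vs size() distinction is easiest to see with a tiny frame containing a missing value (names here are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'dept': ['Eng', 'Eng', 'HR'],
    'bonus': [1000, np.nan, 2000],
})

# count() skips the NaN bonus; size() counts every row in the group
counted = df.groupby('dept')['bonus'].count()
sized = df.groupby('dept').size()

print(counted['Eng'])  # 1  (NaN excluded)
print(sized['Eng'])    # 2  (all rows)
```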

Quick Reference Cheat Sheet

| Task | Code |
|---|---|
| Group by two columns | df.groupby(['col1', 'col2'])['val'].sum().reset_index() |
| Multiple aggregations | df.groupby(['col1', 'col2']).agg(total=('val','sum'), avg=('val','mean')).reset_index() |
| Sort results | .sort_values('col', ascending=False) |
| Count rows per group | df.groupby(['col1', 'col2']).size().reset_index(name='count') |
| Add group stat to rows | df['col'] = df.groupby(['col1', 'col2'])['val'].transform('mean') |
| Filter groups | df.groupby('col').filter(lambda x: x['val'].mean() > 75000) |
| Skip reset_index | df.groupby(['col1', 'col2'], as_index=False)['val'].sum() |
| Custom function | df.groupby(['col1']).agg(range=('val', lambda x: x.max()-x.min())) |

Grouping data by multiple columns in pandas is one of the most frequently used operations in data analysis — and once you master the patterns, it becomes one of the fastest ways to extract meaningful insights from raw data.

Here is the simplest summary:

  • Pass a list of columns to groupby() to group by multiple dimensions
  • Always chain .reset_index() for clean, flat DataFrame output
  • Use named aggregation with agg() for multiple metrics and readable column names
  • Use transform() when you need group statistics added back to the original rows
  • Use filter() to remove entire groups that do not meet a condition
  • Use size() instead of count() when you need to count rows including NaN values

Start with simple two-column groupby and single aggregation. Then build up to multi-column groupby with named aggregation using agg(). Once those patterns feel natural, add transform() for the most powerful group-level analyses.

FAQs

How do I group by multiple columns in pandas?

Pass a list of column names to groupby(): df.groupby(['col1', 'col2'])['value'].sum().reset_index(). The list can contain any number of columns.

What is the difference between agg() and transform() in pandas groupby?

agg() collapses each group into a single summary row — producing a smaller DataFrame with one row per group. transform() returns a result with the same shape as the original DataFrame — adding group-level statistics back to every row. Use agg() for summary tables and transform() for adding group stats to individual rows.
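The shape difference is concrete; a minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'dept': ['Eng', 'Eng', 'HR'],
                   'salary': [90000, 80000, 70000]})

agg_out = df.groupby('dept')['salary'].mean()               # one row per group
tf_out = df.groupby('dept')['salary'].transform('mean')     # one row per input row

print(len(agg_out))  # 2
print(len(tf_out))   # 3
```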

Why does my groupby return a MultiIndex?

By default, grouped columns become the index levels of the result — creating a MultiIndex. Chain .reset_index() after the aggregation to convert back to a flat DataFrame with regular integer index: df.groupby(['col1', 'col2'])['val'].sum().reset_index().

How do I apply different aggregations to different columns in groupby?

Use named aggregation in agg(): df.groupby(['col1', 'col2']).agg(total=('val1', 'sum'), avg=('val2', 'mean')).reset_index(). Each tuple specifies the source column and aggregation function, and the keyword argument sets the output column name.

What is the difference between count() and size() in pandas groupby?

count() excludes NaN values — it counts non-null values in each column. size() counts all rows in each group regardless of NaN values. Use size() when you want a true row count: df.groupby(['col1', 'col2']).size().reset_index(name='count').

Can I use a custom function with groupby in pandas?

Yes. Pass a lambda function inside agg(): df.groupby(['col1']).agg(salary_range=('salary', lambda x: x.max() - x.min())).reset_index(). Custom functions receive each group as a Series and can return any scalar value.
