How to Concatenate DataFrames in Pandas

One of the most common tasks in any data analysis or data engineering project is combining data from multiple sources into a single, unified dataset.

Maybe you have monthly sales files that need to be stacked into one annual dataset. Maybe you have survey results split across multiple tables that need to be combined. Maybe you are pulling data from different API calls and need to bring them all together.

In pandas, the primary tool for this job is pd.concat(), and understanding how to use it correctly will save you a significant amount of time and frustration.

In this guide, we will walk through everything you need to know about concatenating DataFrames in pandas — what it means, how pd.concat works, all the key parameters, common patterns, and how it compares to merge and join.

What Is Concatenation in Pandas?

Concatenation means combining two or more DataFrames together by stacking them either vertically (adding more rows) or horizontally (adding more columns).

Simple Analogy

Think of DataFrames like spreadsheets.

Vertical concatenation is like taking two spreadsheets with the same columns and stacking one on top of the other to create one longer spreadsheet.

Horizontal concatenation is like placing two spreadsheets side by side to create one wider spreadsheet.

Both operations are done with pd.concat() in pandas.

Setting Up the Example DataFrames

We will use consistent example DataFrames throughout this guide.

python

import pandas as pd
import numpy as np

# Q1 Sales Data
df1 = pd.DataFrame({
    'employee_id': [101, 102, 103],
    'name': ['Alice', 'Bob', 'Charlie'],
    'department': ['Engineering', 'Marketing', 'Sales'],
    'q1_sales': [45000, 32000, 51000]
})

# Q2 Sales Data (intentionally reuses df1's column names so the rows stack cleanly)
df2 = pd.DataFrame({
    'employee_id': [101, 102, 104],
    'name': ['Alice', 'Bob', 'Diana'],
    'department': ['Engineering', 'Marketing', 'HR'],
    'q1_sales': [48000, 35000, 29000]
})

# Q3 Sales Data — different column name
df3 = pd.DataFrame({
    'employee_id': [103, 104, 105],
    'name': ['Charlie', 'Diana', 'Eve'],
    'department': ['Sales', 'HR', 'Engineering'],
    'q3_sales': [55000, 31000, 42000]
})

print("df1:")
print(df1)
print("\ndf2:")
print(df2)

df1:

   employee_id     name   department  q1_sales
0          101    Alice  Engineering     45000
1          102      Bob    Marketing     32000
2          103  Charlie        Sales     51000

df2:

   employee_id   name   department  q1_sales
0          101  Alice  Engineering     48000
1          102    Bob    Marketing     35000
2          104  Diana           HR     29000

The pd.concat() Function

Basic Syntax

python

pd.concat(
    objs,           # List of DataFrames to concatenate
    axis=0,         # 0 = vertical (rows), 1 = horizontal (columns)
    join='outer',   # 'outer' = keep all columns, 'inner' = keep common columns
    ignore_index=False,  # Reset index or keep original
    keys=None,      # Add hierarchical index labels
    sort=False      # Sort columns alphabetically
)

Let us go through every parameter with practical examples.

Method 1: Basic Vertical Concatenation (Stacking Rows)

The most common use case is stacking two DataFrames with the same columns on top of each other.

python

result = pd.concat([df1, df2])
print(result)

Output:

   employee_id     name   department  q1_sales
0          101    Alice  Engineering     45000
1          102      Bob    Marketing     32000
2          103  Charlie        Sales     51000
0          101    Alice  Engineering     48000
1          102      Bob    Marketing     35000
2          104    Diana           HR     29000

Notice the index: it shows 0, 1, 2, 0, 1, 2. The original indexes from both DataFrames are preserved. This can cause problems if you try to access rows by index later. The fix is ignore_index.
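To see why duplicate labels are a problem, try selecting label 0 from the stacked result: .loc returns every row carrying that label, not a single row. A minimal sketch (the DataFrames are trimmed to two columns for brevity):

```python
import pandas as pd

df1 = pd.DataFrame({'employee_id': [101, 102, 103],
                    'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'employee_id': [101, 102, 104],
                    'name': ['Alice', 'Bob', 'Diana']})

result = pd.concat([df1, df2])  # index is 0, 1, 2, 0, 1, 2

# .loc[0] matches BOTH rows labeled 0, so it returns a 2-row DataFrame
print(result.loc[0])
print(len(result.loc[0]))  # 2
```

Code that expects `result.loc[0]` to be a single row will break silently here, which is exactly why the next method exists.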

Method 2: Resetting the Index with ignore_index

python

result = pd.concat([df1, df2], ignore_index=True)
print(result)

Output:

   employee_id     name   department  q1_sales
0          101    Alice  Engineering     45000
1          102      Bob    Marketing     32000
2          103  Charlie        Sales     51000
3          101    Alice  Engineering     48000
4          102      Bob    Marketing     35000
5          104    Diana           HR     29000

Now the index runs continuously from 0 to 5. This is almost always what you want when stacking DataFrames vertically. Always use ignore_index=True unless you have a specific reason to preserve the original indexes.

Method 3: Concatenating More Than Two DataFrames

pd.concat() accepts a list of any number of DataFrames.

python

# Monthly sales data
jan = pd.DataFrame({'month': ['Jan'], 'sales': [120000]})
feb = pd.DataFrame({'month': ['Feb'], 'sales': [135000]})
mar = pd.DataFrame({'month': ['Mar'], 'sales': [148000]})
apr = pd.DataFrame({'month': ['Apr'], 'sales': [112000]})

# Concatenate all four at once
annual = pd.concat([jan, feb, mar, apr], ignore_index=True)
print(annual)

Output:

  month   sales
0   Jan  120000
1   Feb  135000
2   Mar  148000
3   Apr  112000

This is extremely useful in real-world workflows where you load multiple files in a loop and concatenate them all at once.

Method 4: Concatenating Multiple Files in a Loop

One of the most common real-world patterns — reading multiple CSV files and combining them into a single DataFrame.

python

# Simulating multiple monthly files
dataframes = []

for month in ['january', 'february', 'march']:
    # In practice: df = pd.read_csv(f'{month}_sales.csv')
    df = pd.DataFrame({
        'month': [month],
        'revenue': [np.random.randint(100000, 200000)]
    })
    dataframes.append(df)

combined = pd.concat(dataframes, ignore_index=True)
print(combined)

The pattern:

  1. Create an empty list
  2. Loop through files or data sources
  3. Append each DataFrame to the list
  4. Call pd.concat() on the list once at the end

This is far more efficient than concatenating inside the loop. Concatenating one by one creates a new DataFrame on every iteration, which is slow on large datasets.

Method 5: Handling Different Columns — join Parameter

When the DataFrames you are concatenating have different columns, the join parameter controls what happens.

join='outer' (Default) — Keep All Columns

python

result_outer = pd.concat([df1, df3], join='outer', ignore_index=True)
print(result_outer)

Output:

   employee_id     name   department  q1_sales  q3_sales
0          101    Alice  Engineering   45000.0       NaN
1          102      Bob    Marketing   32000.0       NaN
2          103  Charlie        Sales   51000.0       NaN
3          103  Charlie        Sales       NaN   55000.0
4          104    Diana           HR       NaN   31000.0
5          105      Eve  Engineering       NaN   42000.0

With join='outer', all columns from both DataFrames are kept. Rows from df1 get NaN in the q3_sales column (which only exists in df3), and rows from df3 get NaN in the q1_sales column.
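If those NaN values get in the way of downstream math, a common follow-up is to fill them, for example treating a missing quarter as zero sales. A sketch using the same q1_sales/q3_sales shape (trimmed to the relevant columns):

```python
import pandas as pd

df1 = pd.DataFrame({'employee_id': [101, 102, 103],
                    'q1_sales': [45000, 32000, 51000]})
df3 = pd.DataFrame({'employee_id': [103, 104, 105],
                    'q3_sales': [55000, 31000, 42000]})

result = pd.concat([df1, df3], join='outer', ignore_index=True)

# Treat a missing quarter as zero sales for aggregation purposes
filled = result.fillna({'q1_sales': 0, 'q3_sales': 0})
print(filled)
```

Whether 0 is the right fill value depends on your analysis; sometimes leaving NaN (so means and counts stay honest) is the better choice.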

join='inner' — Keep Only Common Columns

python

result_inner = pd.concat([df1, df3], join='inner', ignore_index=True)
print(result_inner)

Output:

   employee_id     name   department
0          101    Alice  Engineering
1          102      Bob    Marketing
2          103  Charlie        Sales
3          103  Charlie        Sales
4          104    Diana           HR
5          105      Eve  Engineering

With join='inner', only the columns that exist in both DataFrames are kept — employee_id, name, and department. The q1_sales and q3_sales columns are dropped because neither exists in both DataFrames.
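An alternative to dropping the sales columns entirely is to rename them to a common name before concatenating and record the quarter in its own column. A sketch under our own naming choices ('sales' and 'quarter' are labels we introduce, not part of the original data):

```python
import pandas as pd

df1 = pd.DataFrame({'employee_id': [101, 102, 103],
                    'q1_sales': [45000, 32000, 51000]})
df3 = pd.DataFrame({'employee_id': [103, 104, 105],
                    'q3_sales': [55000, 31000, 42000]})

# Unify the column names, tagging each row with its quarter
q1 = df1.rename(columns={'q1_sales': 'sales'}).assign(quarter='Q1')
q3 = df3.rename(columns={'q3_sales': 'sales'}).assign(quarter='Q3')

tidy = pd.concat([q1, q3], ignore_index=True)
print(tidy)  # one 'sales' column, no NaN, quarter preserved
```

This "rename, tag, stack" pattern produces long-format data that groups and pivots cleanly.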

Method 6: Horizontal Concatenation (Adding Columns) — axis=1

Setting axis=1 places DataFrames side by side, adding more columns rather than more rows.

python

# Employee basic info
df_info = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'department': ['Engineering', 'Marketing', 'Sales']
})

# Employee performance scores
df_scores = pd.DataFrame({
    'q1_score': [92, 78, 85],
    'q2_score': [95, 82, 88]
})

# Combine side by side
result = pd.concat([df_info, df_scores], axis=1)
print(result)

Output:

      name   department  q1_score  q2_score
0    Alice  Engineering        92        95
1      Bob    Marketing        78        82
2  Charlie        Sales        85        88

The two DataFrames are aligned by index, so row 0 of df_info lands next to row 0 of df_scores. This works correctly here because both DataFrames share the same default index (0, 1, 2).

Warning: Index Alignment with axis=1

When using axis=1, pandas aligns rows by their index values, not by position. If your DataFrames have different indexes, you can get unexpected NaN values.

python

df_a = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
df_b = pd.DataFrame({'B': [4, 5, 6]}, index=[1, 2, 3])

result = pd.concat([df_a, df_b], axis=1)
print(result)

Output:

     A    B
0  1.0  NaN
1  2.0  4.0
2  3.0  5.0
3  NaN  6.0

Because index 0 only exists in df_a and index 3 only exists in df_b, those rows get NaN for the missing values. Always align your indexes before using axis=1 concatenation.
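If positional alignment (row 0 next to row 0) is what you actually want, reset both indexes first. A sketch using the same df_a and df_b:

```python
import pandas as pd

df_a = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
df_b = pd.DataFrame({'B': [4, 5, 6]}, index=[1, 2, 3])

# drop=True discards the old index instead of keeping it as a column
aligned = pd.concat([df_a.reset_index(drop=True),
                     df_b.reset_index(drop=True)], axis=1)
print(aligned)  # 3 rows, no NaN
```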

Method 7: Adding Hierarchical Keys with the keys Parameter

The keys parameter adds a hierarchical index (MultiIndex) to your concatenated DataFrame, making it easy to identify which original DataFrame each row came from.

python

result = pd.concat([df1, df2], keys=['Q1', 'Q2'])
print(result)

Output:

      employee_id     name   department  q1_sales
Q1 0          101    Alice  Engineering     45000
   1          102      Bob    Marketing     32000
   2          103  Charlie        Sales     51000
Q2 0          101    Alice  Engineering     48000
   1          102      Bob    Marketing     35000
   2          104    Diana           HR     29000

Now you can access data from each original DataFrame by its key:

python

# Access only Q1 data
print(result.loc['Q1'])

# Access only Q2 data
print(result.loc['Q2'])

This is very useful when you need to track the source of each row after combining multiple DataFrames.
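If you would rather have the source as an ordinary column than a MultiIndex level, pd.concat also accepts a names parameter for labeling the key levels, which combines nicely with reset_index. A sketch ('quarter' is a label we choose; the DataFrames are trimmed to one column):

```python
import pandas as pd

df1 = pd.DataFrame({'employee_id': [101, 102, 103]})
df2 = pd.DataFrame({'employee_id': [101, 102, 104]})

# names labels the MultiIndex levels: the key level and the original index
result = pd.concat([df1, df2], keys=['Q1', 'Q2'], names=['quarter', None])

# Move the 'quarter' level into a regular column, then flatten the index
flat = result.reset_index(level='quarter').reset_index(drop=True)
print(flat)
```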

Method 8: Sorting Columns with sort Parameter

When concatenating DataFrames with different columns, the resulting column order can be unpredictable. Use sort=True to sort columns alphabetically.

python

result = pd.concat([df1, df3], join='outer', ignore_index=True, sort=True)
print(result)

Output:

Columns are now in alphabetical order: department, employee_id, name, q1_sales, q3_sales.

Method 9: Concatenating Series

pd.concat() works on Series as well as DataFrames.

python

s1 = pd.Series([1, 2, 3], name='A')
s2 = pd.Series([4, 5, 6], name='B')

# Stack Series vertically
vertical = pd.concat([s1, s2], ignore_index=True)
print(vertical)

# Place Series side by side as columns
horizontal = pd.concat([s1, s2], axis=1)
print(horizontal)

Vertical Output:

0    1
1    2
2    3
3    4
4    5
5    6

Horizontal Output:

   A  B
0  1  4
1  2  5
2  3  6

Concatenating Series with axis=1 creates a DataFrame where each Series becomes a column. This is a very clean way to build DataFrames from multiple Series.

Method 10: Verifying the Concatenated Result

After concatenating, always verify your result to catch problems early.

python

result = pd.concat([df1, df2], ignore_index=True)

# Check shape — did you get the expected number of rows and columns?
print(f"Shape: {result.shape}")

# Check for unexpected NaN values
print(f"\nMissing values:\n{result.isnull().sum()}")

# Check for duplicate rows
print(f"\nDuplicate rows: {result.duplicated().sum()}")

# Preview the result
print(f"\nFirst few rows:\n{result.head()}")
print(f"\nLast few rows:\n{result.tail()}")

These checks take seconds and can save you from building analysis on top of a broken concatenation.
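In a pipeline, the same checks can be hardened into assertions so the job fails loudly instead of silently producing a bad dataset. A sketch with hypothetical toy data:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'v': [10, 20, 30]})
df2 = pd.DataFrame({'id': [4, 5], 'v': [40, 50]})

result = pd.concat([df1, df2], ignore_index=True)

# Row count must equal the sum of the inputs, columns must be unchanged
assert len(result) == len(df1) + len(df2)
assert list(result.columns) == list(df1.columns)
# No NaN should appear when the inputs share identical columns
assert result.isnull().sum().sum() == 0
print("concat checks passed")
```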

pd.concat() vs merge() vs join() — What Is the Difference?

A common question beginners ask is when to use concat vs merge vs join. They all combine DataFrames, but they serve very different purposes.

pd.concat()

Combines DataFrames by stacking them — either adding more rows (axis=0) or more columns (axis=1). It does not match rows based on key values, it simply places DataFrames next to or on top of each other.

python

# Stack rows from two DataFrames
pd.concat([df1, df2], ignore_index=True)

pd.merge()

Combines DataFrames by matching rows based on one or more key columns, similar to SQL JOIN operations. Use merge when you need to bring related data together based on a shared identifier.

python

# Combine based on matching employee_id
pd.merge(df_employees, df_salaries, on='employee_id', how='left')

DataFrame.join()

Similar to merge but joins on the index by default. Most useful when the DataFrames share a meaningful index.

python

# Join on index
df_employees.join(df_scores)

Feature                   | pd.concat()           | pd.merge()                | df.join()
Primary use               | Stack DataFrames      | Match rows on key columns | Join on index
Matches on keys           | No                    | Yes                       | Yes (index)
SQL equivalent            | UNION ALL             | JOIN                      | JOIN on index
Handles different columns | Yes (outer/inner)     | Yes                       | Yes
Direction                 | Rows or columns       | Columns only              | Columns only
Best for                  | Stacking similar data | Combining related tables  | Index-based combination

Simple rule:

  • Same structure, need to stack → concat
  • Different tables, need to match on a key → merge
  • Joining on index → join

Real-World Use Cases

ETL Pipelines — Combining Monthly Files

The most common real-world use of pd.concat() — loading multiple files and combining them into one dataset for analysis or loading into a data warehouse.

python

import glob

# Find all CSV files in a folder
files = glob.glob('data/sales_*.csv')

# Load and concatenate all files
dataframes = [pd.read_csv(f) for f in files]
combined_sales = pd.concat(dataframes, ignore_index=True)

print(f"Total records loaded: {len(combined_sales)}")
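A useful extension of this pattern is to record which file each row came from, using assign before concatenating. A sketch where the file names and the source_file column are our own illustrations (with real files, pd.read_csv(f) would replace the dictionary lookup):

```python
import pandas as pd

# Stand-ins for files found via glob.glob('data/sales_*.csv')
fake_files = {
    'data/sales_jan.csv': pd.DataFrame({'sales': [100, 200]}),
    'data/sales_feb.csv': pd.DataFrame({'sales': [300]}),
}

# Tag every row with its source file before stacking
dataframes = [df.assign(source_file=f) for f, df in fake_files.items()]
combined = pd.concat(dataframes, ignore_index=True)
print(combined)
```

The source_file column makes it trivial to trace a bad row back to the file it came from.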

Survey Data — Combining Response Batches

Survey data often arrives in multiple batches: early responses, late responses, or different distribution channels. Concat stacks them all into one analysis-ready dataset.

python

batch1 = pd.read_csv('survey_batch1.csv')
batch2 = pd.read_csv('survey_batch2.csv')
batch3 = pd.read_csv('survey_batch3.csv')

all_responses = pd.concat([batch1, batch2, batch3], ignore_index=True)
print(f"Total responses: {len(all_responses)}")

Machine Learning — Combining Train and Test Sets

After building separate train and test DataFrames, sometimes you need to concatenate them for preprocessing steps like fitting a scaler or encoder on the full dataset.

python

# Add a source label before concatenating
train_df['split'] = 'train'
test_df['split'] = 'test'

full_df = pd.concat([train_df, test_df], ignore_index=True)

# Fit preprocessor on full data
# Then split back using the 'split' column
train_processed = full_df[full_df['split'] == 'train']
test_processed = full_df[full_df['split'] == 'test']

API Data Collection — Paginated Results

When pulling data from an API with pagination, each page comes as a separate response. Concat combines all pages into one DataFrame.

python

all_pages = []

for page in range(1, 11):
    # response = requests.get(f'https://api.example.com/data?page={page}')
    # df_page = pd.DataFrame(response.json())
    df_page = pd.DataFrame({'id': range(page*10, page*10+10)})
    all_pages.append(df_page)

full_data = pd.concat(all_pages, ignore_index=True)
print(f"Total records: {len(full_data)}")

Advantages and Disadvantages of pd.concat()

Advantages

  • Simple and flexible — handles both vertical and horizontal concatenation
  • Accepts any number of DataFrames in a single call
  • Handles mismatched columns gracefully with outer and inner join options
  • Works on both DataFrames and Series
  • Efficient when used correctly — pass a list rather than concatenating one by one
  • Keys parameter provides built-in source tracking

Disadvantages

  • Does not match on key columns — not suitable for relational data combination
  • Mismatched indexes with axis=1 can produce unexpected NaN values
  • Duplicate rows are not automatically detected or removed
  • Performance degrades significantly if concatenating inside a loop instead of collecting in a list first

Common Mistakes to Avoid

  • Concatenating inside a loop instead of collecting in a list — Every concat call inside a loop creates a new DataFrame object, which is extremely slow. Always collect DataFrames in a list and call pd.concat() once at the end

python

# Wrong — slow
result = pd.DataFrame()
for df in dataframes:
    result = pd.concat([result, df])

# Correct — fast
all_dfs = []
for df in dataframes:
    all_dfs.append(df)
result = pd.concat(all_dfs, ignore_index=True)
  • Forgetting ignore_index=True — Duplicate index values after concatenation can cause confusing behavior when filtering or accessing rows by index
  • Not checking for NaN values after concatenation — Mismatched columns silently create NaN values. Always run isnull().sum() after a concat to verify
  • Using concat when merge is needed — If you need to combine DataFrames based on matching values in a key column, use merge, not concat
  • Not verifying the shape after concatenation — Always check that the resulting DataFrame has the expected number of rows and columns
  • Concatenating DataFrames with incompatible data types — If the same column has different data types across DataFrames (e.g., integer in one, string in another), pandas will upcast to object, which can break downstream analysis
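
The dtype pitfall is easy to demonstrate and to catch. A sketch showing the silent upcast to object and one way to fix it (the column name is illustrative):

```python
import pandas as pd

df_int = pd.DataFrame({'amount': [100, 200]})      # int64
df_str = pd.DataFrame({'amount': ['300', '400']})  # object (strings)

mixed = pd.concat([df_int, df_str], ignore_index=True)
print(mixed['amount'].dtype)  # object — numeric operations now misbehave

# Fix: coerce both inputs to a common type before concatenating
clean = pd.concat(
    [df_int.astype({'amount': 'int64'}),
     df_str.astype({'amount': 'int64'})],
    ignore_index=True)
print(clean['amount'].dtype)  # int64
```

A quick `result.dtypes` check after every concat catches this class of bug immediately.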

Quick Reference Cheat Sheet

Task                            | Code
Stack two DataFrames vertically | pd.concat([df1, df2], ignore_index=True)
Stack multiple DataFrames       | pd.concat([df1, df2, df3], ignore_index=True)
Keep only common columns        | pd.concat([df1, df2], join='inner', ignore_index=True)
Keep all columns (fill NaN)     | pd.concat([df1, df2], join='outer', ignore_index=True)
Add columns side by side        | pd.concat([df1, df2], axis=1)
Track source with keys          | pd.concat([df1, df2], keys=['Q1', 'Q2'])
Combine files in a loop         | pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
Sort columns alphabetically     | pd.concat([df1, df2], sort=True, ignore_index=True)

Concatenating DataFrames is one of the most fundamental operations in pandas and pd.concat() is the tool that makes it fast, flexible, and reliable.

Here is a quick recap of everything we covered:

  • pd.concat() combines DataFrames by stacking rows (axis=0) or columns (axis=1)
  • Always use ignore_index=True when stacking rows unless you need the original indexes
  • Use join='outer' to keep all columns and join='inner' to keep only common columns
  • Use keys to track which original DataFrame each row came from
  • Collect DataFrames in a list and call pd.concat() once — never concatenate inside a loop
  • Always verify your result with shape, isnull().sum(), and a quick head/tail check
  • Use merge when you need to combine based on key column matching, not concat

Once you are comfortable with pd.concat(), combining data from multiple sources becomes one of the fastest and most routine steps in your data analysis workflow.

FAQs

What is the difference between pd.concat and pd.merge in pandas?

pd.concat stacks DataFrames by adding rows or columns without matching on key values. pd.merge combines DataFrames by matching rows based on shared key columns, similar to a SQL JOIN.

How do I concatenate DataFrames with different columns?

Use pd.concat() with join='outer' to keep all columns (filling missing values with NaN) or join='inner' to keep only the columns that exist in all DataFrames.

How do I reset the index after concatenating DataFrames?

Use ignore_index=True in pd.concat() to reset the index to a clean sequential range starting from 0.

Can I concatenate more than two DataFrames at once?

Yes. Pass a list of any number of DataFrames to pd.concat(), for example pd.concat([df1, df2, df3, df4], ignore_index=True).

Why do I get NaN values after concatenating DataFrames?

NaN values appear when the DataFrames have different columns (with join='outer') or when using axis=1 with DataFrames that have different index values. Check your column names and index alignment.

Is pd.concat() slow?

pd.concat() is fast when called once on a list of DataFrames. It becomes slow when called repeatedly inside a loop. Always collect DataFrames in a list first and call pd.concat() once at the end.
