How to Concatenate DataFrames in Pandas

One of the most common tasks in any data analysis or data engineering project is combining data from multiple sources into a single, unified dataset.

Maybe you have monthly sales files that need to be stacked into one annual dataset. Maybe you have survey results split across multiple tables that need to be combined. Maybe you are pulling data from different API calls and need to bring them all together.

In pandas, the primary tool for this job is pd.concat(), and understanding how to use it correctly will save you a significant amount of time and frustration.

In this guide, we will walk through everything you need to know about concatenating DataFrames in pandas — what it means, how pd.concat works, all the key parameters, common patterns, and how it compares to merge and join.

What Is Concatenation in Pandas?

Concatenation means combining two or more DataFrames together by stacking them either vertically (adding more rows) or horizontally (adding more columns).

Simple Analogy

Think of DataFrames like spreadsheets.

Vertical concatenation is like taking two spreadsheets with the same columns and stacking one on top of the other to create one longer spreadsheet.

Horizontal concatenation is like placing two spreadsheets side by side to create one wider spreadsheet.

Both operations are done with pd.concat() in pandas.

Setting Up the Example DataFrames

We will use consistent example DataFrames throughout this guide.

python

import pandas as pd
import numpy as np

# Q1 Sales Data
df1 = pd.DataFrame({
    'employee_id': [101, 102, 103],
    'name': ['Alice', 'Bob', 'Charlie'],
    'department': ['Engineering', 'Marketing', 'Sales'],
    'q1_sales': [45000, 32000, 51000]
})

# Q2 Sales Data (intentionally reuses df1's column names so the rows stack cleanly)
df2 = pd.DataFrame({
    'employee_id': [101, 102, 104],
    'name': ['Alice', 'Bob', 'Diana'],
    'department': ['Engineering', 'Marketing', 'HR'],
    'q1_sales': [48000, 35000, 29000]
})

# Q3 Sales Data — different column name
df3 = pd.DataFrame({
    'employee_id': [103, 104, 105],
    'name': ['Charlie', 'Diana', 'Eve'],
    'department': ['Sales', 'HR', 'Engineering'],
    'q3_sales': [55000, 31000, 42000]
})

print("df1:")
print(df1)
print("\ndf2:")
print(df2)

df1:

   employee_id     name   department  q1_sales
0          101    Alice  Engineering     45000
1          102      Bob    Marketing     32000
2          103  Charlie        Sales     51000

df2:

   employee_id   name   department  q1_sales
0          101  Alice  Engineering     48000
1          102    Bob    Marketing     35000
2          104  Diana           HR     29000

The pd.concat() Function

Basic Syntax

python

pd.concat(
    objs,           # List of DataFrames to concatenate
    axis=0,         # 0 = vertical (rows), 1 = horizontal (columns)
    join='outer',   # 'outer' = keep all columns, 'inner' = keep common columns
    ignore_index=False,  # Reset index or keep original
    keys=None,      # Add hierarchical index labels
    sort=False      # Sort columns alphabetically
)

Let us go through every parameter with practical examples.

Method 1: Basic Vertical Concatenation (Stacking Rows)

The most common use case is stacking two DataFrames with the same columns on top of each other.

python

result = pd.concat([df1, df2])
print(result)

Output:

   employee_id     name   department  q1_sales
0          101    Alice  Engineering     45000
1          102      Bob    Marketing     32000
2          103  Charlie        Sales     51000
0          101    Alice  Engineering     48000
1          102      Bob    Marketing     35000
2          104    Diana           HR     29000

Notice the index: it shows 0, 1, 2, 0, 1, 2. The original indexes from both DataFrames are preserved. This can cause problems if you try to access rows by index later. The fix is ignore_index.
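To see why duplicate labels are a problem, try selecting label 0 from the stacked result: .loc returns every row carrying that label, not a single row. A minimal sketch (the DataFrames are trimmed to two columns for brevity):

```python
import pandas as pd

df1 = pd.DataFrame({'employee_id': [101, 102, 103],
                    'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'employee_id': [101, 102, 104],
                    'name': ['Alice', 'Bob', 'Diana']})

result = pd.concat([df1, df2])  # index is 0, 1, 2, 0, 1, 2

# .loc[0] matches BOTH rows labeled 0, so it returns a 2-row DataFrame
print(result.loc[0])
print(len(result.loc[0]))  # 2
```

Code that expects `result.loc[0]` to be a single row will break silently here, which is exactly why the next method exists.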

Method 2: Resetting the Index with ignore_index

python

result = pd.concat([df1, df2], ignore_index=True)
print(result)

Output:

   employee_id     name   department  q1_sales
0          101    Alice  Engineering     45000
1          102      Bob    Marketing     32000
2          103  Charlie        Sales     51000
3          101    Alice  Engineering     48000
4          102      Bob    Marketing     35000
5          104    Diana           HR     29000

Now the index runs continuously from 0 to 5. This is almost always what you want when stacking DataFrames vertically. Always use ignore_index=True unless you have a specific reason to preserve the original indexes.

Method 3: Concatenating More Than Two DataFrames

pd.concat() accepts a list of any number of DataFrames.

python

# Monthly sales data
jan = pd.DataFrame({'month': ['Jan'], 'sales': [120000]})
feb = pd.DataFrame({'month': ['Feb'], 'sales': [135000]})
mar = pd.DataFrame({'month': ['Mar'], 'sales': [148000]})
apr = pd.DataFrame({'month': ['Apr'], 'sales': [112000]})

# Concatenate all four at once
annual = pd.concat([jan, feb, mar, apr], ignore_index=True)
print(annual)

Output:

  month   sales
0   Jan  120000
1   Feb  135000
2   Mar  148000
3   Apr  112000

This is extremely useful in real-world workflows where you load multiple files in a loop and concatenate them all at once.

Method 4: Concatenating Multiple Files in a Loop

One of the most common real-world patterns — reading multiple CSV files and combining them into a single DataFrame.

python

# Simulating multiple monthly files
dataframes = []

for month in ['january', 'february', 'march']:
    # In practice: df = pd.read_csv(f'{month}_sales.csv')
    df = pd.DataFrame({
        'month': [month],
        'revenue': [np.random.randint(100000, 200000)]
    })
    dataframes.append(df)

combined = pd.concat(dataframes, ignore_index=True)
print(combined)

The pattern:

  1. Create an empty list
  2. Loop through files or data sources
  3. Append each DataFrame to the list
  4. Call pd.concat() on the list once at the end

This is far more efficient than concatenating inside the loop. Concatenating one by one creates a new DataFrame on every iteration, which is slow on large datasets.

Method 5: Handling Different Columns — join Parameter

When the DataFrames you are concatenating have different columns, the join parameter controls what happens.

join='outer' (Default) — Keep All Columns

python

result_outer = pd.concat([df1, df3], join='outer', ignore_index=True)
print(result_outer)

Output:

   employee_id     name   department  q1_sales  q3_sales
0          101    Alice  Engineering   45000.0       NaN
1          102      Bob    Marketing   32000.0       NaN
2          103  Charlie        Sales   51000.0       NaN
3          103  Charlie        Sales       NaN   55000.0
4          104    Diana           HR       NaN   31000.0
5          105      Eve  Engineering       NaN   42000.0

With join='outer', all columns from both DataFrames are kept. Rows from df1 get NaN in the q3_sales column (which only exists in df3), and rows from df3 get NaN in the q1_sales column.
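If those NaN values get in the way of downstream math, a common follow-up is to fill them, for example treating a missing quarter as zero sales. A sketch using the same q1_sales/q3_sales shape (trimmed to the relevant columns):

```python
import pandas as pd

df1 = pd.DataFrame({'employee_id': [101, 102, 103],
                    'q1_sales': [45000, 32000, 51000]})
df3 = pd.DataFrame({'employee_id': [103, 104, 105],
                    'q3_sales': [55000, 31000, 42000]})

result = pd.concat([df1, df3], join='outer', ignore_index=True)

# Treat a missing quarter as zero sales for aggregation purposes
filled = result.fillna({'q1_sales': 0, 'q3_sales': 0})
print(filled)
```

Whether 0 is the right fill value depends on your analysis; sometimes leaving NaN (so means and counts stay honest) is the better choice.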

join='inner' — Keep Only Common Columns

python

result_inner = pd.concat([df1, df3], join='inner', ignore_index=True)
print(result_inner)

Output:

   employee_id     name   department
0          101    Alice  Engineering
1          102      Bob    Marketing
2          103  Charlie        Sales
3          103  Charlie        Sales
4          104    Diana           HR
5          105      Eve  Engineering

With join='inner', only the columns that exist in both DataFrames are kept — employee_id, name, and department. The q1_sales and q3_sales columns are dropped because neither exists in both DataFrames.
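An alternative to dropping the sales columns entirely is to rename them to a common name before concatenating and record the quarter in its own column. A sketch under our own naming choices ('sales' and 'quarter' are labels we introduce, not part of the original data):

```python
import pandas as pd

df1 = pd.DataFrame({'employee_id': [101, 102, 103],
                    'q1_sales': [45000, 32000, 51000]})
df3 = pd.DataFrame({'employee_id': [103, 104, 105],
                    'q3_sales': [55000, 31000, 42000]})

# Unify the column names, tagging each row with its quarter
q1 = df1.rename(columns={'q1_sales': 'sales'}).assign(quarter='Q1')
q3 = df3.rename(columns={'q3_sales': 'sales'}).assign(quarter='Q3')

tidy = pd.concat([q1, q3], ignore_index=True)
print(tidy)  # one 'sales' column, no NaN, quarter preserved
```

This "rename, tag, stack" pattern produces long-format data that groups and pivots cleanly.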

Method 6: Horizontal Concatenation (Adding Columns) — axis=1

Setting axis=1 places DataFrames side by side, adding more columns rather than more rows.

python

# Employee basic info
df_info = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'department': ['Engineering', 'Marketing', 'Sales']
})

# Employee performance scores
df_scores = pd.DataFrame({
    'q1_score': [92, 78, 85],
    'q2_score': [95, 82, 88]
})

# Combine side by side
result = pd.concat([df_info, df_scores], axis=1)
print(result)

Output:

      name   department  q1_score  q2_score
0    Alice  Engineering        92        95
1      Bob    Marketing        78        82
2  Charlie        Sales        85        88

The two DataFrames are aligned by index, so row 0 of df_info lands next to row 0 of df_scores. This works correctly here because both DataFrames share the same default index (0, 1, 2).

Warning: Index Alignment with axis=1

When using axis=1, pandas aligns rows by their index values, not by position. If your DataFrames have different indexes, you can get unexpected NaN values.

python

df_a = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
df_b = pd.DataFrame({'B': [4, 5, 6]}, index=[1, 2, 3])

result = pd.concat([df_a, df_b], axis=1)
print(result)

Output:

     A    B
0  1.0  NaN
1  2.0  4.0
2  3.0  5.0
3  NaN  6.0

Because index 0 only exists in df_a and index 3 only exists in df_b, those rows get NaN for the missing values. Always align your indexes before using axis=1 concatenation.
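If positional alignment (row 0 next to row 0) is what you actually want, reset both indexes first. A sketch using the same df_a and df_b:

```python
import pandas as pd

df_a = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
df_b = pd.DataFrame({'B': [4, 5, 6]}, index=[1, 2, 3])

# drop=True discards the old index instead of keeping it as a column
aligned = pd.concat([df_a.reset_index(drop=True),
                     df_b.reset_index(drop=True)], axis=1)
print(aligned)  # 3 rows, no NaN
```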

Method 7: Adding Hierarchical Keys with the keys Parameter

The keys parameter adds a hierarchical index (MultiIndex) to your concatenated DataFrame, making it easy to identify which original DataFrame each row came from.

python

result = pd.concat([df1, df2], keys=['Q1', 'Q2'])
print(result)

Output:

      employee_id     name   department  q1_sales
Q1 0          101    Alice  Engineering     45000
   1          102      Bob    Marketing     32000
   2          103  Charlie        Sales     51000
Q2 0          101    Alice  Engineering     48000
   1          102      Bob    Marketing     35000
   2          104    Diana           HR     29000

Now you can access data from each original DataFrame by its key:

python

# Access only Q1 data
print(result.loc['Q1'])

# Access only Q2 data
print(result.loc['Q2'])

This is very useful when you need to track the source of each row after combining multiple DataFrames.
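If you would rather have the source as an ordinary column than a MultiIndex level, pd.concat also accepts a names parameter for labeling the key levels, which combines nicely with reset_index. A sketch ('quarter' is a label we choose; the DataFrames are trimmed to one column):

```python
import pandas as pd

df1 = pd.DataFrame({'employee_id': [101, 102, 103]})
df2 = pd.DataFrame({'employee_id': [101, 102, 104]})

# names labels the MultiIndex levels: the key level and the original index
result = pd.concat([df1, df2], keys=['Q1', 'Q2'], names=['quarter', None])

# Move the 'quarter' level into a regular column, then flatten the index
flat = result.reset_index(level='quarter').reset_index(drop=True)
print(flat)
```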

Method 8: Sorting Columns with sort Parameter

When concatenating DataFrames with different columns, the resulting column order can be unpredictable. Use sort=True to sort columns alphabetically.

python

result = pd.concat([df1, df3], join='outer', ignore_index=True, sort=True)
print(result)

Output:

Columns are now in alphabetical order: department, employee_id, name, q1_sales, q3_sales.

Method 9: Concatenating Series

pd.concat() works on Series as well as DataFrames.

python

s1 = pd.Series([1, 2, 3], name='A')
s2 = pd.Series([4, 5, 6], name='B')

# Stack Series vertically
vertical = pd.concat([s1, s2], ignore_index=True)
print(vertical)

# Place Series side by side as columns
horizontal = pd.concat([s1, s2], axis=1)
print(horizontal)

Vertical Output:

0    1
1    2
2    3
3    4
4    5
5    6

Horizontal Output:

   A  B
0  1  4
1  2  5
2  3  6

Concatenating Series with axis=1 creates a DataFrame where each Series becomes a column. This is a very clean way to build DataFrames from multiple Series.

Method 10: Verifying the Concatenated Result

After concatenating, always verify your result to catch problems early.

python

result = pd.concat([df1, df2], ignore_index=True)

# Check shape — did you get the expected number of rows and columns?
print(f"Shape: {result.shape}")

# Check for unexpected NaN values
print(f"\nMissing values:\n{result.isnull().sum()}")

# Check for duplicate rows
print(f"\nDuplicate rows: {result.duplicated().sum()}")

# Preview the result
print(f"\nFirst few rows:\n{result.head()}")
print(f"\nLast few rows:\n{result.tail()}")

These checks take seconds and can save you from building analysis on top of a broken concatenation.
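In a pipeline, the same checks can be hardened into assertions so the job fails loudly instead of silently producing a bad dataset. A sketch with hypothetical toy data:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'v': [10, 20, 30]})
df2 = pd.DataFrame({'id': [4, 5], 'v': [40, 50]})

result = pd.concat([df1, df2], ignore_index=True)

# Row count must equal the sum of the inputs, columns must be unchanged
assert len(result) == len(df1) + len(df2)
assert list(result.columns) == list(df1.columns)
# No NaN should appear when the inputs share identical columns
assert result.isnull().sum().sum() == 0
print("concat checks passed")
```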

pd.concat() vs merge() vs join() — What Is the Difference?

A common question beginners ask is when to use concat vs merge vs join. They all combine DataFrames, but they serve very different purposes.

pd.concat()

Combines DataFrames by stacking them — either adding more rows (axis=0) or more columns (axis=1). It does not match rows based on key values, it simply places DataFrames next to or on top of each other.

python

# Stack rows from two DataFrames
pd.concat([df1, df2], ignore_index=True)

pd.merge()

Combines DataFrames by matching rows based on one or more key columns, similar to SQL JOIN operations. Use merge when you need to bring related data together based on a shared identifier.

python

# Combine based on matching employee_id
pd.merge(df_employees, df_salaries, on='employee_id', how='left')

DataFrame.join()

Similar to merge but joins on the index by default. Most useful when the DataFrames share a meaningful index.

python

# Join on index
df_employees.join(df_scores)

Feature                   | pd.concat()           | pd.merge()                | df.join()
Primary use               | Stack DataFrames      | Match rows on key columns | Join on index
Matches on keys           | No                    | Yes                       | Yes (index)
SQL equivalent            | UNION ALL             | JOIN                      | JOIN on index
Handles different columns | Yes (outer/inner)     | Yes                       | Yes
Direction                 | Rows or columns       | Columns only              | Columns only
Best for                  | Stacking similar data | Combining related tables  | Index-based combination

Simple rule:

  • Same structure, need to stack → concat
  • Different tables, need to match on a key → merge
  • Joining on index → join

Real-World Use Cases

ETL Pipelines — Combining Monthly Files

The most common real-world use of pd.concat() — loading multiple files and combining them into one dataset for analysis or loading into a data warehouse.

python

import glob

# Find all CSV files in a folder
files = glob.glob('data/sales_*.csv')

# Load and concatenate all files
dataframes = [pd.read_csv(f) for f in files]
combined_sales = pd.concat(dataframes, ignore_index=True)

print(f"Total records loaded: {len(combined_sales)}")
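A useful extension of this pattern is to record which file each row came from, using assign before concatenating. A sketch where the file names and the source_file column are our own illustrations (with real files, pd.read_csv(f) would replace the dictionary lookup):

```python
import pandas as pd

# Stand-ins for files found via glob.glob('data/sales_*.csv')
fake_files = {
    'data/sales_jan.csv': pd.DataFrame({'sales': [100, 200]}),
    'data/sales_feb.csv': pd.DataFrame({'sales': [300]}),
}

# Tag every row with its source file before stacking
dataframes = [df.assign(source_file=f) for f, df in fake_files.items()]
combined = pd.concat(dataframes, ignore_index=True)
print(combined)
```

The source_file column makes it trivial to trace a bad row back to the file it came from.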

Survey Data — Combining Response Batches

Survey data often arrives in multiple batches: early responses, late responses, or different distribution channels. Concat stacks them all into one analysis-ready dataset.

python

batch1 = pd.read_csv('survey_batch1.csv')
batch2 = pd.read_csv('survey_batch2.csv')
batch3 = pd.read_csv('survey_batch3.csv')

all_responses = pd.concat([batch1, batch2, batch3], ignore_index=True)
print(f"Total responses: {len(all_responses)}")

Machine Learning — Combining Train and Test Sets

After building separate train and test DataFrames, sometimes you need to concatenate them for preprocessing steps like fitting a scaler or encoder on the full dataset.

python

# Add a source label before concatenating
train_df['split'] = 'train'
test_df['split'] = 'test'

full_df = pd.concat([train_df, test_df], ignore_index=True)

# Fit preprocessor on full data
# Then split back using the 'split' column
train_processed = full_df[full_df['split'] == 'train']
test_processed = full_df[full_df['split'] == 'test']

API Data Collection — Paginated Results

When pulling data from an API with pagination, each page comes as a separate response. Concat combines all pages into one DataFrame.

python

all_pages = []

for page in range(1, 11):
    # response = requests.get(f'https://api.example.com/data?page={page}')
    # df_page = pd.DataFrame(response.json())
    df_page = pd.DataFrame({'id': range(page*10, page*10+10)})
    all_pages.append(df_page)

full_data = pd.concat(all_pages, ignore_index=True)
print(f"Total records: {len(full_data)}")

Advantages and Disadvantages of pd.concat()

Advantages

  • Simple and flexible — handles both vertical and horizontal concatenation
  • Accepts any number of DataFrames in a single call
  • Handles mismatched columns gracefully with outer and inner join options
  • Works on both DataFrames and Series
  • Efficient when used correctly — pass a list rather than concatenating one by one
  • Keys parameter provides built-in source tracking

Disadvantages

  • Does not match on key columns — not suitable for relational data combination
  • Mismatched indexes with axis=1 can produce unexpected NaN values
  • Duplicate rows are not automatically detected or removed
  • Performance degrades significantly if concatenating inside a loop instead of collecting in a list first

Common Mistakes to Avoid

  • Concatenating inside a loop instead of collecting in a list — Every concat call inside a loop creates a new DataFrame object, which is extremely slow. Always collect DataFrames in a list and call pd.concat() once at the end

python

# Wrong — slow
result = pd.DataFrame()
for df in dataframes:
    result = pd.concat([result, df])

# Correct — fast
all_dfs = []
for df in dataframes:
    all_dfs.append(df)
result = pd.concat(all_dfs, ignore_index=True)
  • Forgetting ignore_index=True — Duplicate index values after concatenation can cause confusing behavior when filtering or accessing rows by index
  • Not checking for NaN values after concatenation — Mismatched columns silently create NaN values. Always run isnull().sum() after a concat to verify
  • Using concat when merge is needed — If you need to combine DataFrames based on matching values in a key column, use merge, not concat
  • Not verifying the shape after concatenation — Always check that the resulting DataFrame has the expected number of rows and columns
  • Concatenating DataFrames with incompatible data types — If the same column has different data types across DataFrames (e.g., integer in one, string in another), pandas will upcast to object, which can break downstream analysis
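
The dtype pitfall is easy to demonstrate and to catch. A sketch showing the silent upcast to object and one way to fix it (the column name is illustrative):

```python
import pandas as pd

df_int = pd.DataFrame({'amount': [100, 200]})      # int64
df_str = pd.DataFrame({'amount': ['300', '400']})  # object (strings)

mixed = pd.concat([df_int, df_str], ignore_index=True)
print(mixed['amount'].dtype)  # object — numeric operations now misbehave

# Fix: coerce both inputs to a common type before concatenating
clean = pd.concat(
    [df_int.astype({'amount': 'int64'}),
     df_str.astype({'amount': 'int64'})],
    ignore_index=True)
print(clean['amount'].dtype)  # int64
```

A quick `result.dtypes` check after every concat catches this class of bug immediately.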

Quick Reference Cheat Sheet

Task                            | Code
Stack two DataFrames vertically | pd.concat([df1, df2], ignore_index=True)
Stack multiple DataFrames       | pd.concat([df1, df2, df3], ignore_index=True)
Keep only common columns        | pd.concat([df1, df2], join='inner', ignore_index=True)
Keep all columns (fill NaN)     | pd.concat([df1, df2], join='outer', ignore_index=True)
Add columns side by side        | pd.concat([df1, df2], axis=1)
Track source with keys          | pd.concat([df1, df2], keys=['Q1', 'Q2'])
Combine files in a loop         | pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
Sort columns alphabetically     | pd.concat([df1, df2], sort=True, ignore_index=True)

Concatenating DataFrames is one of the most fundamental operations in pandas and pd.concat() is the tool that makes it fast, flexible, and reliable.

Here is a quick recap of everything we covered:

  • pd.concat() combines DataFrames by stacking rows (axis=0) or columns (axis=1)
  • Always use ignore_index=True when stacking rows unless you need the original indexes
  • Use join='outer' to keep all columns and join='inner' to keep only common columns
  • Use keys to track which original DataFrame each row came from
  • Collect DataFrames in a list and call pd.concat() once — never concatenate inside a loop
  • Always verify your result with shape, isnull().sum(), and a quick head/tail check
  • Use merge when you need to combine based on key column matching, not concat

Once you are comfortable with pd.concat(), combining data from multiple sources becomes one of the fastest and most routine steps in your data analysis workflow.

FAQs

What is the difference between pd.concat and pd.merge in pandas?

pd.concat stacks DataFrames by adding rows or columns without matching on key values. pd.merge combines DataFrames by matching rows based on shared key columns, similar to a SQL JOIN.

How do I concatenate DataFrames with different columns?

Use pd.concat() with join='outer' to keep all columns (filling missing values with NaN) or join='inner' to keep only the columns that exist in all DataFrames.

How do I reset the index after concatenating DataFrames?

Use ignore_index=True in pd.concat() to reset the index to a clean sequential range starting from 0.

Can I concatenate more than two DataFrames at once?

Yes. Pass a list of any number of DataFrames to pd.concat(), for example pd.concat([df1, df2, df3, df4], ignore_index=True).

Why do I get NaN values after concatenating DataFrames?

NaN values appear when the DataFrames have different columns (with join='outer') or when using axis=1 with DataFrames that have different index values. Check your column names and index alignment.

Is pd.concat() slow?

pd.concat() is fast when called once on a list of DataFrames. It becomes slow when called repeatedly inside a loop. Always collect DataFrames in a list first and call pd.concat() once at the end.
