Have you ever received data split across dozens of CSV files (monthly sales reports exported one file per month, sensor readings saved one file per day, customer records divided by region into separate files) and needed to analyse all of it together as one unified dataset?
In most cases, you might try to open each file in Excel, copy the data, paste it into a master sheet, and repeat that process for every file. Or you might manually concatenate files using command line tools, only to find the headers repeat in the middle of your combined output or encoding differences corrupt some of the values.
This is tedious, error-prone, and completely impractical when you have hundreds of files to combine.
Python pandas solves this in a few lines of code. You point pandas at a folder of CSV files, it reads every file into a structured DataFrame, and combines them all into one unified dataset that you can clean, analyse, and export however you need. The same script works whether you have three files or three thousand.
In this guide we will break down merging multiple CSV files with pandas completely — what the different approaches are, how to write the code step by step, real-world examples for different scenarios, and when to use simple concatenation versus a keyed merge.
What Is CSV Merging in Pandas?
Merging CSV files in pandas means reading multiple separate CSV files into individual DataFrames and combining them into a single DataFrame either by stacking them vertically (adding rows from each file below the previous one) or by joining them horizontally (aligning rows from different files based on a shared key column).
It answers the question: “I have the same kind of data spread across many files — how do I bring it all together so I can work with it as one dataset?”
Simple Analogy
Imagine you run a chain of restaurants and each location sends you a daily sales report as a separate spreadsheet. To understand total company performance you need all those individual reports combined into one master sheet. Doing it manually means opening fifty files and copying rows one at a time. Pandas does the equivalent automatically — reading every file, stacking the data, and handing you one complete table in seconds.
The Two Fundamental Approaches
Every CSV merging operation in pandas comes down to two approaches:
Concatenation (vertical stacking) means combining files that contain the same columns — appending rows from each file below the rows of the previous file. Monthly reports for the same metrics, daily logs with the same fields, and regional exports with the same structure all suit concatenation. The result is a taller DataFrame with more rows.
Merging (horizontal joining) means combining files that share a key column but contain different columns — joining them side by side, aligned on the shared key. Customer demographics in one file, customer purchase history in another, and customer support tickets in a third, all joined on customer ID. The result is a wider DataFrame with more columns.
Setting Up Your Environment
Before writing any code, make sure pandas is installed:
```bash
pip install pandas
```
For reading large numbers of files efficiently, the glob module from Python's standard library is also essential: it finds files matching a pattern without requiring you to list every filename manually.
```python
import pandas as pd
import glob
import os
```
No additional installation is needed for glob or os; they are part of Python's standard library.
Step by Step: Your First CSV Merge
Example 1: Concatenate Monthly Sales CSV Files
Setup — a folder containing twelve monthly sales files:
```
/sales_data/
    sales_jan_2024.csv
    sales_feb_2024.csv
    sales_mar_2024.csv
    ...
    sales_dec_2024.csv
```
Each file has identical columns: date, region, product, units_sold, revenue.
You want one combined DataFrame containing all twelve months of data.
Step 1: Find all CSV files in the folder
```python
import pandas as pd
import glob

file_pattern = '/sales_data/*.csv'
all_files = glob.glob(file_pattern)

print(f"Found {len(all_files)} files")
print(all_files)
```
glob.glob() returns a list of file paths matching the pattern. The * wildcard matches any filename ending in .csv. Always print the file list first to confirm you are reading exactly the files you expect — not more, not fewer.
Step 2: Read each file into a DataFrame and store in a list
```python
dataframes = []
for file in all_files:
    df = pd.read_csv(file)
    dataframes.append(df)
```
This loop reads each CSV file one at a time and appends the resulting DataFrame to a list. At this point you have twelve separate DataFrames stored in memory — one per file.
Step 3: Concatenate all DataFrames into one
```python
combined_df = pd.concat(dataframes, ignore_index=True)

print(f"Combined shape: {combined_df.shape}")
print(combined_df.head())
```
pd.concat() stacks all DataFrames vertically. ignore_index=True resets the row index in the combined DataFrame so it runs 0, 1, 2, 3 continuously rather than restarting from 0 for each file’s data.
Step 4: Verify the result
```python
print(combined_df.shape)               # (total_rows, columns)
print(combined_df.dtypes)              # check data types are correct
print(combined_df.isnull().sum())      # check for unexpected nulls
print(combined_df.duplicated().sum())  # check for duplicate rows
```
Always verify after merging. The row count should equal the sum of rows across all individual files. Unexpected nulls often signal mismatched column names across files. Duplicate rows often signal a file was included twice.
Step 5: Export the combined DataFrame
```python
combined_df.to_csv('/output/sales_2024_combined.csv', index=False)
print("Export complete.")
```
index=False prevents pandas from writing the DataFrame’s row index as an extra column in the output file.
Best practice: Always print the shape before and after merging. If you expect 120,000 rows from twelve files of 10,000 rows each and the combined DataFrame has 130,000 rows, a file was read twice or one file had more rows than expected. Catching this immediately saves hours of debugging downstream.
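That best practice can be encoded as a quick sanity assertion. A minimal sketch with made-up stand-in DataFrames (in a real run, use the list built in Step 2):

```python
import pandas as pd

# Hypothetical stand-ins for the per-file DataFrames read in Step 2.
dataframes = [
    pd.DataFrame({'region': ['North', 'South'], 'revenue': [1000, 1500]}),
    pd.DataFrame({'region': ['East'], 'revenue': [1200]}),
]

combined_df = pd.concat(dataframes, ignore_index=True)

# The combined row count must equal the sum of the individual row counts.
expected_rows = sum(len(df) for df in dataframes)
assert combined_df.shape[0] == expected_rows, (
    f"Expected {expected_rows} rows, got {combined_df.shape[0]}"
)
print(f"OK: {combined_df.shape[0]} rows from {len(dataframes)} files")
```

If the assertion fails, stop and inspect the file list before doing anything downstream.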
How Pandas CSV Merging Works Internally
When pandas reads a CSV file with pd.read_csv(), it parses the file into a DataFrame: a two-dimensional table with typed columns, a row index, and column labels drawn from the header row. Each DataFrame is an independent in-memory object.
pd.concat() takes a list of DataFrames and aligns them by column name, not by column position. If file A has columns in the order date, region, revenue and file B has them in the order region, date, revenue, pandas still concatenates them correctly because it matches on column names. Columns that exist in some files but not others are filled with NaN in the rows from files where that column was absent.
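A tiny sketch with made-up data makes the alignment behaviour concrete: two frames with differently ordered columns, one of which has an extra column.

```python
import pandas as pd

# File A's columns in one order; file B's in another, with an extra column.
df_a = pd.DataFrame({'date': ['Jan'], 'region': ['North'], 'revenue': [1000]})
df_b = pd.DataFrame({'region': ['South'], 'date': ['Feb'],
                     'revenue': [1500], 'units': [30]})

combined = pd.concat([df_a, df_b], ignore_index=True)

# Columns are matched by name despite the different order, and the
# 'units' column (absent from df_a) is NaN in df_a's row.
print(combined)
print(combined['units'].isna().tolist())  # [True, False]
```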
pd.merge() takes two DataFrames and aligns them by the values in a shared key column, similar to a SQL JOIN. Rows from the left DataFrame are matched against rows in the right DataFrame where the key values are equal, and the columns from both DataFrames appear side by side in the result.
The Difference Between concat and merge
pd.concat — stacks DataFrames vertically (more rows, same columns)
```
DataFrame A:        DataFrame B:        Result:
date | revenue      date | revenue      date | revenue
Jan  | 1000         Apr  | 2200         Jan  | 1000
Feb  | 1500         May  | 1800         Feb  | 1500
Mar  | 1200                             Mar  | 1200
                                        Apr  | 2200
                                        May  | 1800
```
pd.merge — joins DataFrames horizontally (same rows, more columns)
```
DataFrame A:        DataFrame B:        Result:
id | revenue        id | region         id | revenue | region
1  | 1000           1  | North          1  | 1000    | North
2  | 1500           2  | South          2  | 1500    | South
3  | 1200           3  | East           3  | 1200    | East
```
Real World Examples
Example 2: One Line Concatenation Using a List Comprehension
The loop from Example 1 can be compressed into a single readable line using a list comprehension, which is the approach most experienced pandas users prefer:
```python
import pandas as pd
import glob

combined_df = pd.concat(
    [pd.read_csv(f) for f in glob.glob('/sales_data/*.csv')],
    ignore_index=True
)
combined_df.to_csv('/output/combined.csv', index=False)
```
This reads every matching CSV file, creates a DataFrame for each, and concatenates them all in one expression. It is functionally identical to the loop approach but more concise. Use the loop version when you need to add per file processing steps like adding a filename column. Use the list comprehension when a straightforward stack is all you need.
Example 3: Adding a Source File Column to Track Which File Each Row Came From
Goal: After combining files, you need to know which file each row originated from; this is essential for auditing, debugging, and source-based filtering.
```python
import pandas as pd
import glob
import os

dataframes = []
for file in glob.glob('/sales_data/*.csv'):
    df = pd.read_csv(file)
    df['source_file'] = os.path.basename(file)
    dataframes.append(df)

combined_df = pd.concat(dataframes, ignore_index=True)
print(combined_df['source_file'].value_counts())
```
os.path.basename(file) extracts just the filename from the full path — “sales_jan_2024.csv” instead of “/sales_data/sales_jan_2024.csv”. Adding this as a column before concatenation means every row in the combined DataFrame carries its origin file as a value you can filter and group on.
Result — source_file column in combined DataFrame:
| date | region | revenue | source_file |
|---|---|---|---|
| 2024-01-05 | North | 1200 | sales_jan_2024.csv |
| 2024-01-12 | South | 980 | sales_jan_2024.csv |
| 2024-02-03 | East | 1450 | sales_feb_2024.csv |
This column is invaluable when a data quality issue appears in the combined dataset and you need to trace it back to the specific source file.
Example 4: Merging CSV Files With Different Column Sets
Goal: Some files have extra columns that others lack. Combine them without losing data from the files that have the extra columns.
```python
import pandas as pd
import glob

dataframes = [pd.read_csv(f) for f in glob.glob('/data/*.csv')]
combined_df = pd.concat(dataframes, ignore_index=True, join='outer')

print("Columns in combined:", combined_df.columns.tolist())
print("Null counts:\n", combined_df.isnull().sum())
```
join='outer' (the default) keeps all columns from all files, filling with NaN where a file did not have that column. join='inner' keeps only columns that appear in every file — useful when you want a clean result with no NaN columns but at the cost of dropping columns from files that have extra data.
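The contrast is easy to see on two toy frames with mismatched columns (illustrative data, not from the examples above):

```python
import pandas as pd

df_a = pd.DataFrame({'date': ['Jan'], 'revenue': [1000]})
df_b = pd.DataFrame({'date': ['Feb'], 'revenue': [1500], 'discount_pct': [5]})

# join='outer' keeps the union of columns; join='inner' keeps the intersection.
outer = pd.concat([df_a, df_b], ignore_index=True, join='outer')
inner = pd.concat([df_a, df_b], ignore_index=True, join='inner')

print(outer.columns.tolist())  # ['date', 'revenue', 'discount_pct']
print(inner.columns.tolist())  # ['date', 'revenue']
```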
Handling the NaN values after merging:
```python
combined_df['discount_pct'] = combined_df['discount_pct'].fillna(0)
combined_df['promo_code'] = combined_df['promo_code'].fillna('NONE')
```
Fill numeric NaN values with a sensible default (0 for a discount percentage that simply was not recorded). Fill string NaN values with a placeholder that downstream analysis can handle correctly.
Example 5: Merging CSV Files on a Shared Key Column
Goal: Three separate CSV files contain different information about the same customers. Combine them into one wide DataFrame aligned on customer ID.
```python
import pandas as pd

demographics = pd.read_csv('/data/customer_demographics.csv')
purchases = pd.read_csv('/data/customer_purchases.csv')
support = pd.read_csv('/data/customer_support.csv')

merged = (
    demographics
    .merge(purchases, on='customer_id', how='left')
    .merge(support, on='customer_id', how='left')
)

print(merged.shape)
print(merged.head())
```
Chaining .merge() calls combines three DataFrames in sequence. Each merge adds the columns from the next file aligned on customer_id.
Merge types and when to use each:
how='inner' keeps only customer IDs that appear in both DataFrames. Use when you only want customers with data in every file.
how='left' keeps all rows from the left DataFrame and fills with NaN where the right DataFrame has no matching row. Use when the left file is your master customer list and you want every customer to appear even if they have no purchases or support tickets.
how='outer' keeps all rows from both DataFrames, filling NaN on both sides where matches are absent. Use when neither file is the definitive master list and you want to capture all customer IDs from all sources.
how='right' keeps all rows from the right DataFrame. Rarely needed — a left merge with the files reversed achieves the same result more readably.
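The effect of each how value can be seen on toy data (hypothetical customer IDs):

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3]})
purchases = pd.DataFrame({'customer_id': [2, 3, 4], 'total': [50, 75, 20]})

inner = customers.merge(purchases, on='customer_id', how='inner')
left = customers.merge(purchases, on='customer_id', how='left')
outer = customers.merge(purchases, on='customer_id', how='outer')

print(len(inner))  # 2 — only IDs 2 and 3 appear in both
print(len(left))   # 3 — all customers; ID 1 has NaN total
print(len(outer))  # 4 — IDs from both sides, including 1 and 4
```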
Example 6: Reading CSV Files From Nested Subfolders
Goal: Files are organised into subfolders by year and month. Read all CSV files recursively regardless of folder depth.
```python
import pandas as pd
import glob

file_pattern = '/sales_data/**/*.csv'
all_files = glob.glob(file_pattern, recursive=True)
print(f"Found {len(all_files)} files across all subfolders")

combined_df = pd.concat(
    [pd.read_csv(f) for f in all_files],
    ignore_index=True
)
```
The ** wildcard in the pattern matches any number of nested subdirectories. recursive=True tells glob to actually descend into subdirectories when it encounters **. Without recursive=True the ** is treated as a literal directory name rather than a recursive wildcard.
Alternative using pathlib for more readable path handling:
```python
from pathlib import Path
import pandas as pd

data_folder = Path('/sales_data')
all_files = list(data_folder.rglob('*.csv'))

combined_df = pd.concat(
    [pd.read_csv(f) for f in all_files],
    ignore_index=True
)
```
Path.rglob('*.csv') recursively finds all CSV files in every subfolder. pathlib is more readable than glob for complex path operations and handles cross-platform path differences (Windows backslashes vs Unix forward slashes) automatically.
Example 7: Handling Encoding Differences Between Files
Goal: Some files in the folder were exported with different character encodings: UTF-8 from modern systems and Latin-1 from legacy systems. Read all of them without UnicodeDecodeError failures.
```python
import pandas as pd
import glob

def read_csv_safe(filepath):
    """Try common encodings in order until one reads the file cleanly."""
    for encoding in ['utf-8', 'latin-1', 'cp1252', 'iso-8859-1']:
        try:
            return pd.read_csv(filepath, encoding=encoding)
        except (UnicodeDecodeError, pd.errors.ParserError):
            continue
    raise ValueError(f"Could not read {filepath} with any known encoding")

dataframes = [read_csv_safe(f) for f in glob.glob('/data/*.csv')]
combined_df = pd.concat(dataframes, ignore_index=True)
```
This function tries encodings in order of likelihood: UTF-8 first since it is the modern standard, then Latin-1 and its variants for older Windows exports. The first encoding that reads the file without error is used. Note that Latin-1 can decode any byte sequence, so it almost never fails; this is why it must come after UTF-8, and why you should still audit string columns afterwards. If no encoding works, the function raises a clear error naming the problematic file rather than silently skipping it.
Merging Large Numbers of Files Efficiently
When combining hundreds or thousands of CSV files, memory and performance become important considerations.
Read files in chunks for very large files:
```python
import pandas as pd
import glob

chunk_size = 50000
chunks = []
for file in glob.glob('/data/*.csv'):
    for chunk in pd.read_csv(file, chunksize=chunk_size):
        chunks.append(chunk)

combined_df = pd.concat(chunks, ignore_index=True)
```
chunksize makes pd.read_csv() return an iterator of DataFrames each containing at most chunk_size rows rather than loading the entire file at once. This keeps memory usage bounded when individual files are very large.
Write directly to output without holding everything in memory:
```python
import pandas as pd
import glob

output_file = '/output/combined.csv'
first_file = True
for file in glob.glob('/data/*.csv'):
    df = pd.read_csv(file)
    df.to_csv(
        output_file,
        mode='a',
        header=first_file,
        index=False
    )
    first_file = False
```
mode='a' appends to the output file rather than overwriting it. header=first_file writes the column headers only for the first file — subsequent files append only their data rows. This approach processes one file at a time and writes directly to disk, using minimal memory regardless of total dataset size.
Concatenation vs Merging — When to Use Each
Both approaches combine CSV files into one DataFrame but they solve different problems and produce structurally different results.
Use pd.concat when:
All files contain the same columns (or similar columns that should stack as rows). Each file represents a different time period, geographic region, or batch of the same entity type. You want more rows: the combined dataset has the same structure as each individual file, just with more data.
Use pd.merge when:
Different files contain different attributes about the same entities. A shared key column (customer ID, product ID, transaction ID) links the files together. You want more columns — the combined dataset is wider than any individual file, showing multiple attributes per entity in one row.
| Scenario | Approach | Result |
|---|---|---|
| 12 monthly sales files, same columns | pd.concat | More rows, same columns |
| Customer demographics + purchase history | pd.merge on customer_id | Same rows, more columns |
| Daily log files, same structure | pd.concat | More rows, same columns |
| Products + prices + inventory | pd.merge on product_id | Same rows, more columns |
| Regional exports, same fields | pd.concat | More rows, same columns |
| Orders + shipping + returns | pd.merge on order_id | Same rows, more columns |
Common Limitations
All Files Must Have Compatible Column Names for Clean Concatenation
pd.concat aligns on column names — not column positions. If January’s file calls a column “Revenue” and February’s calls it “revenue”, pandas treats them as two different columns and fills both with NaN for the rows from the other file. Case sensitivity in column names is the most common source of unexpected NaN columns after concatenation. Always standardise column names before concatenating.
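One way to standardise is a small helper applied to every file before concatenation. A minimal sketch; the strip-and-lowercase convention is an assumption about your naming style:

```python
import pandas as pd

def standardise_columns(df):
    """Strip whitespace and lower-case all column names (assumed convention)."""
    df.columns = df.columns.str.strip().str.lower()
    return df

# 'Revenue' and ' revenue ' become the same column name before concat.
df_jan = standardise_columns(pd.DataFrame({'Revenue': [1000]}))
df_feb = standardise_columns(pd.DataFrame({' revenue ': [1500]}))

combined = pd.concat([df_jan, df_feb], ignore_index=True)
print(combined.columns.tolist())  # ['revenue'] — no spurious NaN columns
```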
Merging on Non Unique Keys Multiplies Rows
If the key column used in a pd.merge is not unique in one or both DataFrames, pandas performs a many-to-many join and multiplies rows. Joining a customer file where each customer appears once against a purchases file where each customer appears multiple times produces one result row per purchase, which is usually correct. Joining two files where both have duplicate key values produces every combination of matching rows — which is almost never what you want. Always check key column uniqueness before merging.
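pandas can also enforce the expected cardinality for you via merge's validate parameter, which raises a MergeError when the keys are not as unique as claimed. A sketch on toy data:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2], 'name': ['Ana', 'Ben']})
purchases = pd.DataFrame({'customer_id': [1, 1, 2], 'total': [10, 20, 30]})

# Check uniqueness explicitly before merging.
assert not customers['customer_id'].duplicated().any()

# validate='one_to_many' raises MergeError if the left keys are not unique.
merged = customers.merge(purchases, on='customer_id',
                         how='left', validate='one_to_many')
print(len(merged))  # 3 — one row per purchase, as expected
```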
Memory Usage Scales With Total Data Size
pd.concat loads all DataFrames into memory simultaneously before combining them. If your files total 10GB of data, pandas needs 10GB or more of RAM to complete the concatenation. For datasets larger than available RAM, use the chunked or append-to-file approaches described earlier, or consider a tool designed for out-of-core processing such as Dask or DuckDB.
Data Type Inconsistencies Across Files
If one file stores a column as integers and another stores the same column as strings (because values like "N/A" forced pandas to read it as object type), pandas will infer different dtypes for the same column from different files. After concatenation that column will be object type throughout. Always inspect dtypes after merging and cast columns to their correct types explicitly.
Encoding Mismatches Cause Silent Data Corruption
Reading a Latin-1 encoded file with UTF-8 encoding does not always raise an error — sometimes it reads successfully but produces corrupted characters for any non-ASCII values. Accented characters in names, currency symbols, and special punctuation are the most common casualties. Always specify encoding explicitly when you know it, and audit string columns for unexpected characters after merging files from mixed sources.
Common Mistakes to Avoid
Not resetting the index after concatenation. Without ignore_index=True, the combined DataFrame preserves the original index from each file. You end up with repeated index values — multiple rows all labelled index 0, multiple rows labelled index 1, and so on. This breaks any operation that assumes unique index values. Always use ignore_index=True when concatenating files.
Assuming glob returns files in alphabetical or chronological order. glob.glob() returns files in the order the operating system provides them, which varies by OS and filesystem. On some systems this is alphabetical; on others it is arbitrary. If order matters (for example, when the first file's header row should define the column order), sort the file list explicitly: sorted(glob.glob(pattern)).
Concatenating files with duplicate header rows. If a file was manually created by copy pasting CSV content, the header row might appear multiple times within the file. After concatenation these appear as data rows containing column names. Check for this after merging: combined_df[combined_df['date'] == 'date'] should return zero rows. If it returns any, the source files contain embedded headers that need to be removed.
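Embedded headers can be detected and dropped after concatenation. A minimal sketch, assuming a date column as in the earlier examples (the simulated frame is made up):

```python
import pandas as pd

# Simulate a combined frame where a header row slipped in as data.
combined_df = pd.DataFrame({
    'date': ['2024-01-05', 'date', '2024-01-12'],
    'revenue': ['1200', 'revenue', '980'],
})

# Rows where the 'date' column literally contains the word 'date'
# are embedded headers; drop them and re-type the real columns.
mask = combined_df['date'] == 'date'
print(mask.sum())  # 1 embedded header found
clean = combined_df[~mask].copy()
clean['revenue'] = clean['revenue'].astype(int)
print(clean.shape)  # (2, 2)
```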
Merging without checking for duplicate key values first. Before any pd.merge call, check key uniqueness: df['customer_id'].duplicated().sum(). If duplicates exist and you did not expect them, investigate before proceeding — a merge on non-unique keys can silently multiply your row count by ten or a hundred times without raising any error.
Not specifying dtypes when reading files with mixed content. Pandas infers column types by sampling the first rows of each file. If early rows contain only integers but later rows contain strings like “N/A”, pandas may infer integer type and then error or coerce when it encounters the string. Specify dtypes explicitly for columns where you know the type: pd.read_csv(file, dtype={'customer_id': str, 'revenue': float}).
Forgetting that pd.concat copies data. pd.concat always creates a new DataFrame — it does not modify any of the input DataFrames in place, and changes made to the inputs afterwards do not propagate to the result. Always assign the result of pd.concat() to a variable and verify the shape of that new combined object.
CSV Merging Cheat Sheet
| Task | Code Pattern |
|---|---|
| Find all CSV files in folder | glob.glob('/folder/*.csv') |
| Find CSV files recursively | glob.glob('/folder/**/*.csv', recursive=True) |
| Read and concatenate all files | pd.concat([pd.read_csv(f) for f in files], ignore_index=True) |
| Add source filename column | df['source'] = os.path.basename(file) |
| Merge two files on shared key | df1.merge(df2, on='id', how='left') |
| Merge three files on shared key | df1.merge(df2, on='id').merge(df3, on='id') |
| Keep all columns from all files | pd.concat(dfs, join='outer') |
| Keep only shared columns | pd.concat(dfs, join='inner') |
| Append to CSV without loading all | df.to_csv(out, mode='a', header=False, index=False) |
| Check row count after merge | combined_df.shape |
| Check for duplicate rows | combined_df.duplicated().sum() |
| Standardise column names | df.columns = df.columns.str.lower().str.strip() |
Merging multiple CSV files with Python pandas turns what would be hours of manual copy pasting into a script that runs in seconds and scales to any number of files without modification.
Here is the simplest summary of everything we covered:
Use glob.glob() to find all matching CSV files in a folder without hardcoding filenames. Use a list comprehension with pd.read_csv() to read every file into a DataFrame. Use pd.concat() with ignore_index=True to stack files with the same column structure into one combined DataFrame. Use pd.merge() to join files that share a key column and contain different attributes about the same entities. Always add a source file column when traceability matters. Always trim and standardise column names before concatenating files from mixed sources. Always verify shape, dtypes, null counts, and duplicate counts immediately after merging. Use chunked reading or append to file writing when total data size exceeds available memory.
Start with two or three CSV files and a basic pd.concat call. Verify the shape matches your expectations. Then add the source file column, handle encoding differences, and expand to the full folder. Once that pattern is solid, combine it with pd.merge for multi file joins and you have a complete toolkit for assembling any CSV based dataset from whatever fragmented form it arrives in.
FAQs
How do I merge multiple CSV files in Python pandas?
Use glob.glob() to find all CSV files, read each with pd.read_csv() in a list comprehension, and combine with pd.concat([pd.read_csv(f) for f in glob.glob('*.csv')], ignore_index=True).
What is the difference between pd.concat and pd.merge in pandas?
pd.concat stacks DataFrames vertically — combining files with the same columns into more rows. pd.merge joins DataFrames horizontally — combining files with different columns about the same entities into more columns, aligned on a shared key.
Why does my combined DataFrame have NaN columns after concatenation?
The most common cause is inconsistent column names across files — “Revenue” in one file and “revenue” in another are treated as different columns. Standardise column names before concatenating using df.columns = df.columns.str.lower().str.strip().
How do I combine CSV files that are in different subfolders?
Use glob.glob('/path/**/*.csv', recursive=True) to find CSV files in all nested subfolders. The ** wildcard matches any number of directory levels and recursive=True enables the traversal.
What should I do if my CSV files have different encodings?
Write a helper function that tries multiple encodings in sequence — UTF-8, Latin-1, cp1252 and uses the first one that reads without error. Always specify encoding explicitly when you know it to avoid silent character corruption.
How do I merge CSV files without running out of memory?
Either read and write one file at a time using df.to_csv(output, mode='a', header=False) to append directly to the output file, or use chunksize in pd.read_csv() to process large files in smaller pieces without loading the entire file into memory at once.