Why Your SQL Query Is Returning Duplicate Rows

You write what looks like a perfectly reasonable query. You run it. The result set comes back with 4,000 rows when you were expecting 400. You scroll through and notice the same customer appearing five times, the same order ID repeating across three rows, the same transaction showing up in different combinations. Nothing in the query looks obviously wrong, but the output is clearly broken.

Duplicate rows in SQL are one of the most common frustrations for analysts at every level. They break aggregate calculations, corrupt dashboard numbers, and make reports look unprofessional. And the part that makes them genuinely tricky is that SQL is usually not doing anything wrong. It is giving you exactly what you asked for. The problem is that what you asked for and what you actually wanted are two different things.

This guide walks through every major reason SQL queries return duplicate rows, how to diagnose which one is affecting your query, and the right fix for each cause.

What “Duplicate Rows” Actually Means in SQL

Before jumping into causes, it helps to separate two different types of duplicates because they have different fixes.

True duplicates are rows where every single column contains the same value as another row. The data itself has been entered or loaded twice. The fix for true duplicates lives in the data, not the query.

Logical duplicates are rows that repeat a business entity more than you expected but are technically different rows. The same customer appearing five times because they placed five orders is not a bug. It is the correct output of a one-to-many relationship. The fix for logical duplicates lives in the query logic.

Most analysts who complain about duplicate rows are dealing with logical duplicates. Knowing which type you have is the first step.

Cause 1: A JOIN on a Non-Unique Key

This is the most common cause of unexpected duplicate rows and it catches people who fully understand JOINs just as often as it catches beginners.

A JOIN works by matching every row in the left table to every row in the right table where the join condition is true. If the join key is unique on both sides, you get one match per row. If the join key is not unique on one or both sides, you get every matching combination, which multiplies your rows.

Here is what this looks like:

SELECT o.order_id, c.customer_name, o.order_date
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

If the customers table has three rows with the same customer_id because of a data issue or versioning logic, every single order for that customer now appears three times in the result. The query is correct. The assumption that customer_id is unique in the customers table was wrong.

How to diagnose it: Before writing a JOIN, check whether the key you are joining on is actually unique in each table.

SELECT customer_id, COUNT(*) AS occurrences
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;

If this query returns any rows, your join key has duplicates on that side and your JOIN will multiply rows accordingly.
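The right fix depends on why the key repeats, but a common pattern is to collapse the duplicated side to one row per key before joining. Here is a sketch using ROW_NUMBER, assuming the customers table has a created_at column that identifies the most recent version of each record:

SELECT o.order_id, c.customer_name, o.order_date
FROM orders o
JOIN (
    SELECT customer_id, customer_name,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
    FROM customers
) c ON o.customer_id = c.customer_id
WHERE c.rn = 1;

Now each order matches exactly one customer row, no matter how many versions of that customer the table holds.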

Cause 2: A Many-to-Many Relationship Without Aggregation

Some tables have legitimate one-to-many or many-to-many relationships. One customer can have many orders. One order can have many products. One product can belong to many categories. When you join across these relationships without collapsing the many side first, the result set expands by design.

Say you want total revenue per customer and you join an orders table to an order_items table to get the line item amounts:

SELECT c.customer_name, SUM(oi.amount) AS total_revenue
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY c.customer_name;

This works correctly because GROUP BY collapses the expansion back into one row per customer. But if you remove the GROUP BY and the SUM and just SELECT the raw columns, you get one row per line item per order per customer. Not a bug. Just the natural output of joining across one-to-many relationships without aggregation.

The fix is to either aggregate with GROUP BY or to pre-aggregate one of the tables in a subquery before joining:

SELECT c.customer_name, order_totals.total_revenue
FROM customers c
JOIN (
    SELECT o.customer_id, SUM(oi.amount) AS total_revenue
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    GROUP BY o.customer_id
) order_totals ON c.customer_id = order_totals.customer_id;

Pre-aggregating before the join is often cleaner and easier to debug than letting the explosion happen and trying to collapse it afterward.

Cause 3: True Duplicate Rows in the Source Table

Sometimes the data itself is the problem. Rows get loaded twice from an ETL pipeline that ran twice. A CSV import appended instead of replaced. A form was submitted twice in the same session. The result is a table that contains physically identical rows or near-identical rows that differ only in a timestamp or an auto-incremented ID.

Finding true duplicates is straightforward:

SELECT customer_id, email, created_at, COUNT(*) AS occurrences
FROM customers
GROUP BY customer_id, email, created_at
HAVING COUNT(*) > 1;

If you just need to remove them from a query result without touching the underlying table, DISTINCT works:

SELECT DISTINCT customer_id, email, created_at
FROM customers;

But DISTINCT is a band-aid on a wound that needs stitches. It hides the problem without fixing it. If duplicates are appearing in the source table regularly, the right fix is upstream in the pipeline that loads the data. Use a UNIQUE constraint on the columns that should never repeat, or add deduplication logic to the ETL process before data lands in the table.
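The upstream fix can be sketched in two steps: purge the duplicates that already exist, then add the constraint so new ones are rejected. The exact DELETE syntax varies by database (this CTE form works in SQL Server, for example; PostgreSQL needs a ctid-based variant), and the column list is illustrative:

-- Step 1: remove existing duplicates, keeping the earliest row per group
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_id, email ORDER BY created_at) AS rn
    FROM customers
)
DELETE FROM ranked
WHERE rn > 1;

-- Step 2: prevent new duplicates from landing
ALTER TABLE customers
ADD CONSTRAINT uq_customers_id_email UNIQUE (customer_id, email);

With the constraint in place, a pipeline that tries to load the same row twice fails loudly instead of silently corrupting the table.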

Cause 4: Missing or Wrong JOIN Condition

A JOIN with an incomplete condition or an accidental cross join multiplies every row in one table by every row in the other. If your left table has 200 rows and your right table has 300 rows, a cross join returns 60,000 rows.

This happens more often than it should because of a small syntax mistake:

SELECT o.order_id, p.product_name
FROM orders o, products p;

That old-style comma syntax with no WHERE clause is a cross join. Every order gets matched with every product: 500 orders and 200 products produce 100,000 rows. Explicit JOIN syntax makes this harder to do accidentally, because an inner JOIN without an ON clause is a syntax error in most databases, but comma-style joins still turn up regularly in older codebases.

Always verify your JOIN condition is complete and that it references the right columns in both tables:

SELECT o.order_id, p.product_name
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id;

Cause 5: Window Functions Without DISTINCT or Filtering

Window functions like ROW_NUMBER, RANK, and DENSE_RANK add a column to every existing row rather than collapsing rows the way GROUP BY does. This is by design, but it means your result set keeps every row and simply labels each one. If you use ROW_NUMBER to deduplicate and forget to filter on the result, you get all the original rows plus their row numbers.

SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
    FROM customers
) ranked;

That query returns every row with a row number attached. To get only the most recent record per customer you need the WHERE clause:

SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
    FROM customers
) ranked
WHERE rn = 1;

Without WHERE rn = 1, the window function does nothing to reduce duplicates. It just labels them.

How to Diagnose Duplicates Systematically

When you see unexpected row counts, run through this checklist before changing anything in the query.

Check the row count of each table you are joining and compare to the result count. If the result has more rows than the largest table, a JOIN is multiplying rows.

Check whether the join key is unique in each table using COUNT and HAVING COUNT(*) > 1 on the key column.

Run each table independently and check for duplicates before adding JOINs. Isolating the source of the problem is faster than debugging a five-table JOIN all at once.

Add COUNT(*) to the SELECT and compare it to COUNT(DISTINCT key_column). If they are different, you have duplicates on that key.

SELECT COUNT(*) AS total_rows,
       COUNT(DISTINCT customer_id) AS unique_customers
FROM customers;

If total_rows and unique_customers are not equal, customer_id is not unique in that table.
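A practical way to isolate the exploding join in a multi-table query is to add one table at a time and watch the count. A sketch using the example tables from earlier:

SELECT COUNT(*) FROM orders;
-- baseline: one row per order

SELECT COUNT(*)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
-- should still equal the baseline; if it grows, customers has duplicate keys

SELECT COUNT(*)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id;
-- expected to grow here: one row per line item is the natural grain

The first join whose count jumps unexpectedly is where the duplication enters.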

Common Mistakes to Avoid

Reaching for DISTINCT first without diagnosing the cause. DISTINCT removes rows from the output but does not fix the underlying data or logic problem. Using it to paper over a bad JOIN means your row count looks right but your aggregate numbers are often still wrong because the multiplication happened before DISTINCT removed rows.

Assuming a column is a unique identifier without checking. Natural keys like email addresses, phone numbers, and even order IDs can have duplicates in practice. Always verify uniqueness before joining on a column, especially in tables you did not build yourself.

Forgetting that UNION ALL preserves duplicates while UNION removes them. If you are combining two result sets with UNION ALL and both queries return the same rows for overlapping data, you get duplicates. Use UNION when the two sets might overlap and you only want distinct rows in the final result.
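For example, combining two overlapping lists of customer IDs (table names here are illustrative):

SELECT customer_id FROM newsletter_subscribers
UNION ALL
SELECT customer_id FROM webinar_attendees;
-- a customer on both lists appears twice

SELECT customer_id FROM newsletter_subscribers
UNION
SELECT customer_id FROM webinar_attendees;
-- each customer appears once

Keep in mind that UNION does extra work to deduplicate, so prefer UNION ALL when you know the two sets cannot overlap.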

Joining on NULL values. NULL does not equal NULL in SQL: a comparison between two NULLs evaluates to unknown, so an equality join never matches rows where the key is NULL on either side, regardless of database. The usual symptom is missing rows rather than duplicates, but it surprises people in the same way, and a join column full of NULLs is often a sign you are joining on the wrong column. If you genuinely need NULLs to match each other, you have to say so explicitly with a NULL-safe comparison.
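To see the difference, here is a sketch with an illustrative region_code column that allows NULL. Plain equality silently drops the NULL rows; matching NULLs to each other requires an explicit choice:

-- Rows where region_code IS NULL never match with plain equality
SELECT a.account_id, r.region_name
FROM accounts a
JOIN regions r ON a.region_code = r.region_code;

-- PostgreSQL and some other databases support a NULL-safe comparison directly
SELECT a.account_id, r.region_name
FROM accounts a
JOIN regions r ON a.region_code IS NOT DISTINCT FROM r.region_code;

The COALESCE workaround, joining on COALESCE(a.region_code, '') = COALESCE(r.region_code, ''), is more portable, but only safe if the placeholder value can never appear as real data.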

SQL Duplicate Cheat Sheet

Cause | How to Spot It | Fix
Non-unique join key | Key appears multiple times in source table | Pre-aggregate or filter before joining
Many-to-many relationship | Row count exceeds largest table | Add GROUP BY or pre-aggregate one side
True duplicates in source | Identical rows in COUNT / HAVING query | DISTINCT in query or UNIQUE constraint on table
Cross join or missing ON clause | Row count = left rows times right rows | Add complete ON condition to JOIN
Window function without filter | All rows returned with row number column | Add WHERE rn = 1 after window function
UNION ALL on overlapping sets | Same rows appear twice in result | Replace UNION ALL with UNION

Duplicate rows in SQL almost always have a clear, fixable cause. The query is rarely broken. The assumption you brought to the query about how the data is structured usually is. Checking whether your join keys are unique before you write the JOIN, and understanding the grain of every table you are working with, removes the vast majority of duplicate problems before they happen.

When duplicates do appear in the output, diagnose before fixing. A targeted fix that addresses the actual cause produces a query that is correct and trustworthy. Wrapping everything in DISTINCT produces a query that looks correct but hides problems that will show up somewhere else downstream.

FAQs

Why is my SQL query returning duplicate rows?

The most common reasons are joining on a non-unique key, combining tables that have a one-to-many or many-to-many relationship without aggregating, or having true duplicate rows in the source data. Check whether your join key columns are unique in each table before diagnosing further.

How do I remove duplicate rows in SQL?

Use SELECT DISTINCT to remove exact duplicate rows from the query output. To keep only one row per group based on a specific column, use ROW_NUMBER() OVER (PARTITION BY key_column ORDER BY tiebreaker_column) in a subquery and filter WHERE rn = 1. For true duplicates in the source table, fix the upstream data load process and add a UNIQUE constraint.

What is the difference between DISTINCT and GROUP BY for removing duplicates?

DISTINCT removes rows where all selected columns are identical. GROUP BY collapses rows into groups and lets you apply aggregate functions like SUM and COUNT to each group. Use DISTINCT when you want unique rows with no aggregation. Use GROUP BY when you want one row per group with calculated aggregate values.

Why does my SQL JOIN return more rows than the original table?

A JOIN returns more rows than the original table when the join key is not unique in one of the tables. If five rows in the right table share the same key value, each matching row in the left table produces five rows in the result. Check for key uniqueness using GROUP BY on the join column with HAVING COUNT(*) > 1.

Should I always use DISTINCT to fix SQL duplicates?

No. DISTINCT is a last resort, not a first response. It removes rows from the output without fixing the underlying cause. If a bad JOIN is multiplying rows, DISTINCT removes the visible duplicates but your aggregate calculations are still wrong because the multiplication happened before DISTINCT ran. Always diagnose the cause first and fix the query logic or data before reaching for DISTINCT.
