SQL Query to Find Duplicate Records (With Examples)

SQL Query to Find Duplicate Records (With Examples)

Finding duplicate records is a common task in data analysis. Duplicates can lead to incorrect insights, inaccurate reporting, and poor decision-making.

Using SQL, you can easily identify and handle duplicate data.

In this guide, you’ll learn how to find duplicate records in SQL with practical examples.

What Are Duplicate Records?

Duplicate records are rows that have the same values in one or more columns.

For example:

idnameemail
1Johnjohn@email.com
2Johnjohn@email.com

These rows are duplicates based on name and email.

Basic Method: Using GROUP BY and HAVING

The most common way to find duplicates is by using GROUP BY with HAVING.

Example Table: customers

idnameemail

Query to Find Duplicates

SELECT name, email, COUNT(*) AS duplicate_count
FROM customers
GROUP BY name, email
HAVING COUNT(*) > 1;

Explanation

  • GROUP BY groups similar records
  • COUNT(*) counts occurrences
  • HAVING COUNT(*) > 1 filters duplicates

This query shows all duplicated values.

Finding Duplicate Rows Based on One Column

Example: Duplicate Emails

SELECT email, COUNT(*) AS duplicate_count
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

This helps when checking unique fields like email or phone number.

Finding Full Duplicate Rows

If you want to see all duplicate rows (not just grouped values), use a subquery.

Example

SELECT *
FROM customers
WHERE email IN (
SELECT email
FROM customers
GROUP BY email
HAVING COUNT(*) > 1
);

Use Case

This query returns all rows that are duplicates, not just the grouped summary.

Using ROW_NUMBER() to Identify Duplicates

Window functions provide a more advanced way to detect duplicates.

Example

SELECT *,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS row_num
FROM customers;

Explanation

  • PARTITION BY email groups similar emails
  • ROW_NUMBER() assigns a sequence

Rows with row_num > 1 are duplicates.

Removing Duplicate Records

After identifying duplicates, you may want to remove them.

Example (Using CTE)

WITH duplicates AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS row_num
FROM customers
)
DELETE FROM duplicates
WHERE row_num > 1;

Always back up your data before deleting records.

Common Use Cases

Finding duplicates is useful for:

  • Data cleaning
  • Preparing datasets for analysis
  • Ensuring data quality
  • Avoiding double counting

Common Mistakes to Avoid

1. Forgetting GROUP BY

You must group columns correctly when counting duplicates.

2. Using WHERE Instead of HAVING

  • WHERE filters rows before grouping
  • HAVING filters after grouping

3. Ignoring Multiple Columns

Duplicates may depend on more than one column.

Best Practices

  • Always define what “duplicate” means in your dataset
  • Use multiple columns when necessary
  • Validate results before deleting data
  • Keep a backup

Finding duplicate records is a key skill in SQL and data analysis.

By using techniques like GROUP BY, HAVING, and window functions, you can quickly identify and manage duplicates in your dataset.

Mastering this will help you maintain clean and reliable data for analysis.

FAQs

How do I find duplicates in SQL?

Use GROUP BY with HAVING COUNT(*) > 1.

Can I find duplicates using one column?

Yes, group by that specific column.

What is the best way to remove duplicates?

Use ROW_NUMBER() and delete rows where the number is greater than 1.

Why are duplicates a problem?

They can lead to incorrect analysis and reporting.

What is the difference between WHERE and HAVING?

WHERE filters rows before grouping, HAVING filters after grouping.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top