Finding duplicate records is a common task in data analysis. Duplicates can lead to incorrect insights, inaccurate reporting, and poor decision-making.
Using SQL, you can easily identify and handle duplicate data.
In this guide, you’ll learn how to find duplicate records in SQL with practical examples.
What Are Duplicate Records?
Duplicate records are rows that have the same values in one or more columns.
For example:
| id | name | |
|---|---|---|
| 1 | John | john@email.com |
| 2 | John | john@email.com |
These rows are duplicates based on name and email.
Basic Method: Using GROUP BY and HAVING
The most common way to find duplicates is by using GROUP BY with HAVING.
Example Table: customers
| id | name |
|---|
Query to Find Duplicates
SELECT name, email, COUNT(*) AS duplicate_count
FROM customers
GROUP BY name, email
HAVING COUNT(*) > 1;
Explanation
GROUP BYgroups similar recordsCOUNT(*)counts occurrencesHAVING COUNT(*) > 1filters duplicates
This query shows all duplicated values.
Finding Duplicate Rows Based on One Column
Example: Duplicate Emails
SELECT email, COUNT(*) AS duplicate_count
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
This helps when checking unique fields like email or phone number.
Finding Full Duplicate Rows
If you want to see all duplicate rows (not just grouped values), use a subquery.
Example
SELECT *
FROM customers
WHERE email IN (
SELECT email
FROM customers
GROUP BY email
HAVING COUNT(*) > 1
);
Use Case
This query returns all rows that are duplicates, not just the grouped summary.
Using ROW_NUMBER() to Identify Duplicates
Window functions provide a more advanced way to detect duplicates.
Example
SELECT *,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS row_num
FROM customers;
Explanation
PARTITION BY emailgroups similar emailsROW_NUMBER()assigns a sequence
Rows with row_num > 1 are duplicates.
Removing Duplicate Records
After identifying duplicates, you may want to remove them.
Example (Using CTE)
WITH duplicates AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS row_num
FROM customers
)
DELETE FROM duplicates
WHERE row_num > 1;
Always back up your data before deleting records.
Common Use Cases
Finding duplicates is useful for:
- Data cleaning
- Preparing datasets for analysis
- Ensuring data quality
- Avoiding double counting
Common Mistakes to Avoid
1. Forgetting GROUP BY
You must group columns correctly when counting duplicates.
2. Using WHERE Instead of HAVING
WHEREfilters rows before groupingHAVINGfilters after grouping
3. Ignoring Multiple Columns
Duplicates may depend on more than one column.
Best Practices
- Always define what “duplicate” means in your dataset
- Use multiple columns when necessary
- Validate results before deleting data
- Keep a backup
Finding duplicate records is a key skill in SQL and data analysis.
By using techniques like GROUP BY, HAVING, and window functions, you can quickly identify and manage duplicates in your dataset.
Mastering this will help you maintain clean and reliable data for analysis.
FAQs
How do I find duplicates in SQL?
Use GROUP BY with HAVING COUNT(*) > 1.
Can I find duplicates using one column?
Yes, group by that specific column.
What is the best way to remove duplicates?
Use ROW_NUMBER() and delete rows where the number is greater than 1.
Why are duplicates a problem?
They can lead to incorrect analysis and reporting.
What is the difference between WHERE and HAVING?
WHERE filters rows before grouping, HAVING filters after grouping.