Duplicate records can cause serious problems in data analysis—leading to incorrect reports, double counting, and misleading insights.
Using SQL, you can efficiently identify and remove duplicate rows from your dataset.
In this guide, you’ll learn different ways to remove duplicates in SQL, from beginner to advanced methods.
What Are Duplicate Records?
Duplicate records are rows that contain the same values in one or more columns.
Example:
| id | |
|---|---|
| 1 | a@email.com |
| 2 | a@email.com |
Here, the email column has duplicate values.
Method 1: Using DISTINCT (Simple Approach)
If you just want unique records (without deleting data), use DISTINCT.
SELECT DISTINCT email
FROM customers;
When to Use
- When you only need unique results
- When you don’t want to modify the table
Method 2: Using GROUP BY
Another way to get unique values:
SELECT email
FROM customers
GROUP BY email;
This works similarly to DISTINCT
Method 3: Remove Duplicates Using ROW_NUMBER() (Best Method)
This is the most reliable method for deleting duplicates.
Step 1: Identify Duplicates
SELECT *,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS row_num
FROM customers;
Step 2: Delete Duplicates
WITH duplicates AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS row_num
FROM customers
)
DELETE FROM duplicates
WHERE row_num > 1;
Explanation
PARTITION BY emailgroups duplicate recordsROW_NUMBER()assigns a unique number- Rows with
row_num > 1are duplicates
Method 4: Using DELETE with Subquery
DELETE FROM customers
WHERE id NOT IN (
SELECT MIN(id)
FROM customers
GROUP BY email
);
Explanation
- Keeps the first occurrence
- Deletes the rest
Choosing the Right Method
Use DISTINCT When:
- You only need to view unique data
- You don’t want to delete records
Use ROW_NUMBER() When:
- You want to safely remove duplicates
- You need control over which row to keep
Use Subquery When:
- You want a simpler delete approach
Important Tips Before Deleting Data
1. Always Backup Your Data
Deleting duplicates is permanent.
2. Define What “Duplicate” Means
It could be based on:
- One column (email)
- Multiple columns (name + email)
3. Test Your Query First
Run a SELECT version before DELETE.
Common Mistakes to Avoid
- Deleting without checking results
- Using wrong columns in PARTITION BY
- Not handling NULL values
- Removing important records accidentally
Real-World Use Cases
Removing duplicates is important for:
- Customer databases
- Sales records
- Financial data
- Data warehouse cleaning
Clean data leads to accurate insights.
Removing duplicates in SQL is a critical skill for data analysts.
While methods like DISTINCT are useful for viewing clean data, using ROW_NUMBER() gives you full control when deleting duplicates.
Always proceed carefully, validate your results, and keep backups to avoid data loss.
FAQs
What is the best way to remove duplicates in SQL?
Using ROW_NUMBER() with a CTE is the safest and most flexible method.
What does DISTINCT do?
It returns unique values without deleting data.
Can I remove duplicates based on multiple columns?
Yes, use multiple columns in PARTITION BY.
Is deleting duplicates risky?
Yes. Always back up your data before deleting.
What is the difference between DISTINCT and GROUP BY?
Both return unique values, but GROUP BY is used with aggregations.