Finding Duplicate Rows in SQL
A common mechanism for defending against duplicate rows in a database table is to put a unique index on the column. However, at times, your data might come from external dirty data sources and your table will have duplicate rows. You'll have to remove duplicate rows in the table before a unique index can be added.
A great way to find duplicate rows is by using window functions – supported by most major databases.
Consider a follow table
dedup with duplicates:
Finding duplicate values in one column
The following query picks the
SELECT email, count(*) FROM dedup GROUP BY email HAVING COUNT(*) > 1;
which are the duplicate emails in the table with their counts. The next step is to number the duplicate rows with the
row_number window function:
SELECT row_number() over ( partition BY email), name, email FROM dedup;
We can then wrap the above query filtering out the rows with
row_number column having a value greater than 1.
SELECT * FROM ( SELECT row_number() over ( partition BY email), name, email FROM dedup ) t WHERE t.row_number < 2;
Notes about the ROW_NUMBER window function
row_number is a standard window function and supports the regular parameters for a window function. We'd like to point out two cases that are of interest:
- In a case where you want to pick a deduplicate row according a different criteria, you can make use of the
ORDERclause inside the window function to order the partition.
- In a case where you want to deduplicate on multiple columns, you can specific those columns are parameters to the
You are reading a series of Advanced SQL tricks.
Join 200+ data analysts who are using our weekly SQL tips to improve their skills. No spam, promise!