Finding Duplicate Rows in SQL

A common mechanism for defending against duplicate rows in a database table is to put a unique index on the column. However, at times, your data might come from external dirty data sources and your table will have duplicate rows. You'll have to remove duplicate rows in the table before a unique index can be added.

A great way to find duplicate rows is by using window functions – supported by most major databases.


Consider a follow table dedup with duplicates:

name email
mike mike@example.com
mike k mike@example.com
sean sean@example.com
sean sean@example.com
taylor taylor@example.com

Finding duplicate values in one column

The following query picks the email column to deduplicate,

SELECT email,
    count(*)
FROM dedup
GROUP BY email
HAVING COUNT(*) > 1;

returns,

email count
mike@example.com 2
sean@example.com 2

which are the duplicate emails in the table with their counts. The next step is to number the duplicate rows with the row_number window function:

SELECT row_number() over ( partition BY email),
    name,
    email
FROM dedup;
row_number name email
1 mike mike@example.com
2 mike k mike@example.com
1 sean sean@example.com
2 sean sean@example.com
1 taylor taylor@example.com

We can then wrap the above query filtering out the rows with row_number column having a value greater than 1.

SELECT * FROM (
    SELECT row_number() over ( partition BY email),
        name,
        email
    FROM dedup ) t 
WHERE t.row_number < 2;
row_number name email
1 mike mike@example.com
1 sean sean@example.com
1 taylor taylor@example.com

Notes about the ROW_NUMBER window function

The row_number is a standard window function and supports the regular parameters for a window function. We'd like to point out two cases that are of interest:

  • In a case where you want to pick a deduplicate row according a different criteria, you can make use of the ORDER clause inside the window function to order the partition.
  • In a case where you want to deduplicate on multiple columns, you can specific those columns are parameters to the partition clause.