The Oracle (tm) Users' Co-Operative FAQ

What is the best way to identify and eliminate duplicate rows in a table ?


Author's name: Norman Dunbar;

Author's Email: Oracle@bountifulsolutions.co.uk

Date written: 17/07/2001

Oracle version(s): 8.1.7

I have a table which I know has some duplicate rows in. What is the best way to identify the duplicates and get rid of them ?

Back to index of questions


Identifying duplicate rows

It is possible to determine duplicate rows using the select with a count of all the rows which have the same values in the 'unique' fields as follows :

 
        select a,b,count(*)
        from test
        group by a,b
        having count(*) > 1; 

The above script will produce an output something like the following :

 
                 A          B   COUNT(*)
        ---------- ---------- ----------
                 1          2        259
                 2          2          5

From this we can see that the so-called unique columns a and b in this table are nowhere near unique. We need to get rid of 258 and 4 rows respectively to make the table properly unique. See below for a couple of methods of how to do the deletions.

Adding a constraint

Another method of identifying duplicates, is to try to add a constraint to the table, but we know this will fail because of all the duplicates. All we do is use the exceptions clause of the alter table add constraint command. First, we need to create a table to hold our exceptions. The following is a worked example :

 
        SQL> set echo on
        SQL> --
        SQL> -- Create the exceptions table first, Oracle supply a script
        SQL> -- to do just that. $ORACLE_HOME/rdbms/admin/utlexcpt.sql.
        SQL> --
        SQL> @?/rdbms/admin/utlexcpt
        SQL> create table exceptions(row_id rowid,
          2                          owner varchar2(30),
          3                          table_name varchar2(30),
          4                          constraint varchar2(30));
        Table created.
 
        SQL> --
        SQL> -- Now add the constraint on the 'unique' columns
        SQL> -- putting ALL the exceptions into the exceptions table
        SQL> -- that we just created.
        SQL> --
        SQL> ALTER TABLE test ADD constraint test_uq UNIQUE(a, b)
          2  exceptions INTO exceptions;
        ALTER TABLE test ADD constraint test_uq UNIQUE(a, b)
                                *
        ERROR at line 1:
        ORA-02299: cannot validate (DBADMIN.TEST_UQ) - duplicate keys found
 
        SQL> select count(*) from exceptions;
        
          COUNT(*)
        ----------
               264

Note that the total number of rows in the exceptions table is the sum of all the duplicated rows in the test table. If there are three rows in the test table with the same values in columns a and b, then all three will be in the exceptions table - after all, how does Oracle know which ones are wrong and which are right? Now all we have to do is get rid of the duplicates.

Getting rid of Duplicates

Getting rid of duplicates is always best done using the rowid to access the desired row. This is the quickest access method in Oracle. Of course, you first have to identify which rows you wish to keep!

Selecting which ones to keep

If the rows are the same in all the (important) columns, and you simply wish to get rid of any of the duplicates, leaving only one copy, then you don't have much work to do. Simply :

 
        SQL> --
        SQL> -- Identify the table contents including non-duplicates
        SQL> --
        SQL> select a,b,count(*) from test
          2  group by a,b;
        
                 A          B   COUNT(*)
        ---------- ---------- ----------
                 1          2        259
                 2          2          5
                 3          0          1
 
        SQL> --
        SQL> -- We need to get rid of 258 where A = 1 and B = 2 plus
        SQL> -- 4 where A = 2 and B = 2
        SQL> -- and we do not want to delete any where A = 3 and B = 0
        SQL> -- A total of 262 rows to be deleted.
        SQL> --
        SQL> -- Next (for sanity reasons) find out which rows we will keep
        SQL> --
        SQL> select min(rowid),a,b from test
          2  group by a,b;
 
        MIN(ROWID)                  A          B
        ------------------ ---------- ----------
        AAAAyvAAGAAAABYAAA          1          2
        AAAAyvAAGAAAABYAED          2          2
        AAAAyvAAGAAAABYAEI          3          0
        
        SQL> --
        SQL> -- Looks like three rows will remain after the deletions, so
        SQL> -- Lets do it !
        SQL> --
        SQL> delete from test where rowid not in (
          2  select min(rowid) from test group by a,b);
 
        262 rows deleted.
               
        SQL> --
        SQL> -- Sanity check ....
        SQL> --
        SQL> select rowid,a,b from test;
 
        ROWID                       A          B
        ------------------ ---------- ----------
        AAAAyvAAGAAAABYAAA          1          2
        AAAAyvAAGAAAABYAED          2          2
        AAAAyvAAGAAAABYAEI          3          0      
        
        SQL> --
        SQL> -- Looks identical to expected results, so
        SQL> -- commit to make it permanent
        SQL> --
        SQL> commit;
 
        Commit complete.

I have seen in the news groups a script like the following, to delete the duplicates from a table which is unfortunately flawed in that it gets rid of the duplicates and any row which isn't duplicated :

 
        SQL> delete from test where rowid not in (
          2  select min(rowid) from test group by a,b
          3  having count(*) > 1);
 
        263 rows deleted.
               
        SQL> --
        SQL> -- Sanity check ....
        SQL> --
        SQL> select rowid,a,b from test;
 
        ROWID                       A          B
        ------------------ ---------- ----------
        AAAAyvAAGAAAABYAAA          1          2
        AAAAyvAAGAAAABYAED          2          2

This has deleted the row where A = 3 and B = 0 and is not what is required. Why did this happen, check out the results of the inner select :

 
        SQL> rollback;
        Rollback complete.
 
        SQL> select min(rowid),a,b from test group by a,b
          2  having count(*) > 1
 
        MIN(ROWID)                  A          B
        ------------------ ---------- ----------
        AAAAyvAAGAAAABYAAA          1          2
        AAAAyvAAGAAAABYAED          2          2

The script deletes everything not listed above. Don't put the having clause in the inner select when deleting duplicates!

Another possible way to get rid of duplicates is to use a select distinct command and put the results into a different table. This, however, only works correctly when the rows are unique in all columns as the following example shows, we would still have a duplicate row where A = 1 and B = 2 :

 
        SQL> select distinct * from test;
 
                 A          B          C
        ---------- ---------- ----------
                 1          2          3
                 1          2          4
                 2          2          2
                 3          0          0

If this is suitable, the following is a worked example :

 
        SQL> --
        SQL> -- Create a new table holding only rows which are unique in ALL columns
        SQL> --
        SQL> create table new_test nologging
          2  as (select distinct * from test);
 
        Table created.
 
        SQL> --
        SQL> -- Check the contents of the new table ...
        SQL> --
        SQL> select * from new_test;
 
                 A          B          C
        ---------- ---------- ----------
                 1          2          3
                 1          2          4
                 2          2          2
                 3          0          0
 
        SQL> --
        SQL> -- Looks good, get rid of the old data
        SQL> --
        SQL> truncate table test;
 
        Table truncated.
        
        SQL> --
        SQL> -- Reinsert the non-duplicated data
        SQL> --
        SQL> insert into test (select * from new_test);
 
        4 rows created.
                               
        SQL> --
        SQL> -- Don't forget to commit
        SQL> --
        SQL> commit;
 
        Commit complete.
        
        SQL> -- 
        SQL> -- Drop the temporary table
        SQL> --
        SQL> drop table new_test;
        
        Table dropped.

Keeping specific rows

If there is a specific row that you need to keep out of the duplicates, then I'm afraid that there is no easy way in which you can use Oracle to extract the correct ones - unless you can write a script to identify the rows you want to keep. For this reason, it is best to avoid duplicates and set up unique constraints or primary keys to ensure that duplicates do not occur.

Performance considerations

I regularly have to remove duplicates from a fairly large table (about 200,000 rows) and I have found that the performance drops off when using :

 
        delete from test where rowid not in
        (select min(rowid) from test group by a,b);

This has the following explain plan :

 
        DELETE STATEMENT Optimizer=CHOOSE
          DELETE OF TEST
            FILTER
              TABLE ACCESS (FULL) OF TEST
              FILTER
                SORT (GROUP BY)
                  TABLE ACCESS (FULL) OF TEST

When I replaced it with the following instead, I was able to process at a much faster rate, seconds rather than 40 odd minutes :

 
        delete from test where rowid in 
               (select rowid from test 
                minus 
                select min(rowid) from test group by a,b);                                                             

Which has this explain plan :

 
        DELETE STATEMENT Optimizer=CHOOSE
          DELETE OF TEST
            NESTED LOOPS
              VIEW OF VW_NSO_1
                MINUS
                  SORT (UNIQUE)
                    TABLE ACCESS (FULL) OF TEST
                  SORT (UNIQUE)
                    SORT (GROUP BY)
                      TABLE ACCESS (FULL) OF TEST
              TABLE ACCESS (BY USER ROWID) OF TEST

I suspect the fact that I am deleting most of the rows in the table, and checking for membership of the set rather than non-membership is speeding the deletions up - there is more chance of finding a rowid in the set than of not finding one. (Please feel free to comment !)


Further reading:

All of the archives of the Oracle news groups on Google.

Author

Title/URL

Suggested by

Referee's comments

Howard Rogers

Detecting Duplicates

Jonathan Lewis

One of several “discussion documents” on the Dizwell website that take the time and space to examine Oracle features properly.

 

 

 

 

 


Back to top

Back to index of questions