What’s the cost of a unique constraint?
Here’s a recent question from comp.databases.oracle.server:
If I have a table which has 1 billion rows (data in each row is relatively small), and declare one column (not the primary key column) as unique, will this cause big problem for inserting? Assume the inserting rate is 500/sec and the possible duplication is 0.001%. Unfortunately in my current dev environment, I can only have about 50M rows. We are still in design phase. I just want to see whether using "unique constraint" to keep data integrity for large database table with high insert rate is a good design idea or not . Our initial test with 30M does not show any problem. But I would like to know if anyone with a larger database has ever experienced some performance problems using this approach. "Big" means more than 100% performance hit, e.g. from 10ms (without unique constraint) to 20ms (with unique constraint). Duplicate data is not allowed. If detected, the insert should fail. If the "unique constrain" approach does not work, then the caller has to ensure the uniqueness before calling the database.
It’s the type of question that can lead to all sorts of interesting discussion points, not necessarily directly related to the original question. I toyed briefly with giving some sort of answer, but then realized the first thing I’d have to do is ask a load of questions, so I thought I’d chat around the topic in a short article.
The first thought, of course, is one that someone else on the newsgroup raised – does it matter what it costs to enable the constraint, what would it cost if you had duplicate data in place ? If you’ve got to have uniqueness, you’ve got to have uniqueness.
This then leads to a discussion of how uniqueness could be enforced from the code if the constraint was not in place. Possibly the poster is really only asking about the difference in cost between a unique index and a unique constraint; possibly the poster has a coding strategy that will somehow ensure that the values coming into the table are unique – but such coding strategies are almost always doomed to fail, especially given Oracle’s read-consistency mechanisms. There are two main problems:
First – how do you determine that a value to be inserted does not already exist – presumably you have to check to see if it exists, and in the absence of a suitable index, such checks will be very expensive. But possibly the inserts all come from a single process which somehow keeps track of the most recent value inserted, and knows something about the data that ensures that the next value cannot already exist in the database. (The obvious example of this would be some sequential numbering or time-stamp mechanism).
Second – and more significant in general for the “code it in the front end” – if you have concurrent processes doing the inserting, then any check carried out by one process will not see rows inserted and not yet committed by another process, so you could still end up attempting to insert duplicates, even when you have checked for prior existence.
So let’s assume the question is really about the difference between a unique index and a unique constraint. One simple investigation is to create a test and run with event 10046 enabled at level 4 so that we can see if there are any obvious operational differences when duplicate data is inserted into the table. Something like:
drop table t1;
create table t1 (n1 number);
-- Option 1 – create a unique index
create unique index t1_uk on t1(n1);
-- Option 2 – create a unique constraint
-- alter table t1 add constraint t1_uk unique(n1);
insert into t1 values(1);
insert into t1 values(1);
ERROR at line 1:
ORA-00001: unique constraint (TEST_USER.T1_UK) violated
Note that it doesn’t matter which of the options we choose –creating a unique index or creating a unique constraint – the error message (in 184.108.40.206 at least) is the same. However, if we look at the content of the trace file, we find the following:
Where we have created the unique constraint, there is one extra recursive SQL statement like the following – which is Oracle doing a lookup to find the name of the constraint that had been broken by the duplicate key:
con$ c, cdef$ cd, user$ u
where c.con# = cd.con#
and cd.enabled = :1
and c.owner# = u.user#
But if you create a unique index, rather than a unique constraint, you get two recursive SQL statements, the one above where Oracle checks to find the name of the constraint that matches the object ID of the unique index, then the one below that tries to find the name of the unique index.
obj$ o, user$ u
where o.obj# = :1
and o.owner# = u.user#
This seems to suggest that it is better to use a unique constraint, rather than a unique index – because the penalty for a duplicate entry is more expensive.
Of course, there are other factors to consider – when you hit the duplicate key, Oracle has to rollback the statement that inserted the duplicate. Even if the statement is a single-row insert, there will be a genuine insert and a genuine rollback – both of which generate redo. So if performance is critical, you need to consider the total cost of the error. For example:
Are you doing single row, small array, or large array inserts. Single row is more expensive on undo and redo in general, but large array could be very expensive each time you hit a duplicate, because the entire array (up to the error) has been inserted and has to be rolled back.
Can you squeeze out some benefit by arranging the order in which the indexes were created. If you create the PK index, a couple of non-unique indexes, and then the UK index in that order, an insert will modify the table, the PK index, the non-unique indexes, and then spot the duplicate on the UK – requiring 4 rollback steps. If the unique key is the only thing that might cause duplication errors, then perhaps you should create the unique index as the first index.
Then you get on to problems of concurrency – and where contention may occur; the possibly benefits in a highly concurrent system of using reverse key indexes, or rigging the unique key with an extra process-related prefix, and so on. And if you manage to randomize the insertions across a large index (instead of having a couple of hot index blocks) what effect might this have on the number of index blocks you have to buffer to avoid excessive physical reads as you hunt for the index leaf block that an index entry is supposed to go into.
As I said at the start – the question is an interesting one. When you start playing around with fairly large volumes of data then there are many different, competing, aspects of performance that you may have to consider very carefully.