**Ch.10 Code Depot (29th Nov 2005)**

**Join Cardinality Cracked (29th Nov 2005)**

**More on Join Cardinality (12th Dec 2005)**

**Typos (1st Dec 2009)**

Page 271: Congratulations and thanks are due to Alberto Dell’Era, who read the section on **“Biased Joins”** and decided to do some further testing.

The critical change in the trace files appeared in lines like the following, where the number of rows in the second table affected the **sel** (selectivity) figure:
Join cardinality: 81081 = outer (100) * inner (30000) * sel (2.7027e-002)

Having identified the critical point of change, Alberto did some arithmetical manipulation, spotted a pattern, derived a formula that generated the pattern, and produced a rationale for that formula – at which point he sent me an email to explain what he had done. Then he sent me another email with a URL referencing a document about a standard formula of probability theory, pointing out that the two formulae (his derivation and the standard formula) were equivalent – at which point I was struck by a flash of 20-20 hindsight (or the ‘doh’ moment, as Tom Kyte sometimes calls it) – sometimes things are so straightforward after someone has explained it all.

The concept is simple. Think of a bag holding 50 balls, each numbered uniquely. Select a ball from the bag, note down the number, and **put the ball back into the bag**. Repeat 100 times. At the end of the experiment, how many different numbers will you have selected? Obviously you can’t say – at one very unlikely extreme you might have picked the same ball 100 times; at the other extreme you might have picked all 50 of them at least once each.

Repeat the experiment a few thousand times and note how many different numbers you get each time you select 100 balls. Is it possible to say anything about thousands of different results? The answer is yes. Surprising though it may seem, there is a standard formula for the average (or expected) number of distinct values that you will get by picking a ball 100 times when there are 50 possible numbers on the balls. It is 50 * (1 – (49/50)^100), where ^ means “to the power of”. The answer is 43.36.
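The formula is easy to check numerically. Here is a short sketch (my own illustration, not from the text) that evaluates the closed form and then runs the balls-in-a-bag experiment many times to confirm the average:

```python
import random

N_BALLS = 50      # distinct numbers in the bag
N_PICKS = 100     # selections, with replacement
TRIALS = 20_000   # repetitions of the whole experiment

# Closed-form expected number of distinct values seen:
expected = N_BALLS * (1 - ((N_BALLS - 1) / N_BALLS) ** N_PICKS)
print(f"formula:   {expected:.2f}")   # close to the 43.36 quoted in the text

# Monte Carlo check: run the experiment many times and average
# the number of distinct balls seen per experiment.
random.seed(42)
total = 0
for _ in range(TRIALS):
    seen = {random.randrange(N_BALLS) for _ in range(N_PICKS)}
    total += len(seen)
print(f"simulated: {total / TRIALS:.2f}")
```

The simulated average converges on the closed-form value as the number of trials grows.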

How does this apply to Oracle and join cardinality? Consider the following *where clause*:

where

t1.joincol = t2.joincol -- assume that t1.joincol has 50 distinct values, and t2.joincol has 80.

and t1.filtercol = 99 -- identifies 100 rows in t1 (say)

and t2.filtercol = 99 -- identifies 150 rows in t2 (say)

On page 303 I gave you the **“Alternative Viewpoint”** of a join, saying you could consider the two filter predicates first, then do a Cartesian join of the resulting two data sets, and then look at the join condition as two possible predicates on the Cartesian join, using the selectivity of the more precise condition to generate the final cardinality. So the join condition *t1.joincol = t2.joincol* can be viewed as the two predicates:

A) t1.joincol = {constant}

B) {constant} = t2.joincol

Looking at just the first of these predicates, what is the selectivity of *t1.joincol = {constant}*? We know that the answer is (traditionally) **1/num_distinct** – but just how many distinct values does t1.joincol have in the 100 rows that survive the filter predicate? This is exactly the balls-in-a-bag problem:

* Number of balls in bag = user_tab_columns.num_distinct = 50
* Number of times we pick = filtered cardinality of table = 100
* Probable number of distinct values picked = probable number of distinct values for t1.joincol = 50 * (1 – (49/50)^100) = 43.36

Oracle takes the ceiling of this value (i.e. 44) and uses that as the **“adjusted num_distinct”** in its standard formula.

Join Selectivity =

((num_rows(t1) - num_nulls(t1.joincol)) / num_rows(t1)) *

((num_rows(t2) - num_nulls(t2.joincol)) / num_rows(t2)) /

greater(num_distinct(t1.joincol), num_distinct(t2.joincol))

The equivalent value for t2.joincol would be: ceiling(80 * (1 – (79/80)^150)) = ceiling(67.87) = 68.
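Putting the pieces together for the example where clause, here is a sketch in Python (the row and distinct-value counts are the ones used above; the assumption of no nulls in either join column – which makes the num_nulls terms of the standard formula drop out – is mine):

```python
import math

def adjusted_num_distinct(num_distinct, filtered_rows):
    """Expected distinct values after filtering: N * (1 - ((N-1)/N)^n), rounded up."""
    n_d = num_distinct * (1 - ((num_distinct - 1) / num_distinct) ** filtered_rows)
    return math.ceil(n_d)

nd_t1 = adjusted_num_distinct(50, 100)    # t1.joincol: 50 distinct, 100 filtered rows
nd_t2 = adjusted_num_distinct(80, 150)    # t2.joincol: 80 distinct, 150 filtered rows

# Standard join selectivity with no nulls: 1 / greater(num_distinct values)
join_sel = 1 / max(nd_t1, nd_t2)

# Join cardinality = filtered card(t1) * filtered card(t2) * join selectivity
join_card = 100 * 150 * join_sel
print(nd_t1, nd_t2, join_card)
```

With these assumptions the adjusted values come out as 44 and 68, and the estimated join cardinality as 15000/68, roughly 221 rows.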

The general formula for calculating the *expected* number of distinct values when you have N possible values and pick (with replacement) n times is:

N * (1 – (1 – 1/N)^n), which can also be written as N * (1 – ((N – 1)/N)^n)

I used the second form of the formula above in my worked example, but the first form appears in the reference that Alberto sent to me. It is interesting to note, then, that if you look at the term *((N – 1)/N)^n* – for large values of N it is approximately e^(–n/N), the familiar exponential limit.
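For large N, the term ((N – 1)/N)^n is well approximated by e^(–n/N). A quick numerical check (my own illustration, not from the text):

```python
import math

# Compare the exact term with the exponential approximation
# for a few (N, n) pairs, including those from the worked example.
for N, n in [(50, 100), (80, 150), (1000, 500)]:
    exact = ((N - 1) / N) ** n
    approx = math.exp(-n / N)
    print(f"N={N:5d} n={n:4d}  exact={exact:.6f}  exp(-n/N)={approx:.6f}")
```

Even at N = 50 the two values agree to within a few parts in a thousand, and the agreement improves as N grows.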

For further reading, see: http://www.ds.unifi.it/VL/VL_EN/urn/urn8.html

If you apply this formula to the example in the section **“Difficult Bits”**, the data generated by the script

There are still some rules that have to be uncovered for situations like ‘extreme differences between the number of distinct values in the join columns’, and ‘what to do if combined selectivities are less than 1/total rows in table’; not to mention how you apply the formula in the presence of the *‘concatenated index sanity check’*.

However, Alberto has given us a massive step forward in working out many of the strange numbers that appear as join cardinalities, and has earned a place in my personal hall of fame alongside Wolfgang Breitling. For future reference, I shall be referring to this result as the Alberto Dell’Era Deduction.

Note: I have added *join_card_new_01.sql*
to the code depot for this chapter, and an updated version of *join_card_10.sql*.

Following on from the previous notes on Join Cardinality – Alberto Dell’Era has been pursuing the problem of accuracy. Knowledge often progresses by steps, and a first approximation often introduces the annoying error that leads to a better understanding and approximation.

In his latest email to me, Alberto has pointed out that there is an even more appropriate formula from the world of probability theory that can be applied to database selectivity: the formula relating to **selection without replacement**. (The previous section was using a formula for selection *with* replacement.)

I hope to go into this in more detail later, but put simply, the formula says:

Let **nr** be the number of rows in the table

Let **nd** be the number of distinct values in the join column

Let **s** be the number of rows selected from the table by the filter predicates

Then the average number of distinct values (i.e. the number we should put into num_distinct in the Metalink formula) is:

nd * (1 – ((nr – s)/nr)^(nr/nd)), which can also be written as nd * (1 – (1 – s/nr)^(nr/nd))
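To see how close the two formulas are in practice, here is a small Python comparison (my own illustration; the table size nr = 1000 is an assumed figure for the sake of the example, not one taken from the text):

```python
def with_replacement(nd, s):
    # Earlier formula: nd * (1 - ((nd - 1)/nd)^s)
    return nd * (1 - ((nd - 1) / nd) ** s)

def without_replacement(nr, nd, s):
    # Newer formula: nd * (1 - (1 - s/nr)^(nr/nd))
    return nd * (1 - (1 - s / nr) ** (nr / nd))

# t1 from the worked example: 50 distinct values, 100 filtered rows;
# nr = 1000 is an assumed table size for illustration.
print(with_replacement(50, 100))           # about 43.4
print(without_replacement(1000, 50, 100))  # about 43.9
```

With a sample of 100 rows out of an assumed 1,000, the two formulas differ by only about half a row out of roughly 44 – consistent with the observation that they give very similar results for small sample sizes.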

Although this formula does not look much like the earlier one, it is known to give very similar results for ‘small’ sample sizes. This, presumably, is why the previous formula gave fairly consistent answers in the test cases I described in my earlier update.

Alberto has also demonstrated the suitability of this formula for queries of
the type: *select distinct where …*

Page 265, second formula in Basic Join Cardinality: *Card(Pj) = card(t1) * card(t1) * sel(Pj)* should be *Card(Pj) = card(t1) * card(t2) * sel(Pj)*. Thanks to Kun Sun for reporting this.

Page 270, first paragraph after code extract: “I have 10,000 rows in each table” should be “I have 1,000 rows in each table” – but you probably spotted that one.

Page 271, three paragraphs above heading Join Cardinality for Real SQL: “Is it a coincidence that 1,000/30 = 333 ?” should be “Is it a coincidence that 10 * 1,000 / 30 = 333 ?”. Thanks to Kun Sun for reporting this.

Page 275, three lines above last execution plan: “to get a selectivity of 1/400 (0.025%)” should be “to get a selectivity of 1/400 (0.25%)”. Thanks to Kun Sun for reporting this.

Page 292, script to create table t1, and following two paragraphs: if you check the script in the code depot, this SQL is the definition for table t3; table t1 is the one created with 100 rows using mod(rownum,10). Thanks to Kun Sun for reporting this.

Page 305, First complete paragraph: I state that the multi-column sanity check and index sanity check were introduced in 10g. However, the index sanity check – based on the existence of a multi-column unique index – has been around since at least 8.1.7.4.