# JL Computer Consultancy

## Chapter 10: Join Cardinality

Ch.10 Code Depot (29th Nov 2005)

Join Cardinality Cracked  (29th Nov 2005)

More on Join Cardinality  (12th Dec 2005)

Typos (1st Dec 2009)

### Join Cardinality Cracked (29th Nov 2005)

Page 271: Congratulations and thanks are due to Alberto Dell’Era, who read the section on “Biased Joins”, and  decided to do some further testing with script join_card_01a.sql. The extra tests consisted largely of fixing the number of rows in one table, varying the number of rows in the other table across a great range of values, and checking the effects the variation had on the resulting 10053 trace files.

The critical change in the trace files appeared in lines like the following, where the number of rows in the second table affected the sel (selectivity) used to derive the join cardinality:

Join cardinality:   81081 = outer (100) * inner (30000) * sel (2.7027e-002)

Having identified the critical point of change, Alberto did some arithmetical manipulation, spotted a pattern, derived a formula that generated the pattern and produced a rational for that formula – at which point he sent me an email to explain what he had done. Then he sent me another email with a URL referencing a document about a standard formula of probability theory, pointing out that the two formulae (his derivation, and the standard formula) were equivalent – at which point I was struck by a flash of 20-20 hind-sight (or the ‘doh’ moment as Tom Kyte sometimes calls it) - sometimes things are so straightforward after someone has explained it all.

The concept is simple.  Think of a bag holding 50 balls, each numbered uniquely. Select a ball from the bag, note down the number, and put the ball back into th bag. Repeat 100 times. At the end of the experiment, how many different numbers will you have selected? Obviously you can’t say – at one very unlikely extreme you might have picked the same ball 100 times, at the other extreme you might have picked all 50 of them at least once each.

Repeat the experiment a few thousand times and note how many different numbers you get each time you select 100 balls. Is it possible to say anything about thousands of different results? The answer is yes. Surprising though it may seem, there is a standard formula for the average (or expected) number of distinct values that you will get by picking a ball 100 times when there are 50 possible numbers on the balls. It will be:  50 * (1 – (49/50)100).    (In linear notation:  50 * (1 – (49/50) ^ 100), where ^ means “to the power of”.) The answer is 43.36.

How does this apply to Oracle and join cardinality? Consider the following where clause:

where

t1.joincol = t2.joincol       -- assume that t1.joincol has 50 distinct values, and t2.joincol has 80.

and t1.filtercol = 99             -- identifies 100 rows in t1 (say)

and t2.filtercol = 99             -- identifies 150 rows in t2 (say)

On page 303 I gave you the “Alternative Viewpoint” of a join, saying you could consider the two filter predicates first, then do a Cartesian join of the resulting two data sets, and then look at the join condition as two possible predicates on the Cartesian join, using the selectivity of the more precise condition to generate the final cardinality. So the where clause above requires you to find the selectivities for:

A)  t1.joincol = {constant}

B)  {constant} = t2.joincol

Looking at just the first of these predicates, what is the selectivity of t1.joincol = {constant}.  We know that the answer is (traditionally) 1/num_distinct. But just how many distinct values does t1.joincol have? As a baseline,it started with 50, but we’ve used a filter predicate to pick just 100 of those rows. So what’s the probable number of distinct values for t1.joincol in the 100 rows we’ve picked?

Number of balls in bag = user_tab_columns.num_distinct = 50

Number of times we pick = filtered cardinality of table = 100

Probable number of distinct values picked = probable number of distinct values for t1.joincol = 50 * (1 – (49/50)100 ) = 43.36

Oracle takes the ceiling of this value (i.e. 44) and uses that as the “adjusted num_distinct in its standard formula.

Join Selectivity =

((num_rows(t1) - num_nulls(t1.joincol)) / num_rows(t1)) *

((num_rows(t2) - num_nulls(t2.joincol)) / num_rows(t2)) /

greater(num_distinct(t1.joincol), num_distinct(t2.joincol))

The equivalent value for t2.joincol would be:  ceiling(80 * (1 – (79/80)150 ) = 68

The general formula for calculating the expected number of distinct values when you have N balls (num_distinct) and pick n times (filtered cardinality) is:

N * ( 1 – (1 –  1/N )n )             which can also be written as               N * (1 – ((N - 1 ) / N) n )

I used the second form of the formula above in my worked example, but the first form appears in the reference that Alberto sent to me. It is interesting to note, then, that if you look at the term ((N - 1 ) / N) n – for large values of n, and even for fairly small values of n if N is also small, the term is likely to be vanishingly small. Which means the whole expression gets close to N * (1 – 0), which is the base number of distinct values for the column. So the formula published by Metalink will be correct – or very close – in many general cases, especially with large data sets.

If you apply this formula to the example in section “Difficult Bits”, the data generated by script join_card_10.sql, you will find that this goes a long way to explaining the massive difference between the two cardinalities that it gives in Oracle 9i. However, it does not produce the exact cardinalities expected.

There are still some rules that have to be uncovered for situations like ‘extreme differences between the number of distinct values in the join columns’, and ‘what to do if combined selectivities are less than 1/total rows in table”; not to mention how you apply the formula in the presence of the ‘concatenated index sanity check’.

However, Alberto has given us a massive step forward in working out many of the strange numbers that appear as join cardinalities, and earned a place in my personal hall of fame alongside Wolfgang Breitling). For future reference, I shall be referring to this result as the Alberto Dell’Era Deduction.

Note: I have added join_card_new_01.sql to the code depot for this chapter, and an updated version of join_card_10.sql.

### More on Join Cardinality (14th Dec 2005)

Following on from the previous notes on Join Cardinality – Alberto Dell’Era has been pursuing the problem of accuracy. Knowledge often progresses by steps, and a first approximation often introduces the annoying error that leads to a better understanding and approximation.

In his latest email to me, Alberto has pointed out that there is an even more appropriate formula from the world of probability theory that can be applied to database selectivity, and this is the formula relating to selection without replacement. (The previous section was using a formula for selection with replacement).

I hope to go into this in more detail later, but put simply, the formula says:

Let nr be the number of rows in the table

Let nd be the number of distinct values in a join column

Let s be the number of rows selected from the table by predicates on non-join columns

Then the average number of distinct values (i.e. the number we should put into num_distinct in the Metalink formula) is:

nd * ( 1 – ((nr – s) / nr) nr / nd )              which can also be written as               nd * ( 1 – (1 – s / nr) nr / nd )

Although this formula does not seem to be very similar to the earlier formula, it is known to give results which are very similar to the earlier formula for ‘small’ sample sizes. This, presumably, is why the previous formula gave fairly consistent answers in the test cases I described in my earlier update.

Alberto has also demonstrated the suitability of this formula for queries of the type: select distinct where …

Back to Book Index

## Errata

### Typos (1st Dec 2009)

Page 265, second formula in Basic Join Cardinality: Card(Pj) = card(t1) * card(t1) * sel(Pj) should be Card(Pj) = card(t1) * card(t2) * sel(Pj).  Thanks to Kun Sun for reporting this.

Page 270, first paragraph after code extract: “I have 10,000 rows in each table” should be “I have 1,000 rows in each table” – but you probably spotted that one.

Page 271, three paragraphs above heading Join Cardinality for Real SQL: “Is it a coincidence that 1,000/30 = 333 ?” should be “Is it a coincidence that 10 * 1,000 / 30 = 333 ?”.  Thanks to Kun Sun for reporting this.

Page 275, three lines above last execution plan: “to get a selectivity of 1/400 (0.025%)” should be “to get a selectivity of 1/400 (0.25%)”. Thanks to Kun Sun for reporting this.

Page 292, script to create table t1, and following two paragraphs: if you check the script in the code depot, this SQL is the definition for table t3; table t1 is the one created with 100 rows using mod(rownum,10).  Thanks to Kun Sun for reporting this.

Page 305, First complete paragraph: I state that the multi-column sanity check and index sanity check were introduced in 10g. However, the index sanity check – based on the existence of a multi-column unique index – has been around since at least 8.1.7.4.

Back to Book Index