Statistics and who to believe.
In a recent article in the magazine of the Northern California Oracle User Group, Iggy Fernandez explained that there are many dangerous beliefs centered around the theme that application developers have no responsibility for the performance of the SQL statements they write. One of his examples was:
Dangerous Belief #5: The most common cause of poorly performing SQL is the failure of the DBA to collect statistical information on the distribution of data for the use of the query optimizer [1]. This statistical information should be refreshed frequently [2].
Broadly speaking, I agree with Iggy’s assessment – there is a fairly common perception that if you run procedures like dbms_stats.gather_schema_stats() frequently, with a high estimate percentage and lots of histograms, most of your SQL performance problems will go away; and that is a “dangerous belief”. On the other hand, if you ensure that the statistics stored in the database are (a) acquired cost-effectively and (b) describe the data reasonably accurately, the optimizer will be able to work around many of the problems caused by bad design and bad SQL. (This means I would like to sit and argue with Iggy about the best way to express exactly what the belief is, and to clarify exactly which bits of it are dangerous and why – but on the whole I understand the intent of what he is saying.)
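To make the belief concrete, the kind of call it encourages might look something like the following sketch (the schema name is hypothetical, and the parameter choices are deliberately aggressive to match the “gather everything, frequently” mindset):

```sql
begin
    dbms_stats.gather_schema_stats(
        ownname          => 'APP_OWNER',                 -- hypothetical schema
        estimate_percent => 100,                         -- a "high percentage estimate"
        method_opt       => 'for all columns size 254',  -- histograms on every column
        cascade          => true                         -- gather index statistics too
    );
end;
/
```

Run weekly, a call like this does keep statistics fresh – but, as the rest of this note suggests, freshness alone is neither cheap nor a guarantee of stable, sensible execution plans.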
Notwithstanding exact wording and degree of consensus, this note is not really about the topic of statistics; it’s about the irony of the two footnotes attached to this one belief. Footnote 1 (citing an example of someone who believes that the most common problem is failure to collect statistics) says:
Consider, for example, the following statement found in an article published in a recent issue of the journal of the IOUG: “One of the greatest problems with the Oracle cost-based optimizer was the failure of the Oracle DBA to gather accurate schema statistics … The issue of stale statistics and the requirement for manual analysis resulted in a ‘bum rap’ for Oracle’s cost-based optimizer, and beginner DBAs often falsely accused the CBO of failing to generate optimal execution plans when the real cause of the sub-optimal execution plan was the DBA’s failure to collect complete schema statistics.”
On the other hand, footnote 2 says:
The following statement by Donald Burleson puts the finger on the dangers of collecting fresh statistical information for the use of the query optimizer: “It astonishes me how many shops prohibit any un-approved production changes and yet re-analyze schema stats weekly. Evidently they do not understand that the purpose of schema re-analysis is to change the production SQL execution plans, and they act surprised when performance changes!”
Despite agreeing with the general thrust of Iggy’s argument, I was quite amused by the choice of quotations.
In passing, the basic premise declared by Donald Burleson in footnote 2 is wrong, but that wasn’t the cause of my amusement. Think carefully about the content and use of a typical database: as time passes the data changes, and the data being targeted by end-user queries changes. A query of the form “get me last week’s sales information” will return a different result set on 1st August from the one it returned on 1st January – but if the last time you updated the statistics was 1st July, the optimizer will infer that there is no data for the week ending 1st August and may therefore do something incredibly stupid – perhaps changing from an appropriate table scan to an unsuitable index access, or from a sensible index to a bad index. Sometimes you change the statistics to make the plans stay the same. If you don’t realise that, you’ve missed an important point about how the optimizer uses statistics, and you are unlikely to have an optimal stats-generation strategy.
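A hedged sketch of the scenario (the table and column names are hypothetical): statistics on the sales table were last gathered on 1st July, so the optimizer’s recorded high value for sale_date is a date in early July.

```sql
-- "Last week's sales", run on 1st August.
-- The predicate range lies entirely above the column's recorded high value,
-- so the optimizer's cardinality estimate collapses towards zero (or is
-- scaled down sharply as the range moves beyond the known values).
select product_id, sum(quantity) total_qty, sum(sale_value) total_value
from   sales
where  sale_date >= trunc(sysdate) - 7
group  by product_id;
```

An estimate of “almost no rows” is exactly the kind of input that can flip a plan from a sensible full scan to an unsuitable index access, even though nothing about the query or the real data volume has changed – which is why refreshing the statistics can be what keeps the plan the same.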
The thing that amused me, however, was this: I checked the link given for the IOUG article and found that it was one called “Inside Oracle Database 10g Dynamic Sampling” from the Quarter 1 2006 issue, with the byline of Alexey B. Danchenkov. When I started reading it, though, it looked familiar, so I ran a quick internet search (a simple Google search on “bum rap” dynamic sampling) and found it at http://www.dbazine.com/olc/olc-articles/burleson30. It was an article published by Don Burleson about two and a half years ago. (This isn’t the first time that the name Alexey B. Danchenkov has appeared in Select on an article previously claimed by Don Burleson – I’ve sent an email pointing out the apparent error to the editorial address.)
It’s a little hard to avoid laughing when the same person is identified (albeit unwittingly) as both the sinner and the saint in the space of two sentences. So which Don Burleson should we trust – the Don Burleson who says the DBA is at fault because he doesn’t gather accurate schema statistics, or the Don Burleson who says the shop is at fault because it does gather accurate schema statistics?