It’s not often that you visit a site where the DBA team has the facilities and procedures for testing things properly. But when you do get to visit such a site, the results can be very rewarding: for a start, the on-site DBAs can actually judge the quality of your work, test your suggestions properly, and give you some very precise feedback.
I was at such a site quite recently. I had visited them about three months previously to see if I could give them a couple of new ideas for addressing some performance issues, but I’d only been able to drop in for half a day. The site ran a very big web-based operation with a fairly high degree of (real) concurrent activity. The problem was that when the website came under heavy load (which happened on a fairly regular basis) end-user response times would rise sharply and throughput would drop. Based on the information I was getting from some of my ‘snapshot’ packages, and the description I had been given of the purpose and usage of the system, I made a few suggestions during that short visit, of which the two most significant related to using single table hash clusters and the dbms_lock package.
So three months later I was back on site to see how things were going, and to be available for any emergencies that might arise during a particularly heavy peak period they were expecting. (I can’t tell you who they are or what they do, but their predictions about the peak period were correct: at one point they recorded one million hits on their website in five minutes.)
Whilst I was there, they told me how they had tested my suggestions. They had recoded for the single table hash cluster and set up a few test runs on their emulation system to see what would happen. They got a 10% increase in throughput, which was a little disappointing (to me as well as to them; I had expected it to be rather better than that). Then they had taken out the single table hash cluster and done some test runs with code that included my suggestions about the dbms_lock package. Again they got a 10% increase in throughput: better than nothing, but equally disappointing.
Still, two lots of 10% is better than nothing, so they ran a series of tests with both fixes in place, and the throughput doubled.
It is often said that when you manage to eliminate a bottleneck, all you do is make the next bottleneck visible.
It is not often that you get such a wonderfully clear demonstration of this effect. It didn’t matter which bottleneck they took out of the system: when a process was able to get through one stage of the work more quickly, it just hit the other bottleneck sooner and stuck there longer, because more processes were now arriving at that bottleneck concurrently.
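The arithmetic of this effect is easy to reproduce with a crude queueing sketch. The following is entirely my own toy model, not the site’s data: treat the two fixes as two serialized stages in a pipeline, where under saturation throughput is limited by the slowest stage. With two near-equal bottlenecks, halving either one alone buys very little, while halving both roughly doubles throughput.

```python
# Toy pipeline model: under heavy load, a chain of serialized stages is
# limited by its slowest stage. Stage times below are invented for
# illustration only.
def pipeline_throughput(stage_times):
    # Transactions per unit time when every stage is saturated.
    return 1.0 / max(stage_times)

base     = pipeline_throughput([1.00, 0.95])  # both bottlenecks present
fix_a    = pipeline_throughput([0.50, 0.95])  # first fix alone
fix_b    = pipeline_throughput([1.00, 0.50])  # second fix alone
fix_both = pipeline_throughput([0.50, 0.50])  # both fixes together

# base and fix_b are ~1.0, fix_a is only ~1.05, but fix_both is 2.0
print(base, fix_a, fix_b, fix_both)
```

The exact percentages don’t matter; the point is the shape of the result: with two bottlenecks of similar size, each fix on its own looks like a marginal gain, and only removing both reveals the real capacity of the system.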
I’ve received an email asking me how the improvement was measured. This is always a good question to ask when someone claims to have produced a performance benefit. The answer in this case is simple. The developers were good at what they did; their programs were instrumented, so they had code that went:
I am about to start a database dialogue: what’s the time?
Call the database procedure.
I have just received a completion signal from the database: what’s the time, and how long did the job take?
So they had a complete log of which transactions were going on, the actual execution time of each transaction, and (by collating and summarizing the logs) the number of transactions per minute.
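The pattern above can be sketched in a few lines. This is a minimal illustration in Python (the site’s code was not Python, and `call_database_procedure` is a hypothetical stand-in for their real database call): timestamp before the call, timestamp after it, and append a record to a log that can later be collated into per-transaction times and throughput.

```python
# Minimal sketch of the instrumentation pattern: record the time before and
# after every database call, and keep a log that can be summarized later.
import time

log = []  # in a real system this would be a persistent log file

def call_database_procedure(txn_name):
    # Hypothetical placeholder for the actual database call.
    time.sleep(0.001)

def timed_call(txn_name):
    start = time.time()             # "about to start a database dialogue"
    call_database_procedure(txn_name)
    elapsed = time.time() - start   # "completion signal received"
    log.append((txn_name, start, elapsed))
    return elapsed

for _ in range(5):
    timed_call("order_entry")

# Collating the log gives per-transaction times and overall throughput.
total = sum(elapsed for (_, _, elapsed) in log)
print(f"{len(log)} transactions, avg {total / len(log):.4f}s each")
```

The value of this kind of instrumentation is exactly what the developers demonstrated here: when someone claims a performance improvement, the logs let you verify it from the application’s own point of view rather than relying on database-side statistics alone.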