JL Computer Consultancy

When a script isn’t a proof

Apr 2005


In a recent article, I argued (against Don Burleson’s claim) that it was possible to prove points about Oracle performance by using simple scripts. This does not mean, of course, that every simple script claiming to demonstrate some aspect of performance is a sound proof – or even a sensible demonstration. However, a script that attempts to prove something about performance can still be very instructive, as we shall see.

This note is about an attempt to demonstrate that the performance of data loads is improved by using a large block size; it is an extract from an article about ‘hypercharging your data loads’. The article in question starts with comments about systems that needed to load huge volumes of data – such as 500,000 rows per second – into a single table, and includes the claim that data loads into a larger block size will run far faster. It introduces the following code with the remark:

Here is a small benchmark showing the performance of loads into a larger blocksize.

alter system set db_2k_cache_size=64m scope=spfile;

alter system set db_16k_cache_size=64m scope=spfile;

 

startup force

 

create tablespace twok blocksize 2k; <-- using ASM defaults to 100m

create tablespace sixteenk blocksize 16k;

 

create table load2k tablespace twok as select * from dba_objects; < creates 8k rows

drop table load2k; <- first create was to preload buffers

 

set timing on;

 

create table load2k tablespace twok as select * from dba_objects;

create table load16k tablespace sixteenk as select * from dba_objects;

 

For a larger sample, I re-issued the create processes with:

 

select * from dba_source; -- (80k rows)

 

Even with this super-tiny sample on Linux using Oracle10g with ASM the results were impressive:

 

                    2k blksze     16k blksze
8k table size       4.33 secs     4.16 secs
80k table size      8.74 secs     8.31 secs

Three things struck me immediately.

- 8,000 rows in 4.xx seconds is very slow, not impressive, and certainly nowhere near the 500,000 rows per second that the author started with (and the 80,000 rows in 8.xx seconds is still out by a factor of 50), so when the author reports a 5% improvement it doesn’t fill me with enthusiasm.

- The author seems to think that a demonstration of a technique that writes 80,000 rows might be a realistic demonstration of the performance characteristics you will see when the volume of data being written to disc grows by a few orders of magnitude (three orders of magnitude would still only be 80 million rows, or 2 minutes and 40 seconds by the standards set in his introduction). Oracle delays writing as much as possible, and most platforms have buffering and asynchronous I/O mechanisms that cope quite happily with the first few megabytes before the load has any impact.

- The amount of information missing from the demonstration is significant. There are several questions that need to be addressed before any meaning could be attached to the results.

So let’s think carefully about the flaws in this example.

Data loading doesn’t usually mean ‘create table as select’ (CTAS) – it usually means getting the data into the database from a source outside the database. So is a demonstration of CTAS really valid for anything other than demonstrating the special optimizations of CTAS? Well, yes – because quite often people will load the data into scratch tables, and then do a CTAS to massage it into shape.
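To make the point concrete, here is a hypothetical sketch of that load-then-reshape pattern (the table names are invented for illustration, and dba_objects simply stands in for data that has just arrived from outside the database):

rem Hypothetical illustration only: scratch_load stands in for externally
rem loaded data; the CTAS then massages it into its final shape.

create table scratch_load as
select * from dba_objects;

create table final_load as
select  owner, object_name, object_type, created
from    scratch_load
where   object_type is not null;

drop table scratch_load;
drop table final_load;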

Is the database in archivelog mode, or not? This makes a big difference to performance as CTAS is not normally logged if the database is running in noarchivelog mode and that’s a big (treble) saving in I/O as the redo isn’t written, and the archiver isn’t reading and writing as the test runs. Of course, it is possible that the database is running in noarchivelog mode but has been set to the force_logging state – in which case CTAS will be logged but the archiver won’t be reading and writing. For reasons of comparative testing we need to know the setup. In passing, a data load that takes place without being logged is a safety threat – if the data loaded is ‘real’ rather than some form of scratch data waiting for post-processing, the first thing you should do after a nologging load is to take a backup of the relevant tablespaces (and since this is a CTAS, it is presumably the step that generates the real data) – so for a fair benchmark, we should add the backup time to the load time if we are not logging.
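None of this appears in the original demonstration. As a minimal sketch of the sort of thing that should have been reported, a couple of queries (run on a 9.2 or later system) record the relevant settings – the tablespace names here are the ones used in the quoted test:

rem Record the logging configuration before running any timed test.

select log_mode, force_logging from v$database;

select tablespace_name, logging
from   dba_tablespaces
where  tablespace_name in ('TWOK','SIXTEENK');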

Let’s assume the test was done in logging mode with archiving enabled – you’ll see why we have to make this assumption in a minute – and ask where the time went. CTAS uses direct path writes – the process doing the work formats blocks in local memory, and writes them to the data files – the blocks don’t go through the buffer cache. (The 2K cache and 16K cache don’t need to be 64MB if you are short of memory – when I ran just the CTAS statements Oracle ended up with just 3 blocks from each tablespace in the caches: two file mapping blocks, and one segment header block). Consequently you don’t really have to worry about I/O interference from other processes using the caches. Your main worry is how fast the I/O subsystem might be and what happens with the log buffer and log files. So we need to know the size of the log buffer, the size and number of the log files, and the number of files per redo log group for this test. And while we are on the topic of I/O, Oracle can perform direct path writes asynchronously – so do we see time lost on direct path writes, or async disk IO, and do we see CPU starvation (outside Oracle perhaps) caused by polling for async I/O completion?
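Again, the original article doesn’t tell us any of this; a simple sketch of the sort of queries that would supply the missing configuration details:

rem Capture the redo and I/O configuration that direct path loads depend on.

select name, value
from   v$parameter
where  name in ('log_buffer', 'filesystemio_options', 'disk_asynch_io');

select group#, members, bytes/1048576 mb
from   v$log;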

Once we know the database configuration we then need to know how much time in the tests was spent for waits on direct path write, log buffer space, log file syncs (echoed by log file parallel writes from lgwr, perhaps) and any of the log file switch events. This might, of course, show us that the slower tests happened to coincide with a log file switch, and the differences in timing had nothing to do with the block sizes.
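The figures are easy enough to pick up; for example, a query like the following, run from the session doing the load immediately after the CTAS, shows where the time went (time_waited is reported in centiseconds in this version of Oracle):

rem Check the session's accumulated wait times immediately after the CTAS.

select  event, total_waits, time_waited
from    v$session_event
where   sid = (select sid from v$mystat where rownum = 1)
and     event in (
            'direct path write',
            'log buffer space',
            'log file sync',
            'log file switch completion'
        )
order by time_waited desc;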

Finally, of course, we just have to sit down and try to emulate the test. When I did this on my Oracle 9i system – which has 3 x 100MB redo log files with only one file per log group, a log buffer of 512KB, archive logging enabled, and 31,600 rows (average length 93 bytes) in dba_objects – my test results were as follows:

2K block size

Statistic                        Value
------------------------------------------------
Elapsed time to completion       0.04 seconds
Redo size                        3,991,300 bytes
Physical writes direct           1,870
Log buffer space (wait)          0.0267 seconds
Log file sync (wait)             0.00267 seconds
Direct path write                0.00125 seconds
Control file sequential read     0.0208 seconds

16K block size

Statistic                        Value
------------------------------------------------
Elapsed time to completion       0.04 seconds
Redo size                        3,628,028 bytes
Physical writes direct           213
Log buffer space (wait)          0.0240 seconds
Log file sync (wait)             0.00099 seconds
Direct path write                0.00253 seconds
Control file sequential read     0.0304 seconds

You will notice that my basic time to completion was 100 times faster than the original, on a data set that was four times larger. Clearly there is something wrong with the original results. Somewhere, something is losing a hideous amount of time. Could it possibly be that 10g (or maybe 10g with ASM as reported in the article) is more than forty times slower than my 9i on a laptop?

Of course, looking at the test case, one of the oddities that stands out is that the test with eighty thousand rows takes only twice the time of the test with eight thousand rows (and rows from dba_source are typically longer than rows from dba_objects) – doesn’t that suggest (at least) that there is a massive startup cost involved somewhere? And if there is a massive startup cost, shouldn’t the original author have done something to track it down and factor it out – and tell us about it?
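One simple (hypothetical) way of checking for such a startup cost would have been to repeat the same statement and compare the timings; most one-off costs appear only in the first run:

rem Run the same CTAS twice: dictionary loading, datafile extension and
rem similar one-off costs show up only in the first timing.

set timing on

create table startup_check_1 as select * from dba_objects;
create table startup_check_2 as select * from dba_objects;

drop table startup_check_1;
drop table startup_check_2;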

Just for completeness, here are some results from repeating the test using dba_source (which had 174,000 rows, average length 352 bytes, in my 9i system).

2K block size

Statistic                        Value
------------------------------------------------
Elapsed time to completion       6.05 seconds
Redo size                        78,856,644 bytes
Physical writes direct           37,138


16K block size

Statistic                        Value
------------------------------------------------
Elapsed time to completion       8.09 seconds
Redo size                        81,799,716 bytes
Physical writes direct           4,962

All reported wait times were insignificant, the highest totalling 0.08 seconds for log file sync on the 2K block test. And yes, the test with the 16K block size was really much slower than the test with the 2K block size – but that was an oddity relating to pctfree and chained rows, and the anomaly didn’t appear when I re-ran the tests on 10g because 10g compresses wrapped sources and therefore has a much shorter typical row length. (For future reference, you want to make your test data sets look roughly like your real data sets – dba_source is one of the more unusual data sets you can find on an Oracle system).
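The clue, in my case, came straight out of the data dictionary. Note that chain_cnt is populated by analyze, not by dbms_stats – a sketch, using the table names from my test script (shown later):

rem chain_cnt reports the chained/migrated rows that, together with pctfree,
rem account for the anomaly described above.

analyze table load_02k compute statistics;
analyze table load_16k compute statistics;

select  table_name, blocks, avg_row_len, chain_cnt
from    user_tables
where   table_name like 'LOAD%K';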

You will notice, of course, that the table created in the 16K block tablespace is much larger than the table created in the 2K block tablespace (if you multiply the number of 16K blocks by 8, it should be similar to the number of 2K blocks: 4,962 * 8 = 39,696, so the table in the 2K tablespace seems to be about 2,500 blocks smaller than you might expect), and the redo generated by the 2K load is correspondingly smaller. Strange, that, because some people think that building things in bigger block sizes is more efficient and will allow tablescans to be much quicker – sometimes things just aren’t the way you expect them to be, and you have to think very carefully about your data.

I said earlier on that we ought to make the assumption that this test was done with logging and archive logging enabled; now you know why - I needed to have them enabled to make the tests run slowly enough to see what was happening.

In fact, I was able to get closer to the author’s original test results (allowing for the larger volume of data in my test case) by replacing my 100MB redo log files with 10MB redo log files – at which point the results for the dba_source test changed to: 2K blocks => 31.06 seconds (14.5 seconds recorded wait time), 16K blocks => 39.05 seconds (21.3 seconds recorded wait time). In this test, most of the reported extra time was due to log file switch completion and consequent log buffer waits – which rather suggests that getting your logging strategy right is rather more important than fiddling with your block sizes. (There was also some lost time that simply didn’t get reported – this is one of the side effects of Oracle’s implementation of asynchronous calls when doing direct path writes; the trace files for this test showed some very odd results when I tried to reconcile tim= timestamps with intervening ela= times for direct path write waits.)
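If you want a quick, approximate check on how much log switching went on during a test like this, v$log_history is enough – a sketch:

rem Count the log file switches in (say) the last half hour; with 10MB log
rem files the switch rate during the load goes up dramatically.

select  count(*) switches
from    v$log_history
where   first_time > sysdate - 1/48;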


My test case:

The following SQL is the basis for my tests. There are some explanatory comments in-line. The snap_my_stats and snap_events packages are simple pl/sql packages to take snapshots of v$mystat and v$session_event (for the current session).
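For anyone who doesn’t have those packages, the following is a minimal sketch of the kind of query they wrap: take the figures before and after the test and report the difference (see Snapshot.html for the real thing).

rem A bare-bones substitute for snap_my_stats: report the session's
rem non-zero statistics; run it before and after the test and compare
rem the results by hand.

select  n.name, s.value
from    v$mystat s, v$statname n
where   s.statistic# = n.statistic#
and     s.value != 0
order by n.name;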

/*

 

rem

rem I already had 128M for db_cache size – which I needed for the

rem test against dba_source to minimise sequential read costs.

rem

rem You may need to reduce some other cache to make it possible to

rem increase these caches online.

rem

rem This section of code is a one-off preparation, which is why it

rem has been commented out. Note that the preparation code has to

rem be run by a highly privileged account, and you need to adjust

rem file names for the tablespaces. (And make sure that the table-

rem spaces don’t already exist).

rem

rem The calls to the snap_xxx packages are my way of

rem checking on v$mystat and v$session_event to pick

rem up the work done and time lost by my session.

rem See Snapshot.html for further details.

rem

 

alter system set db_2k_cache_size=8m scope=spfile;

alter system set db_16k_cache_size=8m scope=spfile;

 

startup force

 

create tablespace ts_02k

datafile 'c:\oracle\oradata\d920\ts_02k.dbf'

size 129M

blocksize 2K

extent management local

uniform size 1M

;

 

create tablespace ts_16k

datafile 'c:\oracle\oradata\d920\ts_16k.dbf'

size 129M

blocksize 16K

extent management local

uniform size 1M

;

 

*/

 

rem

rem setenv.sql is just a standardizing list of ‘set’ commands

rem like ‘set linesize 120’, column XX format YYY, etc.

 

start setenv

set timing on

 

drop table load_02k;

drop table load_16k;

 

rem

rem Make the test fair – start at the beginning of a log file each time

rem Wait a few seconds for any dbwr activity to clear

rem

 

alter system switch logfile;

execute dbms_lock.sleep(10);

 

rem

rem The /* nologging */ is there to make it easy to

rem switch between logging and nologging tests with

rem a single global edit

rem

 

create table load_02k /* nologging */ tablespace ts_02k as

select * from dba_source;

 

alter system switch logfile;

execute dbms_lock.sleep(10);

 

create table load_16k /* nologging */ tablespace ts_16k as

select * from dba_source;

 

rem

rem Just check how much of the buffers we have used.

rem If you hadn’t realized it yet, I was connected

rem as SYS to run this test

rem

 

select

set_ds, blsiz, count(*)

from x$bh

where dbarfil != 0

and state != 0

group by

set_ds, blsiz

;

 

rem

rem Don’t really need to do this bit, but when I got

rem the very bad time on the 16K test, I wanted to

rem get the table statistics so I could check sizes

rem

 

begin

dbms_stats.gather_table_stats(

user,

'load_02k',

cascade => true,

estimate_percent => null,

method_opt => 'for all columns size 1'

);

end;

/

 

begin

dbms_stats.gather_table_stats(

user,

'load_16k',

cascade => true,

estimate_percent => null,

method_opt => 'for all columns size 1'

);

end;

/

 

Select

table_name, blocks, avg_row_len

from

user_tables

where

table_name like 'LOAD%K'

;

 

/*

 

-- dba_objects created 8,000 rows in the original test, 31,623 in mine

-- dba_source created 80,000 rows in the original test, 174,065 in mine

 

My results in 9i

 

dba_source (row len 352)
------------------------
                    2k          16k
Blocks          37,138        4,962
Extents             73           78
Chained rows    12,524            0


dba_objects (row len 93)
------------------------
                    2k          16k
Blocks           1,871          213
Extents              4            4
Chained rows         0            0

 

*/

 

rem

rem Now start again with the monitored tests

rem

 

drop table load_02k;

drop table load_16k;

 

spool check_blocks

 

alter system switch logfile;

execute dbms_lock.sleep(10);

 

execute snap_events.start_snap

execute snap_my_stats.start_snap

 

create table load_02k /* nologging */ tablespace ts_02k as

select * from dba_source;

 

execute snap_events.end_snap

execute snap_my_stats.end_snap

 

drop table load_02k;

 

alter system switch logfile;

execute dbms_lock.sleep(10);

 

execute snap_events.start_snap

execute snap_my_stats.start_snap

 

create table load_16k /* nologging */ tablespace ts_16k as

select * from dba_source;

 

execute snap_events.end_snap

execute snap_my_stats.end_snap

 

drop table load_16k;

 

spool off

NOTE: this script is not intended to prove anything; it simply investigates a couple of anomalous results that appeared in a test published by another author, and demonstrates that you do have to apply a little thought to make your test a valid test.


Conclusion

If you want to get a realistic idea of how a large-scale operation will perform, you have to think and test very carefully before you can claim that a small test is in any way representative of a big production system.

It is often perfectly reasonable to make the claim: “This small test case has a performance issue, so the design cannot work at production scales”. It is much harder to get it right if you want to say: “This small test case performs really well, so the design will work well at production scales.” If you want your test cases to be realistic, you always have to think about the other things that will be going on elsewhere in the system.


Footnote:

If you read the original article, you will notice that one of the responses points out that the following statement made in the article is incorrect:

At each commit, the DBWR process will try to write all dirty blocks to disk, performing an expensive full-scan of the RAM data buffer (db_cache_size). To minimize DBWR work, commit as infrequently as possible.

The error seems to have been edited out of the article, but the article was brought to my attention by someone emailing me to ask for my comments on the original statement.

