A slightly unusual use of pipeline functions
A large part of my job is to find ways to make databases “go faster” without making any significant changes to structure or code. “Significant” is a bit vague, of course, but typically means things like “invisible to the application” as far as the structure is concerned, or “very localized” if code is involved, and always means “minimum risk, minimum side-effects”.
I have come across two examples recently where short data-loading scripts could be modified very locally to reduce, quite dramatically, the resources used – particularly CPU and disk activity. It isn’t a mechanism that will be universally appropriate – but for the right data pattern it’s certainly worth considering.
The requirement is quite simple. We had data loads where rows belong to ‘groups’ and need to be given some form of ‘group id’ as they arrive in the database. There are many ways to implement this type of requirement, with various costs involved. One option is to load the data, create a summary (or parent) table for the loaded data, then use the summary table to generate a sequence-based id. Another option would be to play around with analytic functions, partitioning on the “group” columns, and using a mechanism to bring in a “group id” on the first row of each “group”.
Every option introduces its own drawbacks – the first option, for example, may require you to do too much work populating and updating real (or perhaps global temporary) tables; the second may do an excessive amount of sorting as a side effect of using analytic functions if the rows are wide.
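As a point of comparison, the analytic option might look something like the following sketch. The table t1 and its grouping columns group_1 and group_2 match the script further down; note that dense_rank() is one hypothetical way of generating the group id without a sequence:

```sql
rem A sketch of the analytic alternative: dense_rank() gives every
rem row in the same (group_1, group_2) combination the same number -
rem but the sort has to carry the full width of every row.

select
        dense_rank() over (order by group_1, group_2)   group_id,
        t1.*
from
        t1
;
```

The cost that makes this unattractive for wide rows is visible in the execution plan as a WINDOW SORT over the entire row, not just the grouping columns.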
So, as an extra weapon in the armoury, here’s a method that uses pipeline functions to minimise the I/O and CPU costs of sorting or of using intermediate tables. The example is given as a script with in-line comments. The notional task is to add a sequence number to rows, based on grouping rows by the values of multiple columns.
rem Script: pipe_sequence.sql
rem Author: Jonathan Lewis
rem Dated: June 2006
rem Purpose: Getting a sequence number onto grouped data.
rem Last tested
rem Not tested
rem Not relevant
rem There are various ways you might group data, and then
rem add a sequence number (or other unique identifier) to
rem rows that belong to the same group.
rem This case was based on a 'store card' number used to group
rem together purchases that occurred within a short time period
rem of each other. (The real system defined a visit as all
rem purchases within 10 minutes of each other - which required
rem playing games with rounding on a time interval).
rem Other options:
rem     use of analytic functions
rem     creating global temporary tables
rem     calling functions that return either nextval or currval
rem Change the function to use pl/sql array fetches (bulk collect)
drop table t1;
drop sequence s1;
drop type jpl_table_type;
drop type jpl_scalar_type;
rem We will be collating rows with the same value for columns
rem group_1 and group_2, and giving them a single sequence value
create sequence s1;
create table t1 as
select
        trunc(rownum/4)         group_1,        -- illustrative data pattern
        mod(rownum,2)           group_2,        -- illustrative data pattern
        rownum                  id,
        rpad('x',100)           padding
from
        all_objects
where
        rownum <= 100
;
rem A pipeline function needs a table (array) type as its return
rem type, and a corresponding scalar type for each row it creates.
create type jpl_scalar_type as object(
        group_1         number,
        group_2         number,
        seq_id          number
);
/

create type jpl_table_type as table of jpl_scalar_type;
/
rem We select distinct the columns that are the identifying
rem columns – by hiding this query in a pipeline function we
rem a) can use the nextval construct
rem b) avoid creating an intermediate table
rem c) minimise the width of the row being sorted
create or replace function pipe_fun
return jpl_table_type
pipelined
as
begin
        for r in (
                select group_1, group_2, s1.nextval seq_id
                from    (
                        select distinct group_1, group_2
                        from    t1
                        order by group_1, group_2
                        )
        ) loop
                pipe row (jpl_scalar_type(r.group_1, r.group_2, r.seq_id));
        end loop;
        return;
end;
/
rem To demonstrate how to use the mechanism we join
rem the pipeline function to the base table on the grouping
rem columns – which allows us to append the sequence value
rem to the end of the row in (apparently) one step
break on seq_id skip 1
column padding noprint

select
        /*+ ordered use_hash (t) */
        v.seq_id, t.group_1, t.group_2, t.id, t.padding
from
        table(pipe_fun())       v,
        t1                      t
where
        t.group_1 = v.group_1
and     t.group_2 = v.group_2
order by
        v.seq_id, t.id
;
rem Note – there is an “implementation detail” in 9i
rem which means you hit three rows in the dictionary
rem cache every time you fetch a row from the pipeline
rem function. This is a concurrency threat that has been
rem eliminated in 10g.
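The script’s opening comments suggest changing the function to use PL/SQL array fetches (bulk collect). A sketch of that variation – assuming the same t1, s1 and jpl types, with a hypothetical function name and an arbitrary batch size of 100 – might look like this:

```sql
create or replace function pipe_fun_bulk
return jpl_table_type
pipelined
as
        -- Fetch the distinct group combinations in batches rather
        -- than row by row, to reduce the number of context switches
        -- between the SQL and PL/SQL engines.
        cursor c1 is
                select group_1, group_2, s1.nextval seq_id
                from    (
                        select distinct group_1, group_2
                        from    t1
                        order by group_1, group_2
                        );
        type w_table is table of c1%rowtype;
        w_rows  w_table;
begin
        open c1;
        loop
                fetch c1 bulk collect into w_rows limit 100;
                for i in 1 .. w_rows.count loop
                        pipe row (jpl_scalar_type(
                                w_rows(i).group_1,
                                w_rows(i).group_2,
                                w_rows(i).seq_id
                        ));
                end loop;
                exit when c1%notfound;
        end loop;
        close c1;
        return;
end;
/
```

The calling query is unchanged apart from the function name; the benefit of the array fetch grows with the number of distinct groups.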