Assessing the effectiveness of your last parallel query
The design aim of the parallel query option is to improve the throughput of large queries by cutting the time spent on numerous small I/Os, at the cost of using extra CPU. Sometimes, however, the expected performance benefit does not appear, and you need some simple tools to find out why.
The attached script (with sample output and notes on that output at the end of the page) is one such tool. Oracle keeps a log of recent inter-process communication in the session's local memory, and the v$pq_tqstat performance view makes this log visible. Each new query 'pushes down' the stats from previous queries, so the easiest way to use the script is to connect as a new session, run your query, then run this report.
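For example, a typical test run might look like the following sketch; the test_user account, the sales table and the degree of 6 are all purely hypothetical:

rem Fresh session, one parallel query, then the report.

connect test_user/test_user

select  /*+ full(s) parallel(s, 6) */
        area_code, count(*)
from    sales s
group by
        area_code
;

@pq_tq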
Interpreting the report takes some experience, especially for complex queries with multiple parallel branches; even synchronising the table queues (the tq values) with the queue ids listed under the node column of explain plan can be difficult. However, for very simple queries, there are a couple of general points to note:
Whatever explain plan tells you, it may be lying: if the report returns no rows then your query did not do any parallel execution.
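As a cross-check (my addition, not part of the original script), the v$pq_sesstat view reports whether the session's last query was parallelized at all; check the exact statistic name against your version:

select  statistic, last_query, session_total
from    v$pq_sesstat
where   statistic = 'Queries Parallelized'
;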
You may find layers highlighted in the report (individual table queues) where the workload is severely imbalanced: perhaps 4 producers in one table queue process several thousand rows each while the rest process none. In this case the effective parallelism is lower than the actual parallelism, possibly because of unlucky data distribution.
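If you suspect this, a quick check on the distribution of the column that drives the workload split can confirm whether a few popular values dominate; a sketch, again assuming the hypothetical sales table and an area_code distribution column:

select  area_code, count(*)
from    sales
group by
        area_code
order by
        count(*) desc
;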
Sometimes you will find that the number of bytes per row returned is very high. The parallel query option actually passes aggregate values around using a full 22 bytes rather than the minimum possible, and there are also various bugs (or design strategies) in all versions of the parallel query option that result in excessive repetition of aggregate columns.
In one amazing case I found a query passing round rows of 1,200 bytes between slaves when the required output could have been achieved in about 70 bytes per row; the problem was largely due to a column 'count(distinct area_code)' appearing seventeen (17) times in the slave SQL. When this occurs you need to look at the query rewrites produced by the PQO to see if you can restate your basic query in a way that reduces the PQO overhead.
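As a sketch of the sort of restatement I mean (the sales table is again hypothetical, and you would have to examine the resulting slave SQL to confirm the rewrite actually helps on your version), a count(distinct) can sometimes be recast as a count over an inline view, so that the distinct column only has to be passed between slaves once:

rem Original form - the distinct aggregate may be
rem replicated many times in the slave SQL.
select  count(distinct area_code)
from    sales
;

rem Restated form - the inline view does the distinct,
rem and the outer query counts the (much smaller) result.
select  count(*)
from    (
        select  distinct area_code
        from    sales
        )
;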
The PQO also has a problem in many versions with 'aggregate pushdown', a feature whereby sum(), count() and other aggregate functions are supposed to be executed as close to the data as possible, but are instead postponed until a higher layer in the DFO. This can result in a huge volume of message passing and a massive waste of CPU. If this is happening, the total number of rows in the bottom layer producer set is likely to be about the same as the number of rows in your table. Again, the only real solution is to see if you can rewrite your basic query to bypass the problem. (Although Oracle 8 does allow you to increase the size of the messages that pass between slaves from the default of ca. 2K, and this can help a bit.)
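A minimal sketch of that change; the name parallel_execution_message_size is my assumption for the Oracle 8 parameter, so check it against your version, and note that it is a static parameter, so it goes in the init.ora and needs an instance restart:

rem init.ora - increase the PQ message buffer from
rem its default (ca. 2K) to 8K.
parallel_execution_message_size = 8192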
rem Script: pq_tq.sql
rem Author: Jonathan Lewis
rem Purpose: Reports parallel query table queue stats
rem for the current session
rem Execute a parallel query, then run this script.
rem The inter-process volumes and identity of slaves used
rem will be dumped in a file called pq_tq.lst
rem The most recent query will be reported as the first
rem set of stats (dfo_number = 1)
set pause off
set linesize 80
set pagesize 55
set newpage 0
set trimspool on
column dfo_number format 999 heading "dfo"
column tq_id format 999 heading "tq"
column server_type format a10 heading "Serv Type"
column num_rows format 9,999,999 heading "Rows"
column bytes format 9,999,999 heading "Bytes"
column waits format 99,999 heading "Waits"
column instance format 999 heading "Instance"
column process format a12 heading "PQ Slave"
break on dfo_number skip 1 on tq_id skip 1 on report
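rem The body of the report - a minimal sketch, assuming
rem the v$pq_tqstat columns implied by the column
rem commands above.

spool pq_tq

select
        dfo_number,
        tq_id,
        server_type,
        num_rows,
        bytes,
        waits,
        instance,
        process
from
        v$pq_tqstat
order by
        dfo_number,
        tq_id,
        server_type desc,
        instance
;

spool off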
In the sample output below, the bottom layer producers (tq = 0) have all done about the same amount of work and have distributed their results fairly evenly to the bottom layer of consumers, so the real degree of parallelism is close to the effective degree of parallelism. There is a little skew at both the producer and consumer levels, though, so the query is not going to scale perfectly.
The number of bytes passed per row is about 20 at all levels, so we don't have a problem with multiple copies of aggregate columns etc.
The number of rows passed from the bottom layer producers to their consumers is quite high, and when the consumers turn around to become producers in the second table queue (tq = 1) the drop in the number of rows produced is dramatic, ca. 200 : 1 (roughly 90,000 rows in, 483 rows out).
Is this a case of aggregate push-down not working? We have to go back to the table to answer this question: the bottom layer producers pass roughly 90,000 rows in total (the six producer counts sum to 90,574), so since the table held 90,000 rows we have a problem. If, on the other hand, the table had been supposed to supply a couple of million rows, we could have been quite happy that the producers were already doing a reasonable job of aggregating results as they read them from the table.
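The check itself is trivial; a sketch, once more assuming the hypothetical sales table, with a parallel hint so that the count does not take all day on a genuinely big table:

select  /*+ full(s) parallel(s, 6) */
        count(*)
from    sales s
;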
 dfo   tq Serv Type        Rows      Bytes   Waits Instance PQ Slave
---- ---- ---------- ---------- ---------- ------- -------- ------------
   1    0 Producer       14,402    287,844       2       35 P000
          Producer       15,874    334,833       5       36 P000
          Producer       15,947    336,105       6       37 P000
          Producer       14,120    297,413       2       38 P000
          Producer       16,774    353,905       7       41 P000
          Producer       13,457    284,588       3       42 P000
          Consumer       15,862    331,399      43       35 P002
          Consumer       14,952    312,333      48       36 P002
          Consumer       12,954    270,682      36       37 P001
          Consumer       13,223    275,648      53       38 P001
          Consumer       16,321    343,649       2       41 P001
          Consumer       17,262    360,977       5       42 P001

        1 Producer           84      1,566      43       35 P002
          Producer           83      1,541      48       36 P002
          Producer           69      1,293      36       37 P001
          Producer           78      1,454      53       38 P001
          Producer           83      1,556       2       41 P001
          Producer           86      1,614       5       42 P001
          Consumer          483      9,024       9        7 QC