Parallel Query Optimisation

Large Database Query Optimization using Parallel Query (Tech Note)

Introduction

Oracle optimizer is based on the assumption that your database is going to be accessed in a transactional way. When you're designing data warehouse, you'll realize the limits of those assumptions. Optimizing large queries is a difficult task. You've got to go through those different steps:
  1. Analyzing the execution plan table
  2. Checking if Star Query optimization is happening/appropriate
  3. Deciding to use Parallel Query
  4. Changing the execution Query
  5. Optimizing Full scan
  6. Droping indexes
  7. Testing that PQ is actually used
The present article shows rule of thumbs based on specific cases. Tuning is not an exact science. For most simple cases, the Oracle Optimizer does a good job if you do complete statistics (including historical measure on non randomly distributed columns). But for some cases, the optimizer takes very bad assumptions and can lead to queries that can take years to complete !. In some very specific cases (7 joins) a good Parallel Query combined with PL/SQL (using PL binary table) can reduce year-long computing to only 40 minutes.

Analyzing the execution plan table

A classical path for a transaction looks like this:

Nested Loop

Nested Loop
    Sales    Index Unique Primary Key
    Sales    Access by ROWID
This double nested loop means that: for each element of Sales Ticket, scan the unique index of sales. For each element of sales, get the information missing from the index data. This is a classical plan table for a transactional system.
The time needed to perform that join is:
TimeToAccessByRowID * NumberOfGoodElementOfSalesID.

Let's look at a complex query:
Nested Loop
Nested Loop
Nested Loop
Nested Loop
Nested Loop
The time needed to perform this join is
TimeToAccessByRowID * NumberOfGoodElement(TableA) * NumberOfGoodElement(TableB) * NumberOfGoodElement(TableC)

The same query using full scan and merge join looks like this
Merge Join

Sort Join
fullscan    C
Merge Join
Sort Join
fullscan    A
fullscan    B
the time needed to perform that join is
TimeFullScan(A)+TimeFullScan(B)+TimeFullScan(C)+Time SortJoin(A,B)+Time MergeJoin(A,B)+TimeSortJoin(C,Match(A,B))+TimeMergeJoin(C,Match(A,B))

Eventhough your indexes are valid in the context (The number of Good Element is less than 5% of the table), you can see that the more joins you do, the longer the index method takes. Actually one can say that the access time is proportional to power(AccessTableByRowID,NumberOfJoins) which is to be compared to AccessOfFullScan * NumberOfJoin.

What is Star Query Optimization ?

You'll probably see an execution plan like this

Merge Join Cartesian

fullscan A
fullscan B
This is an effect of the Star Query Optimization. Star Query states that if table A and B are small, then it's more efficient to do a direct Cartesian Product (i.e. try all the combinaisons) than to do a regular Join. It's very efficient when the tables are actually small. The time to query such a join is proportional to NumberOfLine(Table A) * NumberOfLine(Table B). The only problem is that Oracle considers a table as small if it fits in the SGA. If you have a big SGA (let's say 500 MO) then any couple of table smaller than 500 MO is candidate to such query optimisation. Actually in alot of cases, you don't want such Star Query.

Deciding to use Parallel Query

Contrary to access by index, Full Scan can really be enhanced by Parallel Query. If you have multiple cpu and multiple hdisk than you can declare the parallel level of the access to a table. The parallel level is between the number of cpu available and the number of disks that your machine can access simultaneously. For example on a Bull Risc6000 machine with 8 processors accessing 16 hdisk, the optimal level found on a specific case was 12. Of course Parallel Server is less efficient than Parallel Query on a single machine (due mainly to the need of coordination of the different SGA between the machine), which is less efficient than classical Query executed on a monoprocessing machine (due to the cost of the synchronization of the different master processes). The right parallel level is closely linked to the specificity of the query, the data and the machine. You've got to make benchmarks for each join.

Changing execution plan

To force parallel query and making it efficient, you've got to be sure to eliminate all Nested Loop. External Nested Loop are the most costly.

For exemple:

Nested Loop
Nested Loop
Index C1 non unique
Access C by ROWID
Merge Join
Sort Join
Full Scan A
Full Scan B
... is an example of poor optimization because it forces a full scan for each item found in the index. A nested loop is usually associated with an index. A basic solution is to include hints like:

Select /*+ ORDERED FULL(CC) FULL(AA) FULL(BB) */ ... from A AA, B BB, C CC ....;

which change the execution plan to:

Merge Join

Sort Join
Full Scan (C)
Merge Join
Sort Join
Full Scan A
Full Scan B
The ORDERED clause is important. As with indexes, the order of the table to be associated, affects the performance of the join. As a rule of thumb, you should stick to the order of the table given by the use of indexes. Using full scan still requires collecting statistics about the validity of indexes and the historic of column. If an index is better to join two tables (i.e. is the most selective in term of tupples returned), the merge join is going to be a good filter aswell. Remember that the performance of the merge join is linked to the number of tupples returned: the less, the better.

Optimizing Full Scan

Parallel Query is basically executing tasks which are candidates to parallel execution. What can be done in parallel ?: Full Scan, sorting, hash and merge joins. How does it work? Parallel Query makes the assumption that "extents" in tables can be accessed simultaneously. Each slave processes read it's own extent. All the work for the dba is:
make sure that the different extents are allocated on different disks that can be accessed simultaneously (i.e. connected to different driver cards)
make sure that the data is homogeneously distributed between the different extents (i.e. all the extents have the same number of records).
This is what we call Oracle Stripping.
It's done typically for a parallel level of 4 by a :
create table foo
...
tablespace tableAstripped
minextents    4
maxextents   4
size    50Mo;
The tablespace itself must be stripped by using multiple files:
create tablespace tableFooStripped
files
'/u08/oradata/SALE/FOO_08.dbf'   size 63Mo,
'/u09/oradata/SALE/FOO_09.dbf'   size 63Mo,
'/u10/oradata/SALE/FOO_10.dbf    size 63Mo,
'/u11/oradata/SALE/FOO_11.dbf'   size 63Mo;

Those files are actually hardlinks to logical raw devices. Logical raw devices are another source of optimization. When Oracle accesses database files, it looks first to see if the block is not in the SGA. The SGA is a very efficient cache system, better than the default Aix File System cache. It's better to use a big SGA which has a consistent size in system memory, than relying only on the Aix Cache system which grows continuously and may swap the SGA out of memory. Using Raw devices allows you to bypass the Aix Cache system. Thats why Raw devices are more efficient standard File System (JFS).

Of course that sort of tuning is a long and tedious activity. You've got to know the actual size of the table, the degree of parallelism, having plenty of disk and plenty of memory. If the table increases too quickly you'll realize that the last slave processes reading the table will work much more than the others. PQ requires not only to distribute evenly the data between the axes but to distribute the process time between the processors too.
In the example you can see that the extents of the table are smaller than the "extent" of the data file. In fact you can have a file "extent" size slightly smaller than the double of the size of the table extent. Oracle creates a table extent on a new file when the extent doesnt fit on the remainder of the previous file. That allows you to prepare increases in the size of the table. You can always recreate the table later like this:

drop table foo;
create table foo
...
tablespace tableFooStripped
minextents    16
maxextents   16
size    55Mo;
and be sure that each extent sits on a different file, i.e. on a different hard disk.

The same way you can do an over stripping if you have big space capacity and alot of disks like this:
create tablespace tableFooOverStripped
files
'/u08/oradata/SALE/FOO_08.dbf'   size 15Mo
'/u09/oradata/SALE/FOO_09.dbf'   size 15Mo
'/u10/oradata/SALE/FOO_10.dbf    size 15Mo
'/u11/oradata/SALE/FOO_11.dbf    size 15Mo
'/u12/oradata/SALE/FOO_12.dbf    size 15Mo
'/u13/oradata/SALE/FOO_13.dbf'   size 15Mo
'/u14/oradata/SALE/FOO_14.dbf'   size 15Mo
'/u15/oradata/SALE/FOO_15.dbf'   size 15Mo
'/u15/oradata/SALE/FOO_16.dbf'   size 15Mo
'/u17/oradata/SALE/FOO_17.dbf'   size 15Mo
'/u18/oradata/SALE/FOO_18.dbf'   size 15Mo
'/u19/oradata/SALE/FOO_19.dbf'   size 15Mo
'/u20/oradata/SALE/FOO_20.dbf'   size 15Mo
'/u21/oradata/SALE/FOO_21.dbf'   size 15Mo
'/u22/oradata/SALE/FOO_22.dbf'   size 15Mo
'/u23/oradata/SALE/FOO_23.dbf'   size 15Mo
'/u24/oradata/SALE/FOO_24.dbf'   size 15Mo
default size
min extent    16
max extent    16
size          12Mo...
create table foo
...
tablespace    tableFooOverStripped;
Of course you've got to make sure that the different disks are not accessed at the same time. Parallel Query requires a very big temporary tablespace and a big rollback segment. You should separate those different tablespaces:

  • SYSTEM
  • ROLLBACK SEGMENT
  • TEMPORARY TABLESPACE
  • Tables that are not stripped
  • Each stripped table
  • Dropping indexes

    The use of indexes on tables, to be read in parallel, is dangerous. Usually it modifies the execution plan encouraging the optimizer to use an index based on the assumption that a query on less than 5% of the table goes faster using an index if available (the famous 5/95 rule). Unless you are prepared to patiently tune each query with appropriate hints, it's faster to just drop all indexes on the table. That will force the optimizer to use full scan in most of the cases (excepting when a primary key or a foreign key will be elected!).
     

    How to check that PQ is occuring efficiently:

    The first thing to check is that the parallel slave processes are running using a ps.
    The second thing is to check disk activities using iostat. If the query is well balanced, during the parallel phase of the query you should detect that all the disks carrying the stripped tables are working at the same time with a i/o rate (typically over 4Mb/s). Only those disks plus the temporary table space should work unless you experience system swapping. In fact by analysing the iostat you can detect the indexes access phase (i/o rate under 400k/s) or the sorting phase (idle time 0).
    Last Update : $Date: Jan 23 2002 15:39:00 $