datapump and the clustering factor



The 'clustering factor' has long been discussed with regard to performance tuning and there are lots of excellent articles on that subject. I'm not going to go into any great detail about how this is calculated or indeed how it affects the optimizer plan formulation. The example below is a simple one where i use datapump to change the clustering factor of an index. This may be a useful technique if you are doing a migration anyway but it's not anything you can't do by other means.

For the purpose of this example I'll just say the clustering factor means this:

If most of the rowids in an index block point to the same 'table' block then the clustering factor is 'small' i.e. an index range scan would result in fewer table blocks having to be accessed. A smaller clustering factor makes the index range scan more likely as less i/o is needed.

Hmm even that sounded a bit woolly.... (a picture explains it best - which i don't have)

Anyway lets do the example to show how datapump can be used to help change this number.

First up we create a table and populate with 1 million rows.


SQL> create table demo (col1 number,col2 number,col3 date);

Table created.


SQL> insert into demo select rownum,trunc(dbms_random.value(0,100)),sysdate
  2   from dual connect by level <= 1000000;

1000000 rows created.


This populates col1 with a unique number (from the rownum part of the insert), col2 with a random number between 0 and 100 and col3 with the current date.

Here is some sample data



      COL1       COL2 COL3
---------- ---------- ---------
       820         39 24-FEB-14
       821         55 24-FEB-14
       822         22 24-FEB-14
       823         17 24-FEB-14
       824         49 24-FEB-14
       825         70 24-FEB-14
       826         98 24-FEB-14
       827         41 24-FEB-14
       828         32 24-FEB-14
       829         10 24-FEB-14
       830         82 24-FEB-14



Now lets create 2 indexes, one on col1 and one on col2.

SQL> create index index1 on demo (col1);

Index created.

SQL> create index index2 on demo (col2);

Index created.

SQL>


Now lets look at a few stats about the index (these were autocalculated when the index built)



select index_name,NUM_ROWS,leaf_blocks,CLUSTERING_FACTOR from user_indexes
/                          

INDEX_NAME                       NUM_ROWS LEAF_BLOCKS CLUSTERING_FACTOR
------------------------------ ---------- ----------- -----------------
INDEX1                            1000000        2226              2889
INDEX2                            1000000        1952            279845




As you can see index1 has a 'small' clustering factor as the rows were entered sequentially, an index block will contain large ranges of sequential numbers all in the same data blocks. Index2 has a very large clustering factor as the data inserted is randomized so a single index block will point to many data block rowids.

If we now do a quick select to get some stats and a plan we see the following:


select * from demo where col2=42;
9922 rows selected.


Statistics
----------------------------------------------------------
          9  recursive calls
          2  db block gets
       3645  consistent gets
        638  physical reads
        148  redo size
     271576  bytes sent via SQL*Net to client
      13657  bytes received via SQL*Net from client
        663  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       9922  rows processed


SQL> select * from table(dbms_xplan.display_cursor(null,null,'basic'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
select * from demo where col2=42

Plan hash value: 4000794843

----------------------------------
| Id  | Operation         | Name |
----------------------------------
|   0 | SELECT STATEMENT  |      |
|   1 |  TABLE ACCESS FULL| DEMO |

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
----------------------------------



So a full table scan as the optimizer looked at the clustering factor (decided it was crap) and then went and did a full table scan as it thought this would be quicker.


Now lets do a straightforward export/import of the whole table and see what difference that makes to things.

[oracle@localhost ~]$ expdp rich/rich tables=demo

Export: Release 11.2.0.2.0 - Production on Mon Feb 24 08:21:42 2014

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Starting "RICH"."SYS_EXPORT_TABLE_01":  rich/******** tables=demo
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 23 MB
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/INDEX/INDEX
Processing object type TABLE_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
. . exported "RICH"."DEMO"                               19.05 MB 1000000 rows
Master table "RICH"."SYS_EXPORT_TABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for RICH.SYS_EXPORT_TABLE_01 is:
  /home/oracle/app/oracle/admin/orcl/dpdump/expdat.dmp
Job "RICH"."SYS_EXPORT_TABLE_01" successfully completed at 08:22:06

[oracle@localhost ~]$ impdp rich/rich tables=demo table_exists_action=replace

Import: Release 11.2.0.2.0 - Production on Mon Feb 24 08:22:24 2014

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Master table "RICH"."SYS_IMPORT_TABLE_01" successfully loaded/unloaded
Starting "RICH"."SYS_IMPORT_TABLE_01":  rich/******** tables=demo table_exists_action=replace
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
. . imported "RICH"."DEMO"                               19.05 MB 1000000 rows
Processing object type TABLE_EXPORT/TABLE/INDEX/INDEX
Processing object type TABLE_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
Job "RICH"."SYS_IMPORT_TABLE_01" successfully completed at 08:22:37


SQL> exec dbms_stats.gather_table_stats(USER,'DEMO',cascade=>TRUE);

PL/SQL procedure successfully completed.


So thats all reloaded lets see what the stats look like now:

select * from demo where col2=42
9922 rows selected.


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       3563  consistent gets
       2900  physical reads
          0  redo size
     271576  bytes sent via SQL*Net to client
      13657  bytes received via SQL*Net from client
        663  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       9922  rows processed

SQL> select * from table(dbms_xplan.display_cursor(null,null,'basic'));
EXPLAINED SQL STATEMENT:
------------------------
select * from demo where col2=42

Plan hash value: 4000794843

----------------------------------
| Id  | Operation         | Name |
----------------------------------
|   0 | SELECT STATEMENT  |      |
|   1 |  TABLE ACCESS FULL| DEMO |
----------------------------------


And it's essentially the same - we can see when we look at the clustering factor stats again that index2 is still 'big'

SQL> select index_name,NUM_ROWS,leaf_blocks,CLUSTERING_FACTOR from user_indexes
/    2 
INDEX1                            1000000        2226              5219
INDEX2                            1000000        1952            280764


 The reason for this? Well so far we haven't really changed anything. we dumped the table out and put it back in - this does nothing to change the order of the data that would influence the clustering factor. The numbers are slightly different but not by any amount that really changes things. So how do we improve the clustering of the data in col2?

This is where we can use datapump query. For the clustering factor to improve (i.e. get smaller) then the col2 rows need to be located close to each other in the table block - that way the index on col2 will be largely pointing to the same table block each time.

So how do we do that - easy add an order by to make the extract sort them as it unloads the data

So here is my parfile

[oracle@localhost ~]$ cat rich.par
include=TABLE:"='DEMO'"
query="where 1=1 order by col2"


Lets tidy up and then run the export

[oracle@localhost ~]$ rm /home/oracle/app/oracle/admin/orcl/dpdump/expdat.dmp
[oracle@localhost ~]$ expdp rich/rich parfile=rich.par

Export: Release 11.2.0.2.0 - Production on Mon Feb 24 08:30:25 2014

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Starting "RICH"."SYS_EXPORT_SCHEMA_01":  rich/******** parfile=rich.par
Estimate in progress using BLOCKS method...
Processing object type SCHEMA_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 24 MB
Processing object type SCHEMA_EXPORT/TABLE/TABLE
Processing object type SCHEMA_EXPORT/TABLE/INDEX/INDEX
Processing object type SCHEMA_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
Processing object type SCHEMA_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
. . exported "RICH"."DEMO"                               19.04 MB 1000000 rows
Master table "RICH"."SYS_EXPORT_SCHEMA_01" successfully loaded/unloaded
******************************************************************************
Dump file set for RICH.SYS_EXPORT_SCHEMA_01 is:
  /home/oracle/app/oracle/admin/orcl/dpdump/expdat.dmp
Job "RICH"."SYS_EXPORT_SCHEMA_01" successfully completed at 08:30:51


OK dumpfile now extracted which should have the data in col2 order now. Lets put it back in and see what happens


[oracle@localhost ~]$ impdp rich/rich tables=demo table_exists_action=replace

Import: Release 11.2.0.2.0 - Production on Mon Feb 24 08:31:15 2014

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Master table "RICH"."SYS_IMPORT_TABLE_01" successfully loaded/unloaded
Starting "RICH"."SYS_IMPORT_TABLE_01":  rich/******** tables=demo table_exists_action=replace
Processing object type SCHEMA_EXPORT/TABLE/TABLE
Processing object type SCHEMA_EXPORT/TABLE/TABLE_DATA
. . imported "RICH"."DEMO"                               19.04 MB 1000000 rows
Processing object type SCHEMA_EXPORT/TABLE/INDEX/INDEX
Processing object type SCHEMA_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
Processing object type SCHEMA_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
Job "RICH"."SYS_IMPORT_TABLE_01" successfully completed at 08:31:24


We need to gather stats again as the import puts the 'bad' stats back on - we need to regather fresh stats based on the data now.

SQL> exec dbms_stats.gather_table_stats(USER,'DEMO',cascade=>TRUE);

PL/SQL procedure successfully completed.


Now lets look at the clustering factor


SQL> select index_name,NUM_ROWS,leaf_blocks,CLUSTERING_FACTOR from user_indexes;

INDEX_NAME                       NUM_ROWS LEAF_BLOCKS CLUSTERING_FACTOR
------------------------------ ---------- ----------- -----------------
INDEX1                            1000000        2226            990062
INDEX2                            1000000        1952              2905


And we can see that the clustering factor has drastically changed. It is now small for col2 and large for col1. So the data in the table is in col2 order and not col1 order.

Lets now try that select again

select * from demo where col2=42
9922 rows selected.


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1373  consistent gets
         31  physical reads
          0  redo size
     311264  bytes sent via SQL*Net to client
      13657  bytes received via SQL*Net from client
        663  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       9922  rows processed


SQL> select * from table(dbms_xplan.display_cursor(null,null,'basic'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
select * from demo where col2=42

Plan hash value: 1024027054

----------------------------------------------
| Id  | Operation                   | Name   |
----------------------------------------------
|   0 | SELECT STATEMENT            |        |
|   1 |  TABLE ACCESS BY INDEX ROWID| DEMO   |
|   2 |   INDEX RANGE SCAN          | INDEX2 |
----------------------------------------------


14 rows selected.







So the plan has changed as there are now considerably less data blocks that need to be accessed to find where col2=42 as the data is 'clustered' together.

So in this particular case if this query and the clustering of this data is important for our performance then we should look at ordering the data in the table - perhaps by using the datapump method.

However - beware - the clustering factor for col1 is now 'crap' - so you may resolve one issue and cause another.

As with all performance issues - the key thing is to know your data.



Comments