Views expressed here are solely my own. Please test the code and queries that appear on this blog before applying them to a production environment.

Sunday, July 17, 2011

Indexing only the needed column values to save index space and improve performance

You can reduce index space and also improve the performance of specific SQL statements by indexing only a specific value of a column, using a function-based index in Oracle.

Below is a demonstration of this technique.
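In outline, here is what we will build below (a preview sketch using the same table, column, and index names as the demonstration that follows):

```sql
-- A CASE expression that returns NULL for uninteresting rows, so only
-- the 'OPEN' rows get entries in the index.
create index t1_01 on t1(case when status = 'OPEN' then status else null end);

-- The query must repeat the index expression verbatim to use the index.
select * from t1
 where (case when status = 'OPEN' then status else null end) = 'OPEN';
```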

First create a sample schema.
[oracle@localhost uural]$ sqlplus / as sysdba

SQL*Plus: Release 11.1.0.7.0 - Production on Sun Jul 17 05:01:51 2011

Copyright (c) 1982, 2008, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> @cre_user-U0001


User dropped.


User created.


Grant succeeded.


User altered.


Grant succeeded.


Grant succeeded.

SQL> !more cre_user-U0001.sql
drop user u0001 cascade;
create user u0001 identified by u0001;
grant create session to u0001;
alter user u0001 quota 5G on users;
grant create table to u0001;
grant dba to u0001;

Create a sample table with 1,000,000 rows and a status column whose value is 'CLOSED'.
SQL> conn u0001/u0001
Connected.
SQL> create table t1 as
  2  select
  3    rownum id,
  4    dbms_random.string('U',6) v1,
  5    lpad(rownum,10) v2,
  6    rpad('x',100) padding,
  7    'CLOSED' status
  8  from
  9    dual
 10  connect by level <= 1000000;


Table created.

SQL> desc t1
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ID                                                 NUMBER
 V1                                                 VARCHAR2(4000)
 V2                                                 VARCHAR2(40)
 PADDING                                            VARCHAR2(100)
 STATUS                                             CHAR(6)

Now insert 100 rows that have the status column set to 'OPEN'.
SQL> insert into t1
  2  select
  3    rownum id,
  4    dbms_random.string('U',6) v1,
  5    lpad(rownum,10) v2,
  6    rpad('x',100) padding,
  7    'OPEN' status
  8  from
  9    dual
 10  connect by level <= 100;

100 rows created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats('','t1',estimate_percent => null,cascade => true,method_opt => 'for all columns size 1');

PL/SQL procedure successfully completed.

SQL> set autot trace
SQL> sho autot
autotrace TRACEONLY EXPLAIN STATISTICS

Create an index on status column.
SQL> create index t1_01 on t1(status);

Index created.

SQL> exec dbms_stats.gather_table_stats('','t1',estimate_percent => null,cascade => true,method_opt => 'for all columns size 1');

PL/SQL procedure successfully completed.

SQL> alter system flush buffer_cache;

System altered.

Query the rows that have the status column value 'OPEN'.
SQL> set timing on
SQL> select * from t1 where status = 'OPEN';

100 rows selected.

Elapsed: 00:00:01.94

Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   500K|    62M|  5220   (1)| 00:01:03 |
|*  1 |  TABLE ACCESS FULL| T1   |   500K|    62M|  5220   (1)| 00:01:03 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("STATUS"='OPEN')


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
      18897  consistent gets
      18887  physical reads
          0  redo size
       4229  bytes sent via SQL*Net to client
        485  bytes received via SQL*Net from client
          8  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        100  rows processed

As you can see, the CBO did not use the index on the status column, most probably because of the column's low number of distinct values and the resulting low estimated selectivity. The query did 18887 physical reads and 18897 consistent gets, and it took 1.94 seconds to get the result.
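A side note (my own addition, not tested in this post): the statistics were gathered with method_opt => 'for all columns size 1', which creates no histograms, so the CBO assumes the two status values are evenly distributed, hence the 500K-row estimate in the plan above. Gathering a frequency histogram on the column would let it see the skew:

```sql
-- Sketch: a frequency histogram on status exposes the 100-vs-1,000,000 skew,
-- so the CBO's cardinality estimate for 'OPEN' would drop accordingly.
exec dbms_stats.gather_table_stats('', 't1', method_opt => 'for columns status size 254');
```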

Now give the CBO a hint to make it use the index on the "status" column.
SQL> alter system flush buffer_cache;

System altered.

Elapsed: 00:00:00.25

SQL> select /*+ index(t, t1_01)*/ * from t1 t where status = 'OPEN';

100 rows selected.

Elapsed: 00:00:00.04

Execution Plan
----------------------------------------------------------
Plan hash value: 2362378154

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time    |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |   500K|    62M| 10703   (1)| 00:02:09 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |   500K|    62M| 10703   (1)| 00:02:09 |
|*  2 |   INDEX RANGE SCAN          | T1_01 |   500K|       |  1262   (1)| 00:00:16 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("STATUS"='OPEN')


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         19  consistent gets
          5  physical reads
          0  redo size
      14839  bytes sent via SQL*Net to client
        485  bytes received via SQL*Net from client
          8  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        100  rows processed

As you can see above, the CBO used the index, did 5 physical reads and 19 consistent gets, and the query took 0.04 seconds, even though the CBO computed a higher cost than for the full table scan (FTS).

Let's look at the index structure.
SQL> set autot off

SQL> select BLEVEL, LEAF_BLOCKS, DISTINCT_KEYS, AVG_LEAF_BLOCKS_PER_KEY, AVG_DATA_BLOCKS_PER_KEY, NUM_ROWS from user_indexes where index_name = 'T1_01';

    BLEVEL LEAF_BLOCKS DISTINCT_KEYS AVG_LEAF_BLOCKS_PER_KEY
---------- ----------- ------------- -----------------------
AVG_DATA_BLOCKS_PER_KEY   NUM_ROWS
----------------------- ----------
         2        2513             2                    1256
                   9435    1000100


Elapsed: 00:00:00.06
SQL> validate index t1_01;

Index analyzed.

Elapsed: 00:00:01.45

SQL> select HEIGHT,BLOCKS,LF_ROWS,LF_BLKS,BR_ROWS,BR_BLKS,DEL_LF_ROWS,DISTINCT_KEYS,USED_SPACE,PCT_USED,ROWS_PER_KEY,BLKS_GETS_PER_ACCESS from index_stats;

    HEIGHT     BLOCKS    LF_ROWS    LF_BLKS    BR_ROWS    BR_BLKS DEL_LF_ROWS
---------- ---------- ---------- ---------- ---------- ---------- -----------
DISTINCT_KEYS USED_SPACE   PCT_USED ROWS_PER_KEY BLKS_GETS_PER_ACCESS
------------- ---------- ---------- ------------ --------------------
         3       2688    1000100       2513       2512          8           0
            2   18051946         90       500050             250028.5


Elapsed: 00:00:00.73
You can see from the index stats that this index has 2513 leaf blocks, which take about 18 MB of storage space.

Let's now drop this index and create a function-based index on the "status" column, but only for the value we are interested in.
SQL> drop index t1_01;

Index dropped.

Elapsed: 00:00:00.60

We use a CASE expression to tell Oracle that we want only the rows whose status column is 'OPEN' to be indexed. Since the CASE expression returns NULL for all other values, the rowids of those rows are not stored in the index structure (a B-tree index creates no entry when all of its indexed expressions are NULL).
SQL> create index t1_01 on t1(case when status = 'OPEN' then status else null end);

Index created.

Elapsed: 00:00:01.87
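As a side note, the same partial index could be written with DECODE or NULLIF instead of CASE. These alternatives are my addition, not part of the original test, and the index names t1_01a/t1_01b are hypothetical:

```sql
-- DECODE returns NULL when status matches none of the listed values,
-- so this is equivalent to the CASE version above.
create index t1_01a on t1(decode(status, 'OPEN', status));

-- NULLIF(status, 'CLOSED') returns NULL only for 'CLOSED' rows; it is
-- equivalent here only because status takes just these two values.
create index t1_01b on t1(nullif(status, 'CLOSED'));
```

As with the CASE version, a query can use these indexes only if its WHERE clause repeats the same expression verbatim.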
SQL> exec dbms_stats.gather_table_stats('','t1',estimate_percent => null,cascade => true,method_opt => 'for all columns size 1');

PL/SQL procedure successfully completed.

Elapsed: 00:00:17.51
SQL> alter system flush buffer_cache;

System altered.

Elapsed: 00:00:00.11
SQL> set autot trace

SQL> sho autot
autotrace TRACEONLY EXPLAIN STATISTICS

Now we must use, in the WHERE clause of our query, the same expression that we used when creating the function-based index, so that the query can use the new index.
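The flip side of this requirement (my addition, not from the original transcript): a predicate on the plain column no longer matches the index expression, so it cannot use this index:

```sql
-- Does NOT match the index expression: back to a full table scan.
select * from t1 where status = 'OPEN';

-- Matches the indexed expression verbatim, so the index can be used.
select * from t1
 where (case when status = 'OPEN' then status else null end) = 'OPEN';
```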
SQL> select * from t1 where (case when status = 'OPEN' then status else null end) = 'OPEN';

100 rows selected.

Elapsed: 00:00:00.03

Execution Plan
----------------------------------------------------------
Plan hash value: 2362378154

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time    |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |   100 | 13200 |     3   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |   100 | 13200 |     3   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_01 |   100 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access(CASE  WHEN "STATUS"='OPEN' THEN "STATUS" ELSE NULL END ='OPEN')


Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
         19  consistent gets
          4  physical reads
          0  redo size
       4229  bytes sent via SQL*Net to client
        485  bytes received via SQL*Net from client
          8  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        100  rows processed

From the trace we see that this statement did 4 physical reads and 19 consistent gets and completed in 0.03 seconds. These values are very close to the ones we got from the earlier test with the regular index.
But what about the storage space this new index uses? Let's check.
SQL> set autot off
SQL> select BLEVEL, LEAF_BLOCKS, DISTINCT_KEYS, AVG_LEAF_BLOCKS_PER_KEY, AVG_DATA_BLOCKS_PER_KEY, NUM_ROWS from user_indexes where index_name = 'T1_01';

    BLEVEL LEAF_BLOCKS DISTINCT_KEYS AVG_LEAF_BLOCKS_PER_KEY
---------- ----------- ------------- -----------------------
AVG_DATA_BLOCKS_PER_KEY   NUM_ROWS
----------------------- ----------
         0           1             1                       1
                      2        100


Elapsed: 00:00:00.01
SQL> validate index t1_01;

Index analyzed.

Elapsed: 00:00:00.04
SQL> select HEIGHT,BLOCKS,LF_ROWS,LF_BLKS,BR_ROWS,BR_BLKS,DEL_LF_ROWS,DISTINCT_KEYS,USED_SPACE,PCT_USED,ROWS_PER_KEY,BLKS_GETS_PER_ACCESS from index_stats;

    HEIGHT     BLOCKS    LF_ROWS    LF_BLKS    BR_ROWS    BR_BLKS DEL_LF_ROWS
---------- ---------- ---------- ---------- ---------- ---------- -----------
DISTINCT_KEYS USED_SPACE   PCT_USED ROWS_PER_KEY BLKS_GETS_PER_ACCESS
------------- ---------- ---------- ------------ --------------------
         1          8        100          1          0          0           0
            1       1800         23          100                 51.5

Elapsed: 00:00:00.04
As we can see, this new function-based index has only 1 leaf block and takes only about 1.8 KB of storage space, far less than the roughly 18 MB used by the regular index.
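To avoid repeating the CASE expression in every query, one option (my addition; it assumes the 11g virtual column feature, and the column name open_status and index name t1_02 are hypothetical) is to wrap the expression in a virtual column and index that:

```sql
-- The virtual column is computed from status and occupies no table storage.
alter table t1 add (open_status as (case when status = 'OPEN' then status else null end));

-- An index on a virtual column is still function-based under the covers,
-- but queries can now reference a plain column name.
create index t1_02 on t1(open_status);

select * from t1 where open_status = 'OPEN';
```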

Now we can go one step further. Suppose that instead of a status column we have two date columns, "date_to_be_processed" and "date_processed". Every row is inserted by a batch process with its "date_to_be_processed" column set to a specific date, normally the next day. When that day comes, another job selects only the rows whose "date_to_be_processed" column equals that date and whose "date_processed" column is NULL. In this case we are not interested in the rows whose "date_processed" value is already set. Below is a demonstration of that case.

SQL> alter table t1 drop column status;


Table altered.

Elapsed: 00:02:00.87

SQL> alter table t1 add (date_processed date);

Table altered.

Elapsed: 00:00:00.38
SQL> alter table t1 add (date_to_be_processed date);

Table altered.

Elapsed: 00:00:00.05
SQL> truncate table t1;

Table truncated.

Elapsed: 00:00:02.35

We insert 1,000,000 rows whose "date_processed" columns are already set; each day has 100 processed rows.
SQL> insert into t1
  2      select
  3        rownum id,
  4        dbms_random.string('U',6) v1,
  5        lpad(rownum,10) v2,
  6        rpad('x',100) padding,
  7        trunc(sysdate) - (10000 - trunc((rownum-1)/100)) date_processed,
  8        trunc(sysdate) - (10000 - trunc((rownum-1)/100)) date_to_be_processed
  9      from
 10       dual
 11     connect by level <= 1000000;


1000000 rows created.

Elapsed: 00:01:04.80
SQL> set pause on
SQL> select date_processed, count(*) from t1 group by date_processed order by date_processed;


DATE_PROC   COUNT(*)
--------- ----------
29-FEB-84        100
01-MAR-84        100
02-MAR-84        100
03-MAR-84        100
04-MAR-84        100
05-MAR-84        100
06-MAR-84        100
07-MAR-84        100
08-MAR-84        100
09-MAR-84        100
10-MAR-84        100
11-MAR-84        100
...
Ctrl+C
13 rows selected.

Elapsed: 00:00:05.70

SQL> select date_processed, count(*) from t1 group by date_processed order by date_processed desc;


DATE_PROC   COUNT(*)
--------- ----------
16-JUL-11        100
15-JUL-11        100
14-JUL-11        100
13-JUL-11        100
12-JUL-11        100
11-JUL-11        100
10-JUL-11        100
09-JUL-11        100
08-JUL-11        100
07-JUL-11        100
06-JUL-11        100
05-JUL-11        100
...
Ctrl+C
13 rows selected.

Elapsed: 00:00:04.60

Now insert 100 rows that should be processed today, with their "date_processed" columns set to NULL.
SQL>  insert into t1
  2      select
  3        rownum id,
  4        dbms_random.string('U',6) v1,
  5        lpad(rownum,10) v2,
  6        rpad('x',100) padding,
  7        null date_processed,
  8        trunc(sysdate) date_to_be_processed
  9      from
 10       dual
 11     connect by level <= 100;

100 rows created.

Elapsed: 00:00:00.04
SQL> commit;

Commit complete.

Elapsed: 00:00:00.00
SQL> select date_processed, count(*) from t1 group by date_processed order by date_processed desc;


DATE_PROC   COUNT(*)
--------- ----------
                 100
16-JUL-11        100
15-JUL-11        100
14-JUL-11        100
13-JUL-11        100
12-JUL-11        100
11-JUL-11        100
10-JUL-11        100
09-JUL-11        100
08-JUL-11        100
07-JUL-11        100
06-JUL-11        100
...
Ctrl+C
13 rows selected.

Elapsed: 00:00:07.02

Create a regular composite index on these columns, as we normally would.
SQL> create index t1_01 on t1(date_processed, date_to_be_processed);

Index created.

Elapsed: 00:00:04.99
SQL> exec dbms_stats.gather_table_stats('','t1',estimate_percent => null,cascade => true,method_opt => 'for all columns size 1');

PL/SQL procedure successfully completed.

Elapsed: 00:00:24.62
SQL> set autot trace
SQL> alter system flush buffer_cache;

System altered.

Elapsed: 00:00:00.07

Now query the rows that we are interested in.
SQL> select * from t1 where date_processed is null and date_to_be_processed = trunc(sysdate);

100 rows selected.

Elapsed: 00:00:00.08

Execution Plan
----------------------------------------------------------
Plan hash value: 2362378154

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time    |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |     1 |   140 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |     1 |   140 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_01 |     1 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("DATE_PROCESSED" IS NULL AND
              "DATE_TO_BE_PROCESSED"=TRUNC(SYSDATE@!))
       filter("DATE_TO_BE_PROCESSED"=TRUNC(SYSDATE@!))


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         21  consistent gets
          5  physical reads
        188  redo size
      15117  bytes sent via SQL*Net to client
        485  bytes received via SQL*Net from client
          8  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        100  rows processed

SQL> set autot off
SQL> select BLEVEL, LEAF_BLOCKS, DISTINCT_KEYS, AVG_LEAF_BLOCKS_PER_KEY, AVG_DATA_BLOCKS_PER_KEY, NUM_ROWS from user_indexes where index_name = 'T1_01';


    BLEVEL LEAF_BLOCKS DISTINCT_KEYS AVG_LEAF_BLOCKS_PER_KEY
---------- ----------- ------------- -----------------------
AVG_DATA_BLOCKS_PER_KEY   NUM_ROWS
----------------------- ----------
         2        3774         10001                       1
                      2    1000100


Elapsed: 00:00:00.60
SQL> validate index t1_01;

Index analyzed.

Elapsed: 00:00:00.97
SQL> select HEIGHT,BLOCKS,LF_ROWS,LF_BLKS,BR_ROWS,BR_BLKS,DEL_LF_ROWS,DISTINCT_KEYS,USED_SPACE,PCT_USED,ROWS_PER_KEY,BLKS_GETS_PER_ACCESS from index_stats;


    HEIGHT     BLOCKS    LF_ROWS    LF_BLKS    BR_ROWS    BR_BLKS DEL_LF_ROWS
---------- ---------- ---------- ---------- ---------- ---------- -----------
DISTINCT_KEYS USED_SPACE   PCT_USED ROWS_PER_KEY BLKS_GETS_PER_ACCESS
------------- ---------- ---------- ------------ --------------------
         3       3968    1000100       3774       3773         15           0
        10001   27107816         90          100                 53.5


Elapsed: 00:00:04.46

It did 21 consistent gets and 5 physical reads, and it took 0.08 seconds to complete. The index we created has 3774 leaf blocks and takes about 27 MB of storage space.

Now try a function-based index on only the specific values we are interested in. We tell Oracle to put into the index the "date_to_be_processed" values of only those rows whose "date_processed" column is NULL.
SQL> set lines 120
SQL> drop index t1_01;

Index dropped.

Elapsed: 00:00:00.34
SQL> create index t1_01 on t1(case when date_processed is null then date_to_be_processed else null end);

Index created.

Elapsed: 00:00:02.22
SQL> exec dbms_stats.gather_table_stats('','t1',estimate_percent => null,cascade => true,method_opt => 'for all columns size 1');

PL/SQL procedure successfully completed.

Elapsed: 00:00:17.58
SQL> alter system flush buffer_cache;

System altered.

Elapsed: 00:00:00.03
SQL> set autot trace

SQL> alter system flush buffer_cache;

System altered.

Elapsed: 00:00:00.04
SQL> select * from t1 where (case when date_processed is null then date_to_be_processed else null end) = trunc(sysdate);

100 rows selected.

Elapsed: 00:00:00.02

Execution Plan
----------------------------------------------------------
Plan hash value: 2362378154

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |   100 | 14100 |     3   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |   100 | 14100 |     3   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_01 |   100 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access(CASE  WHEN "DATE_PROCESSED" IS NULL THEN
              "DATE_TO_BE_PROCESSED" ELSE NULL END =TRUNC(SYSDATE@!))


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         17  consistent gets
          3  physical reads
          0  redo size
       4311  bytes sent via SQL*Net to client
        485  bytes received via SQL*Net from client
          8  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        100  rows processed

SQL> set autot off
SQL> select BLEVEL, LEAF_BLOCKS, DISTINCT_KEYS, AVG_LEAF_BLOCKS_PER_KEY, AVG_DATA_BLOCKS_PER_KEY, NUM_ROWS from user_indexes where index_name = 'T1_01';


    BLEVEL LEAF_BLOCKS DISTINCT_KEYS AVG_LEAF_BLOCKS_PER_KEY AVG_DATA_BLOCKS_PER_KEY   NUM_ROWS
---------- ----------- ------------- ----------------------- ----------------------- ----------
         0           1             1                       1                       2        100

Elapsed: 00:00:00.98
SQL> validate index t1_01;

Index analyzed.

Elapsed: 00:00:00.18
SQL> select HEIGHT,BLOCKS,LF_ROWS,LF_BLKS,BR_ROWS,BR_BLKS,DEL_LF_ROWS,DISTINCT_KEYS,USED_SPACE,PCT_USED,ROWS_PER_KEY,BLKS_GETS_PER_ACCESS from index_stats;


    HEIGHT     BLOCKS    LF_ROWS    LF_BLKS    BR_ROWS    BR_BLKS DEL_LF_ROWS DISTINCT_KEYS USED_SPACE   PCT_USED
---------- ---------- ---------- ---------- ---------- ---------- ----------- ------------- ---------- ----------
ROWS_PER_KEY BLKS_GETS_PER_ACCESS
------------ --------------------
         1          8        100          1          0          0           0             1       1900         24
         100                 51.5


Elapsed: 00:00:00.78
SQL>
Now we see that it did 17 consistent gets and 3 physical reads and took just 0.02 seconds to get the results, which is also better than the 0.08 seconds of the previous regular-index test. But when you compare the index structures, the new index has only 1 leaf block, taking 1.9 KB of storage space, far less than the 27 MB used in the regular-index test.
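The daily job in this scenario could then be sketched as follows (my addition, not part of the original test). Note that the UPDATE must itself use the indexed expression to locate the rows, and that setting "date_processed" removes those entries from the partial index, keeping it small:

```sql
-- Pick up today's unprocessed rows via the partial function-based index.
select * from t1
 where (case when date_processed is null then date_to_be_processed else null end) = trunc(sysdate);

-- Marking the rows processed makes the CASE expression evaluate to NULL,
-- so their entries disappear from the index.
update t1
   set date_processed = sysdate
 where (case when date_processed is null then date_to_be_processed else null end) = trunc(sysdate);

commit;
```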

These examples suit highly selective cases: in these tests you are not interested in the old, historical values; you only need to select the new values, or the values you care about. Especially from the storage point of view, if the table will grow very large, to billions of rows, and you will not query the old historical values very often, then this method suits your needs very well.
