Channel: Teradata Downloads

Column compress values from statistics

Get multi value compression out of column statistics for free
Besides collecting statistics on the columns of your Teradata database, compressing the data to save disk space is an important maintenance task. So why not connect these two tasks? The idea is to take the values for the multi-value compression of a column from the statistics already collected on it.
 

The idea

 
Starting with Teradata 14, the statement "SHOW STATISTICS VALUES COLUMN col ON db.tab;" prints the detailed results of the last statistics collection as text (optionally as XML). The text output is exactly the statement needed to load those results back into the database. The output runs to many lines; the following ones matter for the algorithm:
 
...
 /* NumOfNulls            */ 20,
...
 /* NumOfRows             */ 3180,
...
 /** Biased: Value, Frequency **/
 /*   1 */   'N', 3147,
 /*   2 */   'Y', 13
...
 
In particular, the biased values block lists the values that occur most often in the column, together with their frequencies. These values are the natural candidates for the column's compress list.
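To make the layout of the biased values block concrete, here is a minimal sketch in shell with POSIX awk that reduces the block to tab-separated value/frequency pairs. The function name biased_pairs is hypothetical and not part of the scripts in this article, and the sketch assumes the output looks like the excerpt above, with one "/* n */ value, frequency" entry per line.

```shell
# Hypothetical helper (not from the article): read SHOW STATISTICS VALUES
# text on stdin and print "value<TAB>frequency" for each biased entry.
# Assumes the layout shown in the excerpt above.
biased_pairs() {
  awk '
    /^ *\/\*\* /        { in_biased = 0 }        # any "/** ... **/" header ends the block
    /^ *\/\*\* Biased:/ { in_biased = 1; next }  # the biased values block starts here
    in_biased && /^ *\/\* *[0-9]+ *\*\// {
      line = $0
      sub(/^ *\/\*[^\/]*\*\/ */, "", line)       # drop the leading "/* n */" counter
      sub(/,? *$/, "", line)                     # drop a trailing comma, if any
      freq = line;  sub(/.*, */, "", freq)       # text after the last comma
      value = line; sub(/, *[0-9]+$/, "", value) # text before the last comma
      printf "%s\t%s\n", value, freq
    }
  '
}
```

Fed the excerpt above on standard input, this should list 'N' with 3147 and 'Y' with 13. Splitting at the last comma keeps values that themselves contain commas intact.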
 
A column must meet the following requirements to be compressed this way:
  • The statistics must be representative and up to date; sampled statistics are acceptable
  • The column must not be part of an index or of the partitioning
  • The statistics values must have the correct (untruncated) length
  • No statistics may exist on the column at the moment it is altered

In Teradata 14 all statistics values are limited to 26 characters. To get the untruncated values you have to use the "USING MAXVALUELENGTH" clause in the COLLECT STATISTICS statement.

The last requirement disturbs the algorithm more: you cannot alter a column while statistics exist on it, so the statistics have to be dropped before the ALTER TABLE and put back afterwards.
 
The advantages are:
  • No extra cost for finding the compress values
  • Good compression results from a simple algorithm

This simple solution, kept short enough to fit on one page, has some disadvantages:

  • The procedure ignores any compress value list already defined on a column
  • The algorithm ignores multi-column statistics

The algorithm

First we execute "SHOW STATISTICS VALUES COLUMN" for each column with statistics in the table to compress. From the output we take the number of nulls and the entries of the biased values block. The number of occurrences decides which values go into the multi-value compress list: at the moment a value must be estimated to account for more than 1% of the rows. With this limit a column can never end up with more than 100 compress values. In parallel we generate a "DROP STATISTICS" statement and keep the "COLLECT STATISTICS COLUMN ... ON ... VALUES (...);" statement that puts the statistics back. With these three sets of statements we first drop the statistics, then perform the ALTER TABLE statement, and finally put the statistics back.
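The 1% rule can be sketched as follows in shell with POSIX awk. The helper name compress_list and its arguments are assumptions made for illustration, not part of the scripts in this article; it expects the biased entries as tab-separated value/frequency lines on standard input and takes NumOfRows, NumOfNulls and the cut-off percentage as parameters.

```shell
# Hypothetical helper (not from the article): build a COMPRESS value list.
# stdin: one "value<TAB>frequency" line per biased value
# $1 = NumOfRows, $2 = NumOfNulls, $3 = cut-off in percent (assumed default: 1)
compress_list() {
  awk -F '\t' -v rows="$1" -v nulls="$2" -v pct="${3:-1}" '
    BEGIN {
      cut = rows * pct / 100               # minimum row count a value must exceed
      if (nulls + 0 > cut) list = "NULL"   # NULL qualifies like any other value
    }
    $2 + 0 > cut { list = (list == "" ? $1 : list "," $1) }
    END { print list }
  '
}
```

With the figures from the excerpt earlier (3180 rows, 20 nulls, 'N' 3147 times, 'Y' 13 times) the cut-off is 31.8 rows, so only 'N' qualifies and NULL is left out. Because every listed value must exceed 1% of the rows, the list cannot grow beyond 100 entries.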

The process

The solution consists of an SQL file and an awk script (the script uses gensub, so it needs GNU awk). The SQL file generates the "SHOW STATISTICS VALUES COLUMN" statements for the columns of the tables in a useful order:

SHOW STATISTICS VALUES COLUMN col1 on dbtest.tab1;
SHOW STATISTICS VALUES COLUMN col2 on dbtest.tab1;
SHOW STATISTICS VALUES COLUMN col3 on dbtest.tab1;
SHOW STATISTICS VALUES COLUMN col1 on dbtest.tab2;
SHOW STATISTICS VALUES COLUMN col2 on dbtest.tab2;

These statements have to be executed with bteq and their output stored in one file. The awk script takes this file and produces a larger file:

DROP STATISTICS column col1 on dbtest.tab1;
DROP STATISTICS column col2 on dbtest.tab1;
DROP STATISTICS column col3 on dbtest.tab1;
ALTER TABLE dbtest.tab1 add col1 compress ( ...)
, add col2 compress ( ...)
, add col3 compress ( ...)
;
COLLECT STATISTICS COLUMN ( col1 ) ON dbtest.tab1 VALUES (...);
COLLECT STATISTICS COLUMN ( col2 ) ON dbtest.tab1 VALUES (...);
COLLECT STATISTICS COLUMN ( col3 ) ON dbtest.tab1 VALUES (...);


DROP STATISTICS column col1 on dbtest.tab2;
DROP STATISTICS column col2 on dbtest.tab2;
ALTER TABLE dbtest.tab2 add col1 compress ( ...)
, add col2 compress ( ...)
;
COLLECT STATISTICS COLUMN ( col1 ) ON dbtest.tab2 VALUES (...);
COLLECT STATISTICS COLUMN ( col2 ) ON dbtest.tab2 VALUES (...);

 

Executing these statements performs the compression. Done.

The source code

SQL File

SELECT
        'SHOW STATISTICS VALUES COLUMN ' || TRIM(BOTH FROM a.columnname)
        || ' on ' || TRIM(BOTH FROM a.databasename)
        || '.' || TRIM(BOTH FROM a.tablename) || ';' AS stmt
FROM
        dbc.ColumnStatsV a
INNER JOIN
        dbc.columns b
ON
        a.databasename = b.databasename
        AND a.tablename = b.tablename
        AND a.columnname = b.columnname
LEFT OUTER JOIN
        dbc.PartitioningConstraintsV c
ON
        a.databasename = c.databasename
        AND a.tablename = c.tablename
        AND UPPER(c.constrainttext) LIKE '%' || UPPER(a.columnname) || '%'
WHERE
        c.constrainttext IS NULL
        AND a.indexnumber IS NULL
        AND a.databasename = '${DB}'
        AND (a.databasename, a.tablename, a.columnname) NOT IN
            (SELECT databasename, tablename, columnname FROM dbc.indices)
        AND (a.databasename, a.tablename) IN
            (SELECT databasename, tablename FROM dbc.tables WHERE tablekind = 'T')
ORDER BY a.databasename, a.tablename, a.columnname;

AWK File

BEGIN   { CUTPERCENTAGE=1;
          print ".errorlevel (3582) severity 0";
          print ".errorlevel (6956) severity 0";
          print ".errorlevel (5625) severity 0";
          print ".errorlevel (3933) severity 0";
        }
/            COLUMN \(/ { COL=$3; }
/                ON / { DBTAB=$2; }
/^ \/\*\* / { BIASEDON=0; }
/Data Type and Length/ { DATATYPE=substr($6,2,2); }
/NumOfRows/ { CUTROWS=$4*CUTPERCENTAGE/100;
              BIASED="";        # reset the value list (assignment, not comparison)
              if (0+CUTROWS<0+NULLROWS) BIASED="NULL,";
            }
/\/\* NumOfNulls/ { NULLROWS=$4; }
/^ \/\*\* Biased:/ { if (DATATYPE!="TS"&& DATATYPE!="AT"&& DATATYPE!="DS") BIASEDON=1;}
/^ \/\* / { if (BIASEDON==1)
                {
                if (CUTROWS < 0+gensub(".*,","","",gensub(",? ?$","","g")))
                        {
                        BIAS=gensub("^ */[^/]*/","","g",gensub(",[0-9 ]*,? ?$","","g"));
                        if (index (BIASED,BIAS)==0)
                                BIASED=BIASED BIAS ",";
                        }
                }
        }
/^COLLECT STATISTICS/   { COLSTATON=1; }
        {       if (COLSTATON==1) COLSTAT=COLSTAT "\n" $0; }

/^);/   { BIASEDON=0;
        COLSTATON=0;
        if (BIASED=="")
                {
                COLSTAT="";
                next;
                }
        if (DBTAB!=DBTABOLD)
                {
                if (DBTABOLD!="")
                        {
                        print DROPSTATS;
                        print ALTERTABLE ";";
                        print COLSTATALL;
                        COLSTATALL="";
                        }
                ALTERTABLE="ALTER TABLE " DBTAB " ADD " COL " COMPRESS (" gensub(",$","","",BIASED) ")";
                DROPSTATS="DROP STATISTICS COLUMN " COL " ON " DBTAB ";";
                DBTABOLD=DBTAB;
                }
        else
                {
                ALTERTABLE=ALTERTABLE "\n""        ,ADD " COL " COMPRESS (" gensub(",$","","",BIASED) ")";
                DROPSTATS=DROPSTATS "\n""DROP STATISTICS COLUMN " COL " ON " DBTAB ";";
                }
        COLSTATALL=COLSTATALL "\n" COLSTAT;
        COLSTAT="";
        BIASED="";
        }

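END     {
        # flush the statements collected for the last table
        if (DBTABOLD!="")
                {
                print DROPSTATS;
                print ALTERTABLE ";";
                print COLSTATALL;
                }
        }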

 

First Results and Motivation

As a Teradata customer we run an appliance instance with about 10 TB of user data. Within a few hours of running these scripts we reduced our space usage by 20%.

Unfortunately this is the only instance on which I can test the scripts at the moment, so further improvements and remarks are very welcome.

Last, but not least, thanks to Dieter Nöth (dnoeth) for the tips.
