Column compress values from statistics

Get multi value compression out of column statistics for free

Besides collecting statistics on the columns of your Teradata database, compressing the data to save disk space is an important maintenance task. So why not connect the two? The idea is to extract the values for the multi-value compression of a column from the statistics that have already been collected on it.

The idea

Starting with Teradata 14, the statement "SHOW STATISTICS VALUES COLUMN col ON db.tab;" prints the detailed results of the last statistics collection as text (optionally as XML). The text output is exactly the statement needed to insert those results back into the database. The statement prints many lines; the following ones are relevant for the algorithm:

...

/* NumOfNulls */ 20,
...

/* NumOfRows */ 3180,
...

/** Biased: Value, Frequency **/
/* 1 */ 'N', 3147,
/* 2 */ 'Y', 13
...

In particular, the biased values block lists the values that occur most frequently in the column. These values are good candidates for compressing the column.
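As a minimal sketch of how these lines can be picked apart, the following Python snippet parses the excerpt shown above. It is written in Python rather than the awk used later in this article purely for illustration; the function name `parse_show_statistics` and the `SAMPLE` text are assumptions, and real output contains many more lines than this excerpt.

```python
import re

# Shortened sample, assembled from the excerpt shown in this article.
SAMPLE = """\
 /* NumOfNulls */ 20,
 /* NumOfRows */ 3180,
 /** Biased: Value, Frequency **/
 /* 1 */ 'N', 3147,
 /* 2 */ 'Y', 13
);
"""

def parse_show_statistics(text):
    """Return (num_nulls, num_rows, [(value_literal, frequency), ...])."""
    num_nulls = num_rows = 0
    biased = []
    in_biased = False
    for raw in text.splitlines():
        line = raw.strip()
        counter = re.match(r"/\* (NumOfNulls|NumOfRows)\s*\*/\s*(\d+)", line)
        if counter:
            if counter.group(1) == "NumOfNulls":
                num_nulls = int(counter.group(2))
            else:
                num_rows = int(counter.group(2))
            continue
        if line.startswith("/**"):
            # Any "/** ... **/" header starts a new block; only "Biased" is kept.
            in_biased = line.startswith("/** Biased:")
            continue
        if in_biased:
            entry = re.match(r"/\*\s*\d+\s*\*/\s*(.*?),\s*(\d+)\s*,?$", line)
            if entry:
                biased.append((entry.group(1), int(entry.group(2))))
            elif line.startswith(");"):
                in_biased = False
    return num_nulls, num_rows, biased

print(parse_show_statistics(SAMPLE))
```

For the sample, the function returns 20 nulls, 3180 rows and the two biased entries `'N'` (3147) and `'Y'` (13). The value literals are kept exactly as printed, including their quotes, so they can be reused unchanged in a COMPRESS list.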

For this approach, the column has to meet the following requirements:

The statistics have to be representative and up to date, but they may be sampled
The column must not be part of an index or of the partitioning expression
The statistics values must have the correct (untruncated) length
No statistics may exist on the column at the moment it is altered

In Teradata 14, all statistics values are limited to 26 characters. To get the untruncated values, you have to use the "USING MAXVALUELENGTH" clause in the COLLECT STATISTICS command.

The last requirement complicates the algorithm more: you cannot alter a column while statistics exist on it.

The advantages are:

No extra cost for getting the compression values
Good compression results with a simple algorithm

This simple solution, kept short enough to fit on one page, has some disadvantages:

The procedure doesn't take an existing compress value list into account
The algorithm doesn't handle multi-column statistics

The algorithm

First, for each column of the table to compress that has statistics, we execute "SHOW STATISTICS VALUES COLUMN". From this output we take the number of nulls and the values of the biased values block. Based on the number of occurrences, we decide which values go into the multi-value compress list. At the moment, each value has to account for an estimated share of more than 1% of the rows. With this limit we can never end up with more than 100 compress values. In parallel, we create a "DROP STATISTICS" statement and keep the "COLLECT STATISTICS COLUMN ... ON ... VALUES (...);" statement to put the statistics back. With these three sets of statements we first drop the statistics, then perform the ALTER TABLE statement, and finally put the statistics back.
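The selection rule can be sketched in a few lines of Python. This is an illustration only, not the awk script from this article: the function name `choose_compress_values` and its parameters are assumptions, and the input numbers are the ones from the sample output shown earlier.

```python
def choose_compress_values(num_rows, num_nulls, biased, cut_percentage=1.0):
    """Pick the literals whose frequency exceeds cut_percentage of all rows.

    biased is a list of (value_literal, frequency) pairs taken from the
    biased values block; NULL is added when the null count passes the cut.
    """
    cut_rows = num_rows * cut_percentage / 100.0
    chosen = []
    if num_nulls > cut_rows:
        chosen.append("NULL")
    for literal, frequency in biased:
        # Skip duplicates so that a literal is listed only once.
        if frequency > cut_rows and literal not in chosen:
            chosen.append(literal)
    return chosen

# 1% of 3180 rows is 31.8 rows: only 'N' (3147 rows) passes the cut,
# while 'Y' (13 rows) and NULL (20 rows) stay below it.
print(choose_compress_values(3180, 20, [("'N'", 3147), ("'Y'", 13)]))
```

Because every chosen value must cover more than 1% of the rows, the list stays below 100 entries without any explicit check.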

The process

The implementation consists of an SQL file and an awk script. The SQL file generates the "SHOW STATISTICS VALUES COLUMN" statements for the columns of the tables in a useful order:

SHOW STATISTICS VALUES COLUMN col1 on dbtest.tab1;
SHOW STATISTICS VALUES COLUMN col2 on dbtest.tab1;
SHOW STATISTICS VALUES COLUMN col3 on dbtest.tab1;
SHOW STATISTICS VALUES COLUMN col1 on dbtest.tab2;
SHOW STATISTICS VALUES COLUMN col2 on dbtest.tab2;

These statements have to be executed by bteq and their output stored in one file. The awk script takes this file and produces a larger file:

DROP STATISTICS column col1 on dbtest.tab1;
DROP STATISTICS column col2 on dbtest.tab1;
DROP STATISTICS column col3 on dbtest.tab1;
ALTER TABLE dbtest.tab1 add col1 compress ( ...)
, add col2 compress ( ...)
, add col3 compress ( ...)
;
COLLECT STATISTICS COLUMN ( col1 ) ON dbtest.tab1 VALUES (...);
COLLECT STATISTICS COLUMN ( col2 ) ON dbtest.tab1 VALUES (...);
COLLECT STATISTICS COLUMN ( col3 ) ON dbtest.tab1 VALUES (...);


DROP STATISTICS column col1 on dbtest.tab2;
DROP STATISTICS column col2 on dbtest.tab2;
ALTER TABLE dbtest.tab2 add col1 compress ( ...)
, add col2 compress ( ...)
;
COLLECT STATISTICS COLUMN ( col1 ) ON dbtest.tab2 VALUES (...);
COLLECT STATISTICS COLUMN ( col2 ) ON dbtest.tab2 VALUES (...);

Executing these statements performs the compression. Finished.
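The ordering of the generated statements for one table can be sketched as follows. This is a Python illustration under assumptions, not the awk script itself: `build_table_script` is a hypothetical helper, and the COLLECT STATISTICS strings are placeholders that keep the "(...)" elision used in the listing above.

```python
def build_table_script(dbtab, columns):
    """Assemble drop, alter and re-collect statements for one table.

    columns is a list of (column_name, compress_literals, collect_stmt),
    where collect_stmt is the saved SHOW STATISTICS output for that column.
    """
    drops = [f"DROP STATISTICS COLUMN {col} ON {dbtab};" for col, _, _ in columns]
    adds = [f"ADD {col} COMPRESS ({','.join(vals)})" for col, vals, _ in columns]
    # One ALTER TABLE per table, so all columns are changed in a single pass.
    alter = f"ALTER TABLE {dbtab} " + "\n        ,".join(adds) + "\n;"
    collects = [stmt for _, _, stmt in columns]
    return "\n".join(drops + [alter] + collects)

script = build_table_script(
    "dbtest.tab1",
    [
        ("col1", ["'N'"], "COLLECT STATISTICS COLUMN ( col1 ) ON dbtest.tab1 VALUES (...);"),
        ("col2", ["NULL", "0"], "COLLECT STATISTICS COLUMN ( col2 ) ON dbtest.tab1 VALUES (...);"),
    ],
)
print(script)
```

Grouping all ADD clauses into one ALTER TABLE mirrors the generated file shown above: the statistics are dropped first, the table is altered once, and the saved statistics are inserted back afterwards.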

The source code

SQL File

SELECT
        'SHOW STATISTICS VALUES COLUMN '||(trim(both from a.columnname))||' on '||(trim(both from a.databasename))||'.'||(trim(both from a.tablename))||';' as stmt
FROM
        dbc.ColumnStatsV a
INNER JOIN
        dbc.columns b
ON
        a.databasename=b.databasename
        AND a.tablename=b.tablename
        AND a.columnname=b.columnname
LEFT OUTER JOIN
        dbc.PartitioningConstraintsV c
ON
        a.databasename=c.databasename
        AND a.tablename=c.tablename
        AND upper(c.constrainttext) LIKE '%'||(upper(a.columnname))||'%'
WHERE
        c.constrainttext is null
        AND a.indexnumber is null
        AND a.databasename='${DB}'
        AND (a.databasename,a.tablename,a.columnname) not in (select databasename,tablename,columnname from dbc.indices)
        AND (a.databasename,a.tablename) in (select databasename,tablename from dbc.tables where tablekind='T')
ORDER BY a.databasename,a.tablename,a.columnname;

AWK File

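# Requires GNU awk (gawk) because of gensub().
# CUTPERCENTAGE: a value must cover more than this percentage of the rows to be compressed.
BEGIN   { CUTPERCENTAGE=1;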
          print ".errorlevel (3582) severity 0";
          print ".errorlevel (6956) severity 0";
          print ".errorlevel (5625) severity 0";
          print ".errorlevel (3933) severity 0";
        }
/            COLUMN \(/ { COL=$3; }
/                ON / { DBTAB=$2; }
/^ \/\*\* / { BIASEDON=0; }
/Data Type and Length/ { DATATYPE=substr($6,2,2); }
/NumOfRows/ { CUTROWS=$4*CUTPERCENTAGE/100;
              BIASED="";
              if (0+CUTROWS<0+NULLROWS) BIASED="NULL,";
            }
/\/\* NumOfNulls/ { NULLROWS=$4; }
/^ \/\*\* Biased:/ { if (DATATYPE!="TS"&& DATATYPE!="AT"&& DATATYPE!="DS") BIASEDON=1;}
/^ \/\* / { if (BIASEDON==1)
                {
                if (CUTROWS < 0+gensub(".*,","","",gensub(",? ?$","","g")))
                        {
                        BIAS=gensub("^ */[^/]*/","","g",gensub(",[0-9 ]*,? ?$","","g"));
                        if (index (BIASED,BIAS)==0)
                                BIASED=BIASED BIAS ",";
                        }
                }
        }
/^COLLECT STATISTICS/   { COLSTATON=1; }
        {       if (COLSTATON==1) COLSTAT=COLSTAT "\n" $0; }

/^);/   { BIASEDON=0;
        COLSTATON=0;
        if (BIASED=="")
                {
                COLSTAT="";
                next;
                }
        if (DBTAB!=DBTABOLD)
                {
                if (DBTABOLD!="")
                        {
                        print DROPSTATS;
                        print ALTERTABLE ";";
                        print COLSTATALL;
                        COLSTATALL="";
                        }
                ALTERTABLE="ALTER TABLE " DBTAB " ADD " COL " COMPRESS (" gensub(",$","","",BIASED) ")";
                DROPSTATS="DROP STATISTICS COLUMN " COL " ON " DBTAB ";";
                DBTABOLD=DBTAB;
                }
        else
                {
                ALTERTABLE=ALTERTABLE "\n""        ,ADD " COL " COMPRESS (" gensub(",$","","",BIASED) ")";
                DROPSTATS=DROPSTATS "\n""DROP STATISTICS COLUMN " COL " ON " DBTAB ";";
                }
        COLSTATALL=COLSTATALL "\n" COLSTAT;
        COLSTAT="";
        BIASED="";
        }

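END     {
        # flush the statements collected for the last table
        if (DBTABOLD!="")
                {
                print DROPSTATS;
                print ALTERTABLE ";";
                print COLSTATALL;
                }
        }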

First Results and Motivation

As a Teradata customer we run an appliance instance with about 10 TB of user data. Within a few hours of running these scripts we reduced our space usage by 20%.

Unfortunately, this is the only instance on which I can test the scripts at the moment, so further improvements and remarks are very welcome.

Last, but not least, thanks to Dieter Nöth (dnoeth) for the tips.
