
K-means clustering and Teradata 14.10 table operators

Attachment: td_kmeans.zip (8.06 KB)

Table Operators

This article discusses how to implement a Teradata 14.10 table operator using K-means clustering as an example use case. 

A Teradata release 14.10 table operator is a type of function that can accept an arbitrary row format and, based on the operation and input row type, produce an arbitrary output row format. The name is derived from the concept of a database physical operator: a physical operator takes one or more data streams as input and produces an output data stream. Examples of physical operators are join, aggregate, scan, etc. The notion is that, from the perspective of the function implementer (programmer), a new physical operator can be implemented that has complete control of the Teradata parallel "step" processing. From the perspective of the function user (SQL writer), it is very analogous to the concept of a FROM-clause table function.

The table operator feature was originally motivated by requirements driven by in-database advanced analytics use cases. In addition, the table operator feature enables a simplified in-database MapReduce-style programming model.

A table operator can be system defined or user defined. Teradata release 14.10 introduces three new system defined table operators:

  • TD_UNPIVOT, which transforms columns into rows based on the syntax of the unpivot expression (see the example after this list).
  • CalcMatrix, which calculates a correlation, covariance, or sums of squares and cross products matrix.
  • LOAD_FROM_HCATALOG, which is used for accessing the Hadoop file system.
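
For example, a minimal TD_UNPIVOT invocation might look like the following sketch; the table and column names are illustrative, and the USING keys shown are the documented VALUE_COLUMNS, UNPIVOT_COLUMN, and COLUMN_LIST parameters:

    SELECT * FROM TD_UNPIVOT(
      ON sales_wide                       -- illustrative input table
      USING
        VALUE_COLUMNS('monthly_sales')    -- output value column
        UNPIVOT_COLUMN('sales_month')     -- output column naming the source column
        COLUMN_LIST('jan', 'feb', 'mar')  -- input columns to transform into rows
    ) AS t;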

The following sections describe implementing the k-means clustering algorithm using a C language table operator as an example. The algorithm will be discussed in terms of the MapReduce model because of its current popularity, but in reality it involves nothing more than a local and a global aggregation phase. Further, in the context of this article the following can be considered synonymous: observations ~ rows and variables ~ columns. References will be made to the source code, which is available in the attached zip file.


k-means Background

k-means clustering is an iterative algorithm whose goal is to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. See http://en.wikipedia.org/wiki/K-means_clustering for more information.

The algorithm is programmatically simple but computationally expensive. It could be expressed using existing SQL expressions, but the number of required expressions would be approximately K*V (where V is the number of variables). Furthermore, a product join would be required between the observation table and the cluster table. A pure SQL solution would therefore be limited to small values of K*V.
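
To make this concrete, here is a hedged sketch of a single assignment pass in plain SQL for just K=2 and V=2 (all table and column names are hypothetical). The CASE expression already contains K*V squared-difference terms and must grow with every additional cluster and variable:

    SELECT o.pkey,
           CASE WHEN (o.x1 - c1.x1)**2 + (o.x2 - c1.x2)**2
                  <= (o.x1 - c2.x1)**2 + (o.x2 - c2.x2)**2
                THEN 1 ELSE 2
           END AS clusterid
    FROM observations o
    CROSS JOIN (SELECT x1, x2 FROM clusters WHERE clusterid = 1) c1
    CROSS JOIN (SELECT x1, x2 FROM clusters WHERE clusterid = 2) c2;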

Fortunately k-means fits well with the Teradata shared-nothing parallelism and extensibility architecture. Parallelization is straightforward and is based on the commutative definition of the algorithm, which requires computing the per-AMP cluster mean subtotals and then computing the global cluster mean totals, grouping by the cluster identifier. The algorithm has the following characteristics:

  1. Map Phase: On each AMP, assign the observations to the nearest cluster based on distance. Distance can be based on one of several vector distance measures: Euclidean, cosine, etc.
  2. Reduce Phase: For each cluster, aggregate the local cluster summation and count subtotals to calculate the new cluster means.
  3. The memory requirements are (2 * K * V * sizeof(double)); see the worked figures after this list. With the default memory allocation limit of 32MB per invocation you can support a K value of approximately 100,000 with 20 variables. The default memory limit can be increased using cufconfig, but doing so requires an understanding of the system workload and current memory settings. The memory requirements are independent of the number of observations.
  4. Each iteration of the algorithm reads the current cluster definitions and observations and outputs the new cluster definitions.
    1. In most cases the observations will reside in memory (FSG cache) between iterations. See the DBQL data in the performance section for 100,000,000 rows on a single-node 1580.
    2. The cluster definitions will always reside in memory (FSG cache) since they are nominal in size; K=100,000 and V=20 would require 16,000,000 bytes of storage.
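
As a quick check of the figures quoted above:

    2 * K * V * sizeof(double) = 2 * 100,000 * 20 * 8 bytes = 32,000,000 bytes, roughly the 32MB default limit
        K * V * sizeof(double) =     100,000 * 20 * 8 bytes = 16,000,000 bytes for the cluster definitions alone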

The k-means process flow

The SPL stored procedure in the attached zip file implements the following process flow. The relevant "steps" are labeled within the SPL source code.

Step 1:

Prepare input observations: the input observation requirement is a table or view whose schema is a single big integer key column and N variable columns that can be cast to the FLOAT data type. If standardizing the variables through scaling is desired, it can be done as part of the observation data preparation. For example, the z-score, (value – mean) / standard deviation (see http://en.wikipedia.org/wiki/Standard_score), can be computed using a simple expression as follows.
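
(The original article embeds the SQL as an image; the following is a representative z-score scaling query, with illustrative table and column names:)

    SELECT pkey,
           (x1 - AVG(x1) OVER ()) / STDDEV_SAMP(x1) OVER () AS x1
    FROM raw_observations;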

Any null variable value will result in the entire observation being treated as a null value: logically the input row is treated as a vector, and if one element is null then the entire vector is null. Rows with a null value are assigned to cluster -1.

Step 2:

The user invokes the process by calling the procedure "run_kmeans". This is the SQL user interface, and the remaining steps 3-8 are all internal to the stored procedure. The stored procedure has seven input parameters and one output parameter as follows:

  1. VARCHAR(128):  Input table/view database name
  2. VARCHAR(128):  Input table name, 1 big integer key column and N numeric variable columns.
  3. VARCHAR(128):  Output cluster table name. This is a volatile temporary table created in the current default database spool space. This table contains the new cluster definitions and has K rows.
  4. VARCHAR(128):  Output data table name, or an empty string if not desired.  This is a volatile temporary table created in the current default database spool space.  This table contains the input table data with an appended cluster identifier column.
  5. INTEGER:  K value, the number of desired clusters (1-100,000)
  6. INTEGER:  A value specifying the iteration limit.  If the algorithm does not converge, then it will execute only this many iterations.
  7. FLOAT:  A numeric value specifying the convergence threshold. The algorithm will terminate when the cumulative difference between the new cluster definitions and the current cluster definitions is less than this value.
  8. VARCHAR(128):  Stored Procedure execution result status. This is the only output parameter. Note the procedure will create a logging volatile table named results.
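
For example, a call matching the performance test later in this article might look like the following; the 0.1 convergence threshold is an assumed value, and the remaining arguments mirror that test setup:

    CALL test1.run_kmeans('test2', 'p1', 'outc', '', 20, 5, 0.1, result_status);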

Step 3:

Create work table "kin" which contains observation values and cluster definitions.

Step 4:

Generate the initial K cluster definitions using the SAMPLE clause and store them in the initial cluster table, as sketched below.
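
A minimal sketch of this step for K=3 and four variables, assuming the observation table layout from step 1 (table and column names are illustrative):

    INSERT INTO initial_clusters
    SELECT ROW_NUMBER() OVER (ORDER BY pkey) AS clusterid,
           x1, x2, x3, x4
    FROM (SELECT pkey, x1, x2, x3, x4 FROM observations SAMPLE 3) AS s;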

Step 5:

Start of iteration

Delete the prior cluster definitions from the work table "kin"; on the first iteration there will not be any prior definitions.

Insert the cluster definitions into the work table "kin", duplicating them to all AMPs.

Step 6:

Create the output cluster table by invoking the td_kmeans table operator, ordering the data by cluster identifier. This implements the map phase.  The output of the table operator is aggregated using the SUM aggregate function. This implements the reduce phase.

The function td_kmeans will read the input work table rows, first accessing the duplicated cluster definition rows and then the observation rows. The td_kmeans table operator input row must contain three columns of the form:

  1. INTEGER:  Cluster identifier. For the cluster definitions rows the cluster number will be a value between 1 and K.  For the data variable rows the cluster number will be K+1
  2. BIG INTEGER:  Key value
  3. FLOAT: 1-N variables

The output row format of the function td_kmeans for phase = 1, generate clusters, will be of the form:

  1. INTEGER: Cluster identifier. The cluster number will be value between 1 and K.
  2. BIG INTEGER:  Cluster count, the per AMP subtotal of rows in this cluster.
  3. FLOAT: 1-N variables, the per AMP subtotal of variable summations

The output row format of the function td_kmeans for phase = 2, generate data, will be of the form:

  1. INTEGER:  Cluster identifier. The cluster number will be value between 1 and K.
  2. BIG INTEGER:  Key value
  3. FLOAT:  1-N variables

An example of the generated map/reduce SQL statement for 4 variables, K=3 and phase=1 follows.
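
(The original statement is embedded as an image; the following is a plausible reconstruction based on the notes below, with illustrative output column names. The division by the summed count turns the per-AMP subtotals into the new cluster means.)

    CREATE VOLATILE TABLE outc AS (
      SELECT clusterid,
             SUM(clustercount)           AS clustercount,
             SUM(v1) / SUM(clustercount) AS v1,
             SUM(v2) / SUM(clustercount) AS v2,
             SUM(v3) / SUM(clustercount) AS v3,
             SUM(v4) / SUM(clustercount) AS v4
      FROM td_kmeans(
             ON kin
             LOCAL ORDER BY clusterid
             USING K(3) phase(1)
           ) AS t
      GROUP BY clusterid
    ) WITH DATA
    ON COMMIT PRESERVE ROWS;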

Some items to note on the SQL syntax

  • The ON clause simply contains the name of the input work table "kin". This is syntactic sugar for "SELECT * FROM kin".
  • The LOCAL ORDER BY clause is used to order the data by cluster identifier; this results in the cluster definition rows being read prior to the observation rows.
  • There are two USING clause key-value pairs. The first key, "K", is the K value in the k-means process. The second key, "phase", determines whether the function will output the cluster definitions (1) or output the observations with the appended cluster identifier (2).
  • The reduce phase is implemented using the SQL SUM function rather than a REDUCE table operator. The main motivation was to show how you can intermix functionality. There are also some potential performance benefits, because the Teradata optimizer and execution engine have robust mechanisms to handle local/global processing in a parallel environment.

Step 7:

This is the final step in the iteration loop. Check for termination, either due to the iteration count being reached or the convergence threshold being met. If not terminating, continue iterating at step 5.

Step 8:

If the output table name string is not empty, then execute the td_kmeans table operator again with phase=2 to assign the observations to the final cluster definitions and save them in the output table.

Programming Details

Now that we have described the k-means process flow let's review the details of the map phase table operator td_kmeans.

Key data structures

  1. context *ctx: A C language structure that is used to simplify passing variables between subroutines (see the sketch after this list).
  2. ctx->Cluster: A two-dimensional array of doubles. It represents the current cluster definition values.  It contains a row for each cluster (K) and a column for each variable (V), plus an additional column which holds the cluster identifier.  If the cluster is NULL then the cluster identifier will be negative.
  3. ctx->NewCluster: A two-dimensional array of doubles. It represents the new cluster definition values.  It contains a row for each cluster (K) and a column for each variable (V) summation, plus an additional column which holds the cluster count. This structure is per AMP and therefore requires a reduce/aggregation step to create the final new cluster definitions.
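
A minimal sketch of what such a context might look like; the actual layout is in the attached source, and the field list here is only illustrative:

    typedef struct
    {
        FNC_TblOpHandle_t *Handle;     /* input stream handle                 */
        FNC_TblOpHandle_t *OutHandle;  /* output stream handle                */
        int     K;                     /* number of clusters                  */
        int     V;                     /* number of variables                 */
        int    *clusterid;             /* current row's cluster id column     */
        long   *pkey;                  /* current row's key column            */
        long    rowcount;              /* rows processed by this invocation   */
        double *Cluster;               /* K x (V+1): definitions + cluster id */
        double *NewCluster;            /* K x (V+1): per-AMP sums + count     */
    } context;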

Diagram of architecture components:

Key programming routines

  1. td_kmeans_contract(): implements the table operator contract function
  2. td_kmeans(): implements the table operator main body
  3. add_cluster(): adds the current cluster definition input row to the ctx->Cluster data structure
  4. add_row(): adds the current observation input row to the ctx->NewCluster data structure.
  5. SquaredEuclideanDistance(): called by add_row() to calculate the distance between the current observation row and one of the cluster definition elements in ctx->Cluster.
  6. output_cluster(): writes the ctx->NewCluster data structure to database spool.
  7. output_row(): called by add_row() to write an observation with an assigned cluster identifier to database spool.

Table Operator Contract Function

The contract function is called at query parse time and executes within the parser (disstart process). In the C language the entry point name is defined by the CREATE FUNCTION DDL subclause "RETURNS TABLE VARYING USING FUNCTION". It should be noted that the contract function is also executed during an EXPLAIN of a query statement containing a table operator.

The main purpose of the contract function is to define the return row format. Other capabilities are:

  1. Validate the number and data types of the input row columns.
  2. Check if a HASH BY and/or LOCAL ORDER BY has been specified, or define a required HASH BY / LOCAL ORDER BY clause.
  3. Check the correctness of the USING clause strings key-value pairs.
  4. Produce a data structure that can be passed to the actual operator invocation on all AMPs.

Note, all FNC strings are in the character set, Latin or Unicode, that was in effect for the session in which the table operator was created. This is the same behavior as existing UDFs. Also, the contract function behaves similarly to a scalar function with respect to setting SQL error codes and messages.

Let's take a closer look at the main tasks associated with setting the output row format. The steps are as follows, and a sketch of them appears after the structure definitions below:

  1. Determine the number of output columns; this is based on the specific requirements of the table operator. The limit is the same as the Teradata table column limit, currently 2,048.
  2. Allocate memory to hold the output column structure (FNC_TblOpColumnDef_t). The macro TblOpSIZECOLDEF can be used to determine the required allocation size.
  3. Initialize the allocated memory by calling the macro TblOpINITCOLDEF.
  4. For each output column, minimally set the column name and the appropriate data type attribute. Character columns also require setting the charset enum and size.length, decimal types require setting the size.range, and time types require setting the size.precision. The structures FNC_TblOpColumnDef_t and parm_tx are defined in the system include file sqltypes_td.h:
typedef struct
{
    int length;              // number of bytes representing column types
    int num_columns;         // number of columns in stream
    parm_tx column_types[1]; // an array of column types, one for each column
} FNC_TblOpColumnDef_t;

typedef struct
{
    dtype_et datatype;                      // data type of attribute (e.g., INTEGER_DT)
    CHARACTER column[FNC_MAXNAMELEN_EON];   // name of the column
    CHARACTER udt_type[FNC_MAXNAMELEN_EON]; // name of the UDT
    charset_et charset;                     // character set (e.g., UNICODE)
    period_en period_et;                    // granularity of period type
    int bytesize;                           // maximum size of fixed-length fields
    union {
        long length;       // length in CHAR/VARCHAR/BYTE types
        int intervalrange; // range for interval types
        int precision;     // precision for time/timestamp types
        struct {
            int totaldigit; // n in DECIMAL(n, m)
            int fracdigit;  // m in DECIMAL(n, m)
        } range;
    } size;
} parm_tx;
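
A sketch of steps 1-4 for an output row of the form (clusterid INTEGER, clustercount BIGINT, v1..vN FLOAT). INTEGER_DT appears in the header comments above; BIGINT_DT, REAL_DT, the nvars variable, and the FNC_TblOpSetOutputColDef registration call are my assumptions about the 14.10 interface, so verify them against sqltypes_td.h before relying on this:

    int ncols = 2 + nvars;                            /* step 1: output column count */
    int i;
    FNC_TblOpColumnDef_t *def;

    def = FNC_malloc(TblOpSIZECOLDEF(ncols));         /* step 2: allocate definition */
    TblOpINITCOLDEF(def, ncols);                      /* step 3: initialize it       */

    strcpy(def->column_types[0].column, "clusterid"); /* step 4: name and type each  */
    def->column_types[0].datatype = INTEGER_DT;
    strcpy(def->column_types[1].column, "clustercount");
    def->column_types[1].datatype = BIGINT_DT;        /* assumed enum name           */
    for (i = 2; i < ncols; i++)
    {
        sprintf(def->column_types[i].column, "v%d", i - 1);
        def->column_types[i].datatype = REAL_DT;      /* assumed enum for FLOAT      */
    }
    FNC_TblOpSetOutputColDef(0, def);                 /* assumed registration call   */
    FNC_free(def);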

Table Operator

The table operator is always executed on the AMP within a return step (stpret in DBQL). This implies that it can read from spool, a base table, a PPI partition, an index structure, etc., and will always write its output to spool. Some concepts related to the operator execution follow.

If a HASH BY and/or a LOCAL ORDER BY is specified, the input data will always be spooled to enforce the HASH BY geography and the LOCAL ORDER BY ordering within the AMP. HASH BY can be used to assign rows to an AMP, and LOCAL ORDER BY can be used to order the rows within an AMP. You can specify either or both of the clauses independently.

If a PARTITION BY and ORDER BY is specified, the input data will always be spooled to enforce the PARTITION BY grouping and the ORDER BY ordering within the partition. You can specify a PARTITION BY without an ORDER BY, but you cannot have an ORDER BY without a PARTITION BY. Further, the table operator will be called once for each partition, and the row iterator will only cover the rows within the partition. In summary, a partition is a logical concept and one or more partitions may be assigned to an AMP, the same behavior as ordered analytic partitions.

The USING clause values are modeled as key-value pairs and can be accessed using the "FNC_TblOpGetCustomKey…" FNC interfaces. You can define multiple key-value pairs, and a single key can have multiple values. The USING clause takes literal values and cannot contain any expressions, DML, etc. Further, the values are handled by the syntaxer in a manner similar to regular SQL literals; for example, {1, 1.0, '1'} are respectively passed to the table operator as BYTEINT, DECIMAL(2,1), and VARCHAR(1) CHARACTER SET UNICODE values. The typical method to access the USING clause literals is to first determine the number of keys specified using "FNC_TblOpGetCustomKeyCount()", then use "FNC_TblOpGetCustomKeyInfoOf(Key, &key)" to determine the metadata of a given key, and finally use "FNC_TblOpGetCustomValuesOf" to obtain the individual key values.
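
For example, in a hypothetical invocation like the following, the three literals arrive at the operator typed as described above:

    SELECT *
    FROM myop(                      -- hypothetical table operator
      ON src                        -- hypothetical input table
      USING a(1) b(1.0) c('1')      -- BYTEINT, DECIMAL(2,1), VARCHAR(1) UNICODE
    ) AS t;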

Contract Function Context: As noted, the contract function can create a context data structure that will be passed to all AMPs in the system. The context can be accessed using FNC_TblOpGetContractDef. The buffer into which the contract context will be copied needs to be allocated before calling FNC_TblOpGetContractDef; you can call FNC_TblOpGetContractLength to obtain the length of the context.
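
A minimal sketch of that retrieval pattern; only the function names come from the interface described above, and the argument order is my assumption:

    int len = FNC_TblOpGetContractLength();  /* size of the stored contract context */
    context *ctx = FNC_malloc(len);          /* allocate the buffer before the copy */
    FNC_TblOpGetContractDef(ctx, len);       /* copy the context into the buffer    */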

Reading Row Values

The model for reading rows is that of an iterator. Let's describe input row access using the main body of the k-means example code which is copied below.

    /* open input stream 0 for reading and output stream 0 for writing */
    ctx->Handle = (FNC_TblOpHandle_t *)FNC_TblOpOpen(0, 'r', 0);
    ctx->OutHandle = (FNC_TblOpHandle_t *)FNC_TblOpOpen(0, 'w', 0);

    while (FNC_TblOpRead(ctx->Handle) == TBLOP_SUCCESS)
    {
        /* Access clusterid column offset 0*/ 
        FNC_TblOpGetAttributeByNdx(
             ctx->Handle, 0, (void **)&ctx->clusterid, &null, &length);   
        /* Access key column offset 1 */
        FNC_TblOpGetAttributeByNdx(
             ctx->Handle, 1, (void **)&ctx->pkey, &null, &length);  

        if (*ctx->clusterid <= ctx->K) /* cluster definition rows*/
        {
            if (add_cluster(ctx) == ERROR) /* add row to cluster heap*/
                return;        
        }
        else  /* observations/data rows */
        {
            if (add_row(ctx) == ERROR)    /* add row to new cluster heap*/
               return;
        }
        ctx->rowcount++;
    }  

The first task is to open the input stream using FNC_TblOpOpen. This returns a handle to the stream to be used to read the values. The programmer then iterates over the rows by repeatedly calling FNC_TblOpRead, passing in the row handle; each call will return the next row in the stream. When there are no more rows, TBLOP_EOF will be returned. The iteration is the responsibility of the operator writer, and the table operator itself is called only once. This reduces per-row costs and enables more flexible read and write patterns. To access the column metadata (for example names, data types, etc.) the programmer calls FNC_TblOpGetColDef.

To access individual column values, FNC_TblOpGetAttributeXX is called, which will return a pointer to the value and a NULL indicator. If the programmer desires, they can also access the individual row column fields using the "Handle->row->columnptr[]" structure. Internally the data is passed to the operator using the client-side "indic data" format, which determines how the data types are represented. The table operator is invoked once, or in the case of PARTITION BY once per partition, on the AMP (always within a RET step). It therefore has complete control of the processing, and any allocated memory is persisted for the entire operator invocation. In addition, the table operator can close and reopen the input stream if it is required to rescan the input rows.

It should be noted that the programming model is the same in protected or NOT protected mode. In the case of protected mode, the database builds a buffer of rows in the shared memory communication area. The buffering reduces the inter-process communication overhead and greatly reduces the performance difference between protected and not protected modes.

Writing Row Values

The first task is to open the output stream using FNC_TblOpOpen. This returns a handle to the stream to be used to write the values.  To access the column metadata (for example names, data types, etc.) the programmer calls FNC_TblOpGetColDef. To set individual column values, FNC_TblOpBindAttributeByNdx is called, which will bind the address of the output column value or set a NULL indicator. The bound memory must remain stable until FNC_TblOpWrite is called. If the programmer desires, they can also set the individual row column fields using the "Handle->row->columnptr[]" structure.  Rows are written to database spool using FNC_TblOpWrite.
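
Putting those calls together, the write path might look like the following sketch; the argument order of FNC_TblOpBindAttributeByNdx (handle, column index, value address, null indicator) and the K and counts[] variables are assumptions for illustration:

    int clusterid;        /* output column 0 */
    long clustercount;    /* output column 1 */
    FNC_TblOpHandle_t *out;
    int k;

    out = (FNC_TblOpHandle_t *)FNC_TblOpOpen(0, 'w', 0);  /* open output stream 0 */

    /* bind the output column addresses once; they must stay stable until written */
    FNC_TblOpBindAttributeByNdx(out, 0, &clusterid, 0);
    FNC_TblOpBindAttributeByNdx(out, 1, &clustercount, 0);

    for (k = 0; k < K; k++)
    {
        clusterid = k + 1;
        clustercount = counts[k];
        FNC_TblOpWrite(out);                               /* emit one row to spool */
    }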

Performance

The following example shows how to create and execute the k-means process. The input table is a 100,000,000 row table with 10 variables. The system utilized was a single node 1580 with 8 physical cores, 36 AMPs and 49GB of memory.

Create three users:

test1: owner of the procedure and table operator

test2: owner of the input table

test3: executes the process

As user test1 create the table operator and stored procedure.  The stored procedure uses a volatile table named results to hold logging information.

As user test2, create a 100,000,000 row input table with a key column and 10 variable columns, as sketched below.
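
A plausible shape for that table, following the observation-table requirements from step 1 (the column names are illustrative):

    CREATE TABLE test2.p1 (
      pkey BIGINT NOT NULL,
      x1 FLOAT, x2 FLOAT, x3 FLOAT, x4 FLOAT, x5  FLOAT,
      x6 FLOAT, x7 FLOAT, x8 FLOAT, x9 FLOAT, x10 FLOAT
    ) PRIMARY INDEX (pkey);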

As user test3 execute the k-means stored procedure with input database "test2", input table "p1",  output cluster table "outc", no output data table and K=20 with a maximum of 5 iterations.

DBQL output for one iteration

Elapsed (min:sec)   Step   Step Name   CPU Secs   Row Count     Logical IO   Logical KB    Physical IO   Physical KB   Cache Eff %
00:00.00            1      Ctb         0          36            180          8,192         0             0             100
00:14.72            2      RET         219.3      100,000,720   964,465      100,707,308   9,414         1,883,740     98
00:23.50            3      RET         354.8      720           189,496      23,708,500    972           31,104        100
00:00.02            4      SUM         0.04       20            735          20,706        21            672           97
00:00.02            5      RET         0.01       20            717          15,405        0             0             100
00:00.03            6      MRG         0.02       20            588          19,764        0             0             100
00:00.02            7      Edt         0          1             13           0             0             0             0

Details of key DBQL step records:

Step 2: implements the LOCAL ORDER BY clause, the output row count is based on 100,000,000 observations plus 720 cluster rows (20 clusters  * 36 AMPs).

Step 3: invokes the map phase table operator, the output row count is based on 720 cluster rows (20 clusters  * 36 AMPs)

Step 4: implements the reduce phase aggregation, the output row count is based on K=20 cluster rows

As mentioned in the overview section the FSG cache efficiency, the last column in the DBQL table, is very high for each iteration of the algorithm. This enables the process to operate in memory.

 

 
