The accuracy of sample statistics has been greatly improved over the last Teradata releases. So I usually try to use sample stats on most of the big tables and I found them to be reliable on many columns, not only the officially recommended unique or nearly-unique columns, e.g. on DATEs.
But there is a specific scenario when sample stats result in worst case optimizer plans:
The table is partitioned and there’s a dependency between the partitioning and the sampled column.
Background
The process used for sampling stats is not the same as SAMPLE in a SELECT, which is a truly random sample. It’s similar to a TOP, i.e. it simply starts scanning the first percents of the table. This would result in wrong data when the stats are on the partitioning column, so the optimizer is smart enough to recognize that and switches to scan the first percent of each partition.
Now consider following scenario:
CREATE MULTISET TABLE statstest(dt DATE, yearmonth INT) PRIMARY INDEX (dt) PARTITION BY RANGE_N(dt BETWEEN DATE '2010-01-01' AND DATE '2020-12-31' EACH INTERVAL '1' DAY); INSERT INTO statstest SELECT calendar_date ,EXTRACT(YEAR FROM calendar_date) * 100 +EXTRACT(MONTH FROM calendar_date) FROM sys_calendar.calendar WHERE calendar_date BETWEEN DATE '2010-01-01' AND CURRENT_DATE SAMPLE WITH REPLACEMENT 100000; COLLECT STATS USING SAMPLE 2 PERCENT COLUMN(yearmonth) ON statstest; SELECT MIN(yearmonth), MAX(yearmonth), COUNT(DISTINCT yearmonth) FROM statstest; 201001 201410 58 SHOW STATS VALUES COLUMN(yearmonth) ON stats test; ... /* MinVal */ 201001, /* MaxVal */ 201002, /* ModeVal */ 201001, ... /* NumOfDistinctVals */ 2,
Ops, there are 58 months but according to the stats there are only two.
Why?
monthyear is directly correlated to dt and because all rows within a table are sorted by partition (in this case by date) only the oldes dates are fetched.
Any query with a WHERE-condition based on yearmonth outside of the estimated range might result in a really bad plan. For equality the optimizer will assume the average number of rows (50,000) but a BETWEEN will result in estimated with high confidence to be 1 row. You can imagine the performance of such a plan.
Solution
To solve this problem switch to full stats whenever you need to collect stats an a dependent column.
Caution: In TD14.10 the optimizer might automatically switch to sample stats causing this problem to appear a few weeks delayed. In that case better force full stats using the NO SAMPLE option.