When sample statistics go wild

The accuracy of sample statistics has improved greatly over the last few Teradata releases. I therefore try to use sample stats on most big tables, and I have found them reliable on many columns, e.g. on DATEs, not only on the officially recommended unique or nearly-unique columns.

But there is one specific scenario in which sample stats result in worst-case optimizer plans:

The table is partitioned and there’s a dependency between the partitioning and the sampled column.

Background

The process used for sampling stats is not the same as SAMPLE in a SELECT, which is a truly random sample. It’s similar to a TOP, i.e. it simply starts scanning the first few percent of the table. This would result in wrong data when the stats are on the partitioning column, so the optimizer is smart enough to recognize that case and switches to scanning the first percent of each partition.

Now consider the following scenario:

CREATE MULTISET TABLE statstest(dt DATE, yearmonth INT)
PRIMARY INDEX (dt) 
PARTITION BY RANGE_N(dt BETWEEN DATE '2010-01-01' AND DATE '2020-12-31' EACH INTERVAL '1' DAY);

INSERT INTO statstest 
SELECT 
   calendar_date
  ,EXTRACT(YEAR FROM calendar_date) * 100
  +EXTRACT(MONTH FROM calendar_date)
FROM sys_calendar.calendar 
WHERE calendar_date BETWEEN DATE '2010-01-01' AND CURRENT_DATE
SAMPLE WITH REPLACEMENT 100000;

COLLECT STATS USING SAMPLE 2 PERCENT COLUMN(yearmonth) ON statstest;

SELECT MIN(yearmonth), MAX(yearmonth), COUNT(DISTINCT yearmonth) FROM statstest;
201001    201410    58

SHOW STATS VALUES COLUMN(yearmonth) ON statstest;
...
 /* MinVal                */ 201001, 
 /* MaxVal                */ 201002, 
 /* ModeVal               */ 201001, 
...
 /* NumOfDistinctVals     */ 2,

Oops, there are 58 months, but according to the stats there are only two.

Why?

yearmonth is directly correlated to dt, and because all rows within a table are sorted by partition (in this case by date), only the oldest dates are fetched.
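The effect can be reproduced by hand on the test table. The following is a minimal sketch, not a view into what COLLECT STATS does internally: the first query assumes that the sample effectively reaches only the leading partitions (the cut-off of 59 partitions, i.e. 2010-01-01 through 2010-02-28, is picked purely for illustration), while the second takes a truly random 2 percent sample via SAMPLE in a derived table.

```sql
-- Assumption: the sample only reaches the leading partitions.
-- Partitions 1 to 59 cover 2010-01-01 through 2010-02-28 in this table,
-- so only the yearmonth values 201001 and 201002 can show up here.
SELECT MIN(yearmonth), MAX(yearmonth), COUNT(DISTINCT yearmonth)
FROM statstest
WHERE PARTITION <= 59;

-- For comparison: a truly random 2 percent sample of the same table.
SELECT MIN(yearmonth), MAX(yearmonth), COUNT(DISTINCT yearmonth)
FROM (SELECT yearmonth FROM statstest SAMPLE 0.02) AS random_sample;
```

The random sample should see values spread over the whole date range, whereas the partition-limited query cannot.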

Any query with a WHERE condition on yearmonth outside of the estimated range might result in a really bad plan. For an equality condition the optimizer will assume the average number of rows per value (50,000, i.e. 100,000 rows divided by 2 distinct values), but a BETWEEN will be estimated, with high confidence, to return 1 row. You can imagine the performance of such a plan.
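One way to see this is to look at the row estimates in the EXPLAIN output. The predicate values below are arbitrary picks that exist in the table but lie outside the MinVal/MaxVal range recorded in the sampled stats; the exact wording of the EXPLAIN text depends on the release.

```sql
-- Equality on a month outside the sampled range 201001 to 201002
EXPLAIN
SELECT * FROM statstest WHERE yearmonth = 201406;

-- Range entirely outside the sampled range
EXPLAIN
SELECT * FROM statstest WHERE yearmonth BETWEEN 201401 AND 201406;
```

Comparing the estimated row counts with a SELECT COUNT(*) for the same predicates shows how far off the plan is.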

Solution

To solve this problem, switch to full stats whenever you need to collect stats on a dependent column.

Caution: In TD14.10 the optimizer might automatically switch to sample stats, causing this problem to appear a few weeks later. In that case it is better to force full stats using the NO SAMPLE option.
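A minimal sketch of the fix on the test table, assuming the TD14.10 USING syntax for the NO SAMPLE option mentioned above:

```sql
-- Recollect with a full scan instead of a sample
COLLECT STATISTICS USING NO SAMPLE COLUMN (yearmonth) ON statstest;

-- Check that MinVal, MaxVal and NumOfDistinctVals now cover all months
SHOW STATISTICS VALUES COLUMN (yearmonth) ON statstest;
```

After the recollection, NumOfDistinctVals should match the COUNT(DISTINCT yearmonth) returned by the query on the table itself.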
