Statistics are another area of SQL that seems to be fairly poorly understood: what they are, how they are used and how they are maintained.
Statistics store information about the distribution of data in a column or set of columns. That information includes the density of the key columns, which gives a measure of the uniqueness of the index, and a histogram that describes the distribution of values within the column.
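The density vector and histogram of an existing statistics object can be inspected with DBCC SHOW_STATISTICS. As a sketch (the table and index names here are hypothetical):

```sql
-- Show the header, density vector and histogram for the statistics
-- behind a hypothetical index on a hypothetical Orders table
DBCC SHOW_STATISTICS ('dbo.Orders', 'idx_Orders_OrderDate');

-- Return just the histogram
DBCC SHOW_STATISTICS ('dbo.Orders', 'idx_Orders_OrderDate') WITH HISTOGRAM;
```

The histogram output shows, for each step, the key value at the top of the step and the estimated number of rows equal to and below that value.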
When an index is created, statistics will automatically be created on the index columns. The stats on an index cannot be dropped while the index exists. If the index is a multi-column one, density information is stored for each left-based prefix of the key columns, but the histogram is only stored for the leading column.
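To see this in action, create an index and then look at the statistics objects on the table; the stats object created for the index shares the index's name. A quick sketch, again with hypothetical names:

```sql
-- Creating an index automatically creates a statistics object
-- with the same name as the index
CREATE INDEX idx_Orders_Customer ON dbo.Orders (CustomerID, OrderDate);

-- List the statistics on the table and see where each came from
SELECT s.name, s.auto_created, s.user_created
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo.Orders');
```

Index statistics show 0 for both auto_created and user_created; optimiser-created column stats show auto_created = 1.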
It’s also possible to have statistics that aren’t associated with indexes. These are created automatically by the query optimiser (providing auto_create_statistics is enabled) or they can be created manually with the CREATE STATISTICS statement.
It’s not usually necessary to create statistics manually as the optimiser usually knows best what stats are required.
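For the occasions where a manual statistics object is wanted, the syntax is straightforward. A sketch with hypothetical names (the optimiser's auto-created column stats, by contrast, get system-generated names starting with _WA_Sys_):

```sql
-- Manually create column statistics on a column that has no index
CREATE STATISTICS st_Orders_ShipCity ON dbo.Orders (ShipCity);
```

This gives the optimiser density and histogram information for ShipCity without the overhead of maintaining an index on it.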
SQL keeps a count of the number of changes to each column and will update the statistics once it deems that sufficient changes have occurred (providing auto_update_statistics is enabled). The stats are not updated as part of the data modification itself; rather, the next select that requires those statistics will trigger the update. In most cases the auto-update doesn’t cause problems. It runs at the read-uncommitted isolation level and hence won’t cause blocking, and by default SQL will only sample a portion of the table to calculate the stats.
For larger tables, that threshold is 20% of the rows in the table. Sometimes, especially on ever-increasing columns (identity, date inserted) in larger tables, that threshold is too high and the statistics can become very inaccurate before the auto-update kicks in. In cases like that it’s sometimes necessary to update the statistics manually on a regular basis using the UPDATE STATISTICS statement.
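A scheduled manual update might look like the following (table and index names hypothetical; with no options specified, SQL uses the default sampling rate):

```sql
-- Update all statistics on the table with the default sample
UPDATE STATISTICS dbo.Orders;

-- Or target a single statistics object, such as the one behind an index
UPDATE STATISTICS dbo.Orders idx_Orders_OrderDate;
```

Running something like this nightly on tables with ever-increasing key columns keeps the histogram's upper steps close to the actual maximum values.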
The other problem with the auto-update is that it is sometimes not accurate enough. Where large tables have an uneven distribution of values in the column, the sampled update may miss values or miscalculate the distribution. In cases like this it may be necessary to update the statistics manually with a full scan, forcing SQL to evaluate the entire table. That option should be used with care, and only where necessary, as it may have a negative impact on the performance of the entire system.
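The full-scan update, and the middle ground of an explicit sample size, look like this (names hypothetical):

```sql
-- Read every row in the table when building the statistics
UPDATE STATISTICS dbo.Orders idx_Orders_OrderDate WITH FULLSCAN;

-- A compromise: sample more than the default, but less than everything
UPDATE STATISTICS dbo.Orders idx_Orders_OrderDate WITH SAMPLE 50 PERCENT;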
Finally, when indexes are rebuilt (not reorganised) the statistics on those indexes are updated with a full scan.
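That distinction is worth remembering when planning index maintenance: a rebuild gives fullscan-quality statistics for free, a reorganise leaves the stats untouched. With hypothetical names:

```sql
-- Rebuild: the index's statistics are updated with the equivalent of a full scan
ALTER INDEX idx_Orders_OrderDate ON dbo.Orders REBUILD;

-- Reorganise: the index is defragmented but the statistics are not touched
ALTER INDEX idx_Orders_OrderDate ON dbo.Orders REORGANIZE;
```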
How are stats used
Statistics are used by the query optimiser to estimate how many rows each operator in the query plan will operate on. The row count affects the cost of the various operators; some work well on small row sets, others work well on large row sets. With a good estimate of the row count, the optimiser can calculate the cost of each operator fairly accurately and hence work out the cheapest plan for the particular query.
Without accurate statistics, the optimiser is likely to miscalculate the cost of the various query operators and hence the cost of the entire plan. This can lead to very poorly performing queries.
In a future post I’ll discuss how to identify out of date or inaccurate statistics.