What are statistics and why do we need them?

Statistics are another area of SQL that seem to be fairly poorly understood, on what they are, how they are used and how they are maintained..

Statistics store information about the distribution of data in the columns. The information includes the density of the key columns which gives a measure of the uniqueness of the index, and a histogram that stores information on how the distribution of values within the column.

Creating statistics

When an index is created, statistics will automatically be created on the index columns. The stats on an index cannot be dropped while the index exists. If the index is a multi-column one, the density is stored for all columns, but the histogram is only stored for the leading column.

It’s also possible to have statistics that aren’t associated with indexes. These are created automatically by the query optimiser (providing auto_create_statistics is enabled) or they can be created manually with the CREATE STATISTICS statement.

It’s not usually necessary to create statistics manually as the optimiser usually knows best what stats are required.

Updating Statistics

SQL keeps a count of the number of changes to a column and will update the statistics when it deems sufficient changes to have occurred (providing auto_update_stats is enabled).The stats are not updates as part of a data modification. Rather the next select that requires those statistics will trigger the update. In most cases, the auto-update doesn’t cause problems. It runs in the read-uncommitted isolation level and hence won’t cause blocking and SQL will, by default, only sample a portion of the table to calculate the stats.

For larger tables, that threshold is 20% of the rows in the table. Sometimes, especially on ever-increasing columns (identity, date inserted), on larger tables that threshold is too high and the statistics can get very inaccurate before the auto update kicks in. In cases like that it’s sometimes necessary to manually update the statistics on a regular basis using the UPDATE STATISTICS statement.

The other problem with the auto update is that sometimes it’s not accurate enough. In cases where large tables have an uneven distribution of values in the column, the sampled update may miss values or may miscalculate the distribution. In cases like this, it may be necessary to manually update the statistics with a full scan, forcing SQL to evaluate the entire table. That option should be used with care and only where necessary as it may have a negative impact on the performance of the entire system

Finally, when indexes are rebuilt (not reorganised) the statistics on those indexes are updated with a full scan.

How are stats used

Statistics are used by the query optimiser to estimate how many rows each operator in the query plan will operate on. The row count affects the cost of the various operators, some work well on small rowsets, others work well on large row sets. With a good estimate of the row count, the optimiser can get a fairly accurate calculation of the cost of each operator and hence can work out what the cheapest plan is for the particular query.

Without accurate statistics, the optimiser is likely to miscalculate the cost of various query operators and hence miscalculate the cost of the entire plan. This can lead to very poorly performing queries

In a future post I’ll discuss how to identify out of date or inaccurate statistics.

10 Responses to “What are statistics and why do we need them?”

  1. Great overview, thanks.

    TPowell

  2. Good Article!!!
    Thanks

  3. good for having a general overview

  4. Great article very useful.

    Can i Get this document how to identify out of date or inaccurate statistics.

  5. great article and very helpful for knowledge about the processing on How to build a execute plan.

    Thank you for your selfness share.

  6. Good article, but the 20% is not accurate. SQL Server does not update stats if 20% rows of a table are modified. Its more complex than that.

  7. Unless you have the new adaptive stats threshold traceflag on, the threshold for triggering a stats update is when 20% + 500 changes have occurred to the column that the stats object is based off. That’s when there’s more than 500 rows in the table. Under 500 rows there’s just a couple hard-coded thresholds

  8. Gail, would it be a huge problem for you to update these articles with links to the next one once the next one is out? It would be very handy for such lazy people like I am :) Thanks in advance. Of course, I don’t have to tell you that I also find your blog very, very informative and much needed? Thanks for sharing your knowledge. Impressive.

  9. This one doesn’t really have a sequel, I didn’t ever get around to writing it.

  10. Good Article

Leave a Reply