What are statistics and why do we need them?

Statistics are another area of SQL that seem to be fairly poorly understood, on what they are, how they are used and how they are maintained..

Statistics store information about the distribution of data in the columns. The information includes the density of the key columns which gives a measure of the uniqueness of the index, and a histogram that stores information on how the distribution of values within the column.

Creating statistics

When an index is created, statistics will automatically be created on the index columns. The stats on an index cannot be dropped while the index exists. If the index is a multi-column one, the density is stored for all columns, but the histogram is only stored for the leading column.

It’s also possible to have statistics that aren’t associated with indexes. These are created automatically by the query optimiser (providing auto_create_statistics is enabled) or they can be created manually with the CREATE STATISTICS statement.

It’s not usually necessary to create statistics manually as the optimiser usually knows best what stats are required.

Updating Statistics

SQL keeps a count of the number of changes to a column and will update the statistics when it deems sufficient changes to have occurred (providing auto_update_stats is enabled).The stats are not updates as part of a data modification. Rather the next select that requires those statistics will trigger the update. In most cases, the auto-update doesn’t cause problems. It runs in the read-uncommitted isolation level and hence won’t cause blocking and SQL will, by default, only sample a portion of the table to calculate the stats.


Deadlock Graphs

A deadlock can appear to be a very obtuse error, as the only advice offered with the error message is an instruction to ‘rerun the transaction’, and there’s no apparent way to find the cause of the deadlock.

There are in fact several ways to get information on what caused the deadlock and on what resources the deadlock occurred. This information is generally referred to as the ‘deadlock graph’

Traceflag 1204

This is the oldest of the ways to get at the deadlock graph and it’s the main way used on SQL 2000. If traceflag 1204 is enabled server-wide, either by adding it as a startup parameter or by using the TraceOn command (DBCC TRACEON(1204,-1)), then anytime the SQL deadlock detector detects and resolves a deadlock, the information on that is written into the SQL Server error log. It looks something like this:

[source:XML] Deadlock encountered …. Printing deadlock information
Wait-for graph


OBJECT: 12:629577281:0         CleanCnt:3 Mode:S Flags: 0x1
Grant List 2:
Owner:0x000000008673B900 Mode: S        Flg:0x40 Ref:2 Life:02000000 SPID:52 ECID:0 XactLockInfo: 0x0000000085F64ED0
SPID: 52 ECID: 0 Statement Type: UPDATE Line #: 11
Input Buf: Language Event: exec CauseDeadlock
Requested by:
ResType:LockOwner Stype:’OR’Xdes:0x0000000085F65740 Mode: IX SPID:53 BatchID:0 ECID:0 TaskProxy:(0x0000000085F5C538) Value:0x85b66a00 Cost:(0/0)


OBJECT: 12:629577281:0         CleanCnt:3 Mode:S Flags: 0x1
Grant List 2:
Owner:0x00000000867FA180 Mode: S        Flg:0x40 Ref:2 Life:02000000 SPID:53 ECID:0 XactLockInfo: 0x0000000085F65780
SPID: 53 ECID: 0 Statement Type: UPDATE Line #: 11
Input Buf: Language Event: exec CauseDeadlock
Requested by:
ResType:LockOwner Stype:’OR’Xdes:0x0000000085F64E90 Mode: IX SPID:52 BatchID:0 ECID:0 TaskProxy:(0x00000000810B8538) Value:0x85b66a40 Cost:(0/0)

Victim Resource Owner:
ResType:LockOwner Stype:’OR’Xdes:0x0000000085F65740 Mode: IX SPID:53 BatchID:0 ECID:0 TaskProxy:(0x0000000085F5C538) Value:0x85b66a00 Cost:(0/0) [/source]


Comparing date truncations

Removing the time from a datetime is a task that comes up again and again and it sometimes seems like everyone’s got their own favorite way of doing it. I though I’d take a run through some of the ways I’ve seen and see how the compare in terms of performance.

The methods that I’ll be comparing are:

  1. Cast to varchar and back to datetime
  2. Cast to float, floor and cast back to datetime
  3. Use dateadd and datediff to get rid of the time
  4. Cast to date and back to datetime (SQL 2008 specific)

I ran each one 4 times and averaged the results


INSERT INTO #DateTest (DateValue)
SELECT TOP 1000000
DATEADD(mi,RAND(s1.number*5000 + s2.number)*5000,'2000/01/01' )
FROM master..spt_values s1 CROSS JOIN master..spt_values s2
WHERE s1.[name] IS NULL AND s2.NAME IS null


SELECT dateadd(dd, datediff(dd,0, DateValue),0)
FROM #DateTest
GO 4

SELECT convert(datetime,convert(varchar(10),DateValue,101))
FROM #DateTest
GO 4

FROM #DateTest
GO 4

FROM #DateTest
GO 4


And so, the winner of the date challenge:

  1. Dateadd and datediff – CPU time 348ms, execution time 8092 ms
  2. Cast to float and floor – CPU time 395ms, execution time 8270ms
  3. Cast to varchar – CPU time 1450ms, execution time 9280ms
  4. Cast to Date – CPU time 3063ms, execution time 7905ms

I was not expecing the cast to Date to take so much processing time. I think I’m going to take a look at the new date types in more detail, see what they perform like.

Views or Functions?

Someone asked this question on SQLServerCentral, I thought I’d post the test here for interest.

The question was on which would execute faster, creating a view, then creating a procedure that accepted a parameter and filtered the view, or creating a table valued function that took a parameter.

From testing, the answer is the procedure, but not by much. A inline table-valued-function comes a very close second and the multi-statement table-valued-function waddles in last. Not really surprising. (more…)

Parameter sniffing, pt 3

Last, but not least, here’s one final look at parameter sniffing.

In part 1 of this mini-series I wrote about data skew and the problems it can cause with parameters. In part 2, I looked at what happens when the optimiser can’t sniff values. In this post, I’m going to look at a third thing that can cause parameter sniffing problems. It’s fairly obvious once you know about it, but the first time it was pointed out to me I was a little surprised.

So, to recap. When the query optimiser gets a stored procedure to compile, it knows the values of any parameter passed to that procedure. It will then compile the procedure and optimise the queries based upon the value of those parameters. The optimiser cannot sniff the values of variables, because the values of the variables have not been set at the time that the procedure is optimised.

I’m going to use the same sample code and data as in the first article, to keep things consistent.

From the tests that were done before, I know that the query

select * from largetable where somestring = 'zzz'

executes optimally with an index seek and returns 9 rows. Likewise, I know that the query

select * from largetable where somestring = 'abc'

executes optimally with a clustered index scan and returns 1667 rows.

Now, let’s see if I can get the optimiser to make the wrong choice.


Parameter sniffing, pt 2

A while back I wrote about parameter sniffing, the situation where SQL compiles and caches an execution plan that is appropriate for a certain value of a parameter, but is non optimal for other values. There’s another side to parameter sniffing though – when the optimiser can’t sniff at all.

When a batch is submitted to the optimiser, the value of any parameters (from stored procedure, auto parametrisation or sp_executesql) are known to the optimiser. So are the value of any constants used in the SQL statement. The value of local variables, however, are not.

When the value of a variable used in the where clause is not known, the optimiser has know idea what value to use to estimate the number of affected rows. Hence, it has no idea how many rows from the table will satisfy the condition.


Parameter sniffing

This seems to come up again and again on the forums.

At its heart, parameter sniffing is the ability of the SQL Server optimiser to know the values of parameters passed to a stored proc at the point that it compiles the procedure. The idea is that if the parameter values are known, then the appropriate column statistics can be used and the optimiser can estimate the number of rows that the various query operators will have to process for various different possible execution plans.

Since the approximate number of rows is known, the cost of each possible plan can be more accurately calculated and a more accurate execution plan can be selected.

So, why is parameter sniffing so often a problem? Well, mainly, because parameter values do change.


Implicit conversions

Or ‘How to slow down a query without apparent reason’

I’ve discussed functions in where clauses before and shown how they can prevent index usage and slow down queries. I’ve also discussed data type conversions and shown that SQL Server sometimes does the conversions in places not expected. There’s a nice little gotcha that results from the combination of these two. Let’s take a look at a simple example.

Which of the following queries will run slower? (sample code at the end)

Select ID from TestingConversion where Code = 'AAA'
Select ID from TestingConversion where Code = N'AAA'


Functions in a where clause

Or ‘How to really slow a query down without trying’

Here’s a query that looks innocent enough. Table structure and sample data are at the end of the post.

-- returns 5 out of 5000 rows
SELECT InsertDate FROM TestingFunction WHERE LEFT(Code,1)='AA'

Knowing that there’s an index on the column code, the optimiser might be expected to use an index seek to satisfy the query. However, the execution plan shows an index scan. Why?