
A Trio of Table Variables

So, it’s the second Tuesday of the month, and that means it’s time for T-SQL Tuesday again. This month it’s hosted by Sankar Reddy and the topic is “Misconceptions in SQL Server”.

I thought I’d tackle a trio of table variable myths and partial truths.

Table Variables are memory-only

This one is pervasive and irritating. It typically goes like this:

You should use table variables rather than temp tables because table variables are memory only.

This myth can be broken down into two parts:

  1. That table variables are not part of TempDB
  2. That table variables are not written to disk

The first is easy to prove and has been done repeatedly. I’m not doing it again. I’m going to tackle the second portion only.

See, one could argue that, even though the table variable is created in the TempDB system tables and allocated pages within the TempDB data file, it is still kept entirely and only in memory. Let’s see if that’s true…

DECLARE @LargeTable TABLE (
id INT IDENTITY PRIMARY KEY,
LargeStringColumn1 CHAR(100),
LargeStringColumn2 CHAR(100)
)

INSERT INTO @LargeTable (LargeStringColumn1, LargeStringColumn2)
SELECT TOP (100000) 'Table Variable Test','T-SQL Tuesday!'
FROM master.sys.columns a CROSS JOIN master.sys.columns b

WAITFOR DELAY '00:01:00' -- so that the table var doesn't go out of scope and get deallocated too quickly.

This is not a massively large table: 100000 rows at 204 bytes per row (excluding the row header). A query of sys.dm_db_index_physical_stats (which does work on temp tables and table variables) reveals a total page count of 2632. That’s a grand total of 20.6 MB. Twenty megabytes. The SQL instance I’m running this on is allowed to use up to 2 GB of memory. There is no way on earth this table variable is going to cause any form of memory pressure (and I promise there is nothing else running).
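
For reference, a page count like that can be pulled from a second session while the WAITFOR keeps the table variable in scope. A sketch (picking the newest object in tempdb.sys.tables is a heuristic, since table variables get system-generated names; it assumes nothing else is creating objects in TempDB at the time):

-- Run from a second session while the WAITFOR above is still running.
DECLARE @ObjectID INT

SELECT TOP (1) @ObjectID = object_id
FROM tempdb.sys.tables
ORDER BY create_date DESC

SELECT index_id, page_count, record_count
FROM sys.dm_db_index_physical_stats(DB_ID('tempdb'), @ObjectID, NULL, NULL, 'DETAILED')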

So, run that code and, while that waitfor is running, do something that should never be done to a SQL server that you care anything about. (more…)

One wide index or multiple narrow indexes?

Or “If one index is good, surely many indexes (indexes? indices? indi?) will be better”

This is a question that comes up very often on the forums. Something along the lines of:

I have a query with multiple where clause conditions on a table. Should I create one index for each condition, or one index with all the columns in it?

The question basically boils down to this: which is more efficient and more likely for the optimiser to pick, a single seek operation against a wide index that seeks on all three conditions in one go, or three seek operations against three narrow indexes followed by a join to get back the final set of rows?

One thing to keep in mind is that one of the jobs of an index is to reduce the number of rows in consideration for a query as early as possible in the query’s execution.

So let’s take a made-up example. Let’s say we have a table with a number of columns in it. A query is run against that table with three conditions in the where clause

WHERE ColA = @A AND ColB = @B AND ColC = @C

Let’s further say that 1000 rows qualify for the condition ColA = @A, 15000 rows qualify for ColB = @B and 30000 rows qualify for ColC = @C. The total number of rows that qualify for all three conditions is 25.

Which sounds like it would be more efficient?

  • Seek on an index with all three columns and retrieve just 25 rows
  • Seek on an index on ColA, retrieve 1000 rows, seek on an index on ColB, retrieve 15000 rows, seek on an index on ColC, retrieve 30000 rows then join the three result-sets together to get the desired 25 rows (called an Index Intersection)
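
For concreteness, the two indexing strategies being compared would look something like this (the table and index names are hypothetical):

-- Option 1: one wide (composite) index keyed on all three columns
CREATE INDEX idx_Wide ON SomeTable (ColA, ColB, ColC)

-- Option 2: three narrow single-column indexes, leaving the optimiser
-- to intersect the three sets of matching rows
CREATE INDEX idx_ColA ON SomeTable (ColA)
CREATE INDEX idx_ColB ON SomeTable (ColB)
CREATE INDEX idx_ColC ON SomeTable (ColC)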

Time for some tests to find out.

(more…)

Come and gone

Or “Plan cache monitoring – insert and remove”

Previously I took a look at the CacheHit and CacheMiss events to see how they behave and how to identify what’s been searched for in the cache. In this follow-up, I want to take a similar look at the CacheInsert and CacheRemove events, to see when they fire and how to identify the objects that they relate to.

Again, a word of caution, these can be frequently occurring events on busy servers and so traces should be kept short and to a minimum of events and columns. That said, these should occur a lot less often than the CacheHit and CacheMiss events. If they are occurring very frequently it may indicate that the SQL Server is not reusing plans efficiently.

CacheInsert

The CacheInsert event fires after a CacheMiss. The search for a matching plan in the cache failed, firing a CacheMiss event. Since there’s no plan, the optimiser is invoked to generate one and then that plan is inserted into the plan cache before the Query Execution engine begins execution.

The event is fairly simple, though of course there are a few surprises (what in SQL doesn’t have a few?).

EXEC FireCacheEvents
GO

SELECT ID, SomeDate, Status
FROM TestingCacheEvents
WHERE Status = 'C'

[Profiler trace: the CacheInsert event]

(more…)

Hit and miss

Or “Monitoring plan cache usage”

For people interested in the details of how SQL is using and reusing execution plans, there are some useful events in profiler for watching this in detail, under the Stored procedure group:

  • SP:CacheMiss
  • SP:CacheInsert
  • SP:CacheHit
  • SP:CacheRemove
  • SP:Recompile
  • SP:StmtRecompile

Additionally there’s the SQL:StmtRecompile event under the TSQL group.

For now, I just want to look briefly at the CacheMiss and CacheHit events.

One word of caution early on: these are frequently occurring events, and it may not be a good idea to trace them on busy production servers. If you do need to, keep the duration of the trace short and the columns to a minimum.

[Screenshot: the cache events selected in Profiler]

CacheMiss

The CacheMiss event fires any time SQL looks for the execution plan for an object or ad-hoc batch and does not find it in the plan cache.

For an object (scalar function, multi-statement table-valued function, stored procedure or trigger) the match is done on the object ID (along with some of the connection’s SET options and possibly the database user, and a couple of other factors). For an ad-hoc batch, the match is done on a hash of the text of the batch (along with some of the connection’s SET options and possibly the database user).
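
As a lower-impact alternative to tracing, the plan cache DMVs show what is currently cached and how often each entry has been reused. A quick sketch:

-- What's in the plan cache right now, most-reused entries first
SELECT cp.usecounts, cp.objtype, cp.cacheobjtype, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
ORDER BY cp.usecounts DESC

Entries whose usecounts stay at 1 while near-identical entries pile up are a hint that plans are not being reused — the same situation that shows up as frequent CacheInsert events.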

When testing stored procedures from Management Studio (or another SQL querying tool), two CacheMiss events will appear in the trace.

[Profiler trace: two CacheMiss events for the stored procedure]

What’s going on here?

(more…)

Does a missing data file send a database suspect?

Short answer: No

I keep seeing this come up on various forums when people ask what makes a database go suspect or what could have caused their database to be marked suspect. I can almost guarantee that when that question comes up, one or more people will answer ‘missing files’. That may have been true on earlier versions (I don’t have a SQL 2000 or earlier instance to play with), but it is not true in the current versions of SQL. A missing file may result in the database being inaccessible (depending on which file it is), but it will not result in a suspect database.

Let’s prove it.

I’m going to create a database with three data files, two of which are in the primary filegroup, and one user table on each filegroup. (T-SQL code at the end.) Before each test I’ll begin a transaction, modify data in both tables, and shut SQL down so that there’s an uncommitted transaction in the log (meaning the database cannot be cleanly shut down), then rename a file before restarting SQL to see what happens.
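
A minimal sketch of the file layout described above (the full script is in the original post; the database name and file paths here are purely illustrative):

-- Two data files in the primary filegroup, one in a secondary filegroup
CREATE DATABASE FileTest
ON PRIMARY
 (NAME = FileTest_P1, FILENAME = 'D:\Data\FileTest_P1.mdf'),
 (NAME = FileTest_P2, FILENAME = 'D:\Data\FileTest_P2.ndf'),
FILEGROUP Secondary
 (NAME = FileTest_S1, FILENAME = 'D:\Data\FileTest_S1.ndf')
LOG ON
 (NAME = FileTest_Log, FILENAME = 'D:\Data\FileTest_Log.ldf')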

I’m not going to play with the transaction log; that I’ve done before. In SQL 2005/2008, if the transaction log is missing and the database was cleanly shut down, SQL will recreate it. If the transaction log is missing and the database was not cleanly shut down, the database goes into the RECOVERY_PENDING state, so no SUSPECT here.
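
Whichever way each test turns out, the database’s resulting state is easy to check after the restart:

-- Check the state of the database after each restart
SELECT name, state_desc
FROM sys.databases
WHERE name = 'FileTest' -- hypothetical name from the sketch above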

Let’s try the file in the secondary filegroup first.

(more…)

DBA Hell

On the first day of DBA hell, the server gave to me
A database with damaged system tables and no good backups (1)

On the second day of DBA hell, the server gave to me
Two databases with widespread corruption, no backups (1, 2)

On the third day of DBA hell, the server gave to me
Three suspect databases, no backups (1, 2, 3)

On the fourth day of DBA hell, the server gave me nothing, cause I didn’t have a job any longer…

How does one end up with a critical production database that has no backups? I could kinda understand if the backups were damaged, or if the corruption went undetected for long enough that it was in the backups as well, but to have no backups at all? Of an important database?

The only excuse for having no backups is if the database can be trivially and completely recreated from another source with minimal impact to the users. This is not the normal scenario.

There’s an immense amount of information available on backup and restore strategies, far more than could possibly be listed here. Enough that there’s really no good excuse to not have backups when they’re needed.

As Steve Jones (blog|twitter) is fond of saying, “Good backup, good resume. You only need one.”

In, Exists and join – a roundup

Over the last several months I’ve had a look at IN, Exists, Join and their opposites to see how they perform and whether there’s any truth in the advice that is often seen on forums and blogs advocating replacing one with the other.

Previous parts of this series can be found:

In this roundup post, I’m going to do multiple tests on the 6 query forms, with different numbers of rows, indexes, no indexes and, for the negative forms (NOT IN, NOT EXISTS), nullable and non-nullable join columns.

In the individual tests, I used 250000 rows in the first table and around 3000 rows in the secondary table. In this roundup, I’m going to use three different row counts: 1000000 rows, 50000 rows and 2500 rows. That should give a reasonable idea of performance at various table sizes. (Not much point in going smaller than 2500 rows. Everything’s fast on 100 rows.)

Some notes on the tests:

  • The version of SQL is SQL Server 2008 SP1 x64 Developer Edition.
  • The tests were run on a laptop. Core-2 Duo, 3 GB memory. SQL limited to 1 processor, so no parallelism possible.
  • Each query will be run 10 times, reads, cpu and duration measured by profiler and averaged.
  • Each query will be run once before the tests start to ensure that the data is in cache and the execution plans are generated and cached.
  • Reproduction scripts will be available for download.
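
The published numbers come from Profiler, but a single measurement pass can be approximated with STATISTICS IO and TIME plus the SSMS batch repeater. A sketch, using the test tables that appear elsewhere in this series:

SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT b.id, b.SomeColumn
FROM BigTable b
WHERE b.SomeColumn IN (SELECT s.LookupColumn FROM SmallerTable s)

SET STATISTICS IO OFF
SET STATISTICS TIME OFF
GO 10 -- SSMS/sqlcmd: repeat the preceding batch 10 times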

(more…)

Left outer join vs NOT EXISTS

And to wrap up the miniseries on IN, EXISTS and JOIN, a look at NOT EXISTS and LEFT OUTER JOIN for finding non-matching rows.

For previous parts, see

I’m looking at NOT EXISTS and LEFT OUTER JOIN, as opposed to NOT IN and LEFT OUTER JOIN, because, as shown in the previous part of this series, NOT IN behaves badly in the presence of NULLs. Specifically, if there are any NULLs in the subquery’s result set, NOT IN returns no matches at all.

The LEFT OUTER JOIN, like the NOT EXISTS, can handle NULLs in the second result set without automatically returning no matches. It behaves the same regardless of whether the join columns are nullable or not. Seeing as NULL does not equal anything, any rows in the second result set that have NULL for the join column are eliminated by the join and have no further effect on the query.

It is important, when using the LEFT OUTER JOIN … IS NULL form, to carefully pick the column used for the IS NULL check. It should either be a non-nullable column (the primary key is the classic choice) or the join column itself (as NULLs in that column are eliminated by the join).
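
To make that concrete, using the test tables defined just below: the inner table’s primary key and the join column are both safe choices; any other nullable column is not.

-- Safe: SmallerTable's PK can only be NULL where there was no match
SELECT b.id, b.SomeColumn
FROM BigTable b
LEFT OUTER JOIN SmallerTable s ON b.SomeColumn = s.LookupColumn
WHERE s.id IS NULL

-- Equally safe: WHERE s.LookupColumn IS NULL (the join column).
-- Risky: WHERE s.SomeArbDate IS NULL would, if that column allowed
-- NULLs, also return matched rows that merely have a NULL date.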

Onto the tests

The usual test tables…

CREATE TABLE BigTable (
id INT IDENTITY PRIMARY KEY,
SomeColumn char(4) NOT NULL,
Filler CHAR(100)
)

CREATE TABLE SmallerTable (
id INT IDENTITY PRIMARY KEY,
LookupColumn char(4) NOT NULL,
SomeArbDate Datetime default getdate()
)

INSERT INTO BigTable (SomeColumn)
SELECT top 250000
char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) +
char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
from master.sys.columns a cross join master.sys.columns b

INSERT INTO SmallerTable (LookupColumn)
SELECT DISTINCT SomeColumn
FROM BigTable TABLESAMPLE (25 PERCENT)
-- (3918 row(s) affected)

First without indexes

-- Query 1
SELECT BigTable.ID, SomeColumn
	FROM BigTable LEFT OUTER JOIN SmallerTable ON BigTable.SomeColumn = SmallerTable.LookupColumn
	WHERE LookupColumn IS NULL

-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE NOT EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

Let’s take a look at the execution plans

[Execution plans: LEFT OUTER JOIN vs NOT EXISTS, no indexes]

(more…)

The Root of all Evil

Or “Shotgun query tuning”

There have been a fair few forum questions in recent months asking for help in removing index scans, loop joins, sorts or other presumed-slow query operators. There have been just as many asking how to replace a subquery with a join, or a join with a subquery, or to make similar changes to a query, usually for performance reasons.

The first question that I have to ask when looking at requests like that is “Why?”

Why is removing a particular query operator the goal? Why is changing a where clause predicate the goal? If it’s to make the query faster, has the query been examined and has it been confirmed that query operator or predicate really is the problem?

The title of this post refers to a comment I’ve seen again and again in blogs and articles about front-end development: “Premature optimisation is the root of all evil.” It’s true in the database field as well.

(more…)

NOT EXISTS vs NOT IN

Continuing with the mini-series on query operators, I want to have a look at NOT EXISTS and NOT IN.

Previous parts of this miniseries are:

Just one note before diving in. The examples I’m using are fairly simplistic, and that’s intentional: I’m trying to find what, if any, performance differences exist in a benchmark-style setup. I’ll have some comments on more complex examples in a later post.

The most important thing to note about NOT EXISTS and NOT IN is that, unlike EXISTS and IN, they are not equivalent in all cases. Specifically, when NULLs are involved, they will return different results. To be totally specific: when the subquery returns even one NULL, NOT IN will not match any rows.

The reason for this can be found by looking at the details of what the NOT IN operation actually means.

Let’s say, for illustration purposes, that there’s a table called t with 4 rows, a column called ID with values 1 to 4, and a column called AVal.

WHERE SomeValue NOT IN (SELECT AVal FROM t)

is equivalent to

WHERE (
SomeValue != (SELECT AVal FROM t WHERE ID=1)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=2)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=3)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=4)
)

Let’s further say that AVal is NULL where ID = 4. Since any comparison with NULL yields UNKNOWN, that != comparison returns UNKNOWN. The logical truth table for AND states that UNKNOWN AND TRUE is UNKNOWN, and UNKNOWN AND FALSE is FALSE. There is no value that can be AND’d with UNKNOWN to produce the result TRUE.

Hence, if any row of that subquery returns NULL, the entire NOT IN predicate will evaluate to either FALSE or UNKNOWN, and no records will be returned.
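
A two-line illustration of the effect (the values are arbitrary):

-- 1 NOT IN (2, 3) is TRUE; 1 NOT IN (2, 3, NULL) is UNKNOWN
SELECT CASE WHEN 1 NOT IN (2, 3)       THEN 'match' ELSE 'no match' END AS WithoutNull,
       CASE WHEN 1 NOT IN (2, 3, NULL) THEN 'match' ELSE 'no match' END AS WithNull
-- Returns: match, no match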

So what about EXISTS?

EXISTS cannot return NULL. It checks solely for the presence or absence of rows in the subquery and hence can only return TRUE or FALSE. Since it cannot return NULL, there’s no possibility of a single NULL making the entire expression evaluate to UNKNOWN.

Hence, when the column in the subquery that’s used for comparison with the outer table can have nulls in it, consider carefully which of NOT EXISTS or NOT IN you want to use.

OK, but say there are no NULLs in the column. How do they compare speed-wise? I’m going to do two tests: one where the columns involved in the comparison are defined as NULL and one where they are defined as NOT NULL. There will be no NULL values in the columns in either case. In both cases, the join columns will be indexed. After all, we all index our join columns, right?

So, the first test: non-nullable columns. Some setup first.

CREATE TABLE BigTable (
id INT IDENTITY PRIMARY KEY,
SomeColumn char(4) NOT NULL,
Filler CHAR(100)
)

CREATE TABLE SmallerTable (
id INT IDENTITY PRIMARY KEY,
LookupColumn char(4) NOT NULL,
SomeArbDate Datetime default getdate()
)

INSERT INTO BigTable (SomeColumn)
SELECT top 250000
char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) +
char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
from master.sys.columns a cross join master.sys.columns b

INSERT INTO SmallerTable (LookupColumn)
SELECT DISTINCT SomeColumn
FROM BigTable TABLESAMPLE (25 PERCENT)
-- (3898 row(s) affected)

CREATE INDEX idx_BigTable_SomeColumn
ON BigTable (SomeColumn)
CREATE INDEX idx_SmallerTable_LookupColumn
ON SmallerTable (LookupColumn)

Then the queries

-- Query 1
SELECT ID, SomeColumn FROM BigTable
WHERE SomeColumn NOT IN (SELECT LookupColumn FROM SmallerTable)

-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE NOT EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

The first thing to note is that the execution plans are identical.

[Execution plans: NOT IN vs NOT EXISTS, non-nullable columns (identical plans)]

The execution characteristics are also identical.

Query 1
Table ‘BigTable’. Scan count 1, logical reads 342, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 8, physical reads 0.

SQL Server Execution Times:
CPU time = 156 ms,  elapsed time = 221 ms.

Query 2
Table ‘BigTable’. Scan count 1, logical reads 342, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 8, physical reads 0.

SQL Server Execution Times:
CPU time = 156 ms,  elapsed time = 247 ms.

So, at least for the case where the columns are defined as NOT NULL, these two perform the same.

What about the case where the columns are defined as nullable? I’m going to simply alter the two columns involved without changing anything else, then test out the two queries again.

ALTER TABLE BigTable
 ALTER COLUMN SomeColumn char(4) NULL

ALTER TABLE SmallerTable
 ALTER COLUMN LookupColumn char(4) NULL

And the same two queries

-- Query 1

SELECT ID, SomeColumn FROM BigTable
WHERE SomeColumn NOT IN (SELECT LookupColumn FROM SmallerTable)

-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE NOT EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

And as for their performance…

[Execution plans: NOT IN vs NOT EXISTS, nullable columns]

Query 1
Table ‘SmallerTable’. Scan count 3, logical reads 500011, physical reads 0.
Table ‘BigTable’. Scan count 1, logical reads 437, physical reads 0.

SQL Server Execution Times:
CPU time = 827 ms,  elapsed time = 825 ms.

Query 2
Table ‘BigTable’. Scan count 1, logical reads 437, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 9, physical reads 0.

SQL Server Execution Times:
CPU time = 156 ms,  elapsed time = 228 ms.

Radically different execution plans, radically different performance characteristics. The NOT IN took over five times longer to execute and did over a thousand times more reads.

Why is that complex execution plan required when there may be NULLs in the column? I can’t answer that one; probably only one of the query optimiser developers can. The results, however, are obvious: when the columns allow NULLs but contain none, NOT IN performs significantly worse than NOT EXISTS.

So, take-aways from this?

Most importantly, NOT EXISTS and NOT IN do not behave the same when there are NULLs involved. Choose carefully which one you want.

Columns that will never contain NULL values should be defined as NOT NULL, so that SQL knows there will never be NULL values in them and doesn’t have to produce complex plans to handle potential NULLs.
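
For the test tables above, reversing the earlier change is one statement per column (possible here only because the columns contain no NULL values):

ALTER TABLE BigTable
 ALTER COLUMN SomeColumn char(4) NOT NULL

ALTER TABLE SmallerTable
 ALTER COLUMN LookupColumn char(4) NOT NULL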

On non-nullable columns, the behaviour and performance of NOT IN and NOT EXISTS are the same, so use whichever one works better for the specific situation.

One more to go on this: LEFT OUTER JOIN with the IS NULL check vs NOT IN