T-SQL

Converting OR to Union

When I looked at indexing for queries containing predicates combined with OR, it became clear that there are some restrictive requirements before the optimiser will consider using indexes for seek operations.

  • Each predicate (or set of predicates) combined with an OR must have a separate index
  • All of those indexes must be covering, or the row count of the concatenated result set must be low enough to make key lookups an option, as the optimiser does not appear to consider the possibility of doing key lookups for a subset of the predicates before concatenating the result sets.

So what can be done if it’s not possible to meet those requirements?

The standard trick is to convert the query with ORs into multiple queries combined with UNION. The idea is that since OR predicates are evaluated separately and the result sets concatenated, we can do that manually by writing the queries separately and concatenating them using UNION or UNION ALL. (UNION ALL can only be safely used if the predicates are known to be mutually exclusive)

CREATE TABLE Persons (
PersonID INT IDENTITY PRIMARY KEY,
FirstName    VARCHAR(30),
Surname VARCHAR(30),
Country CHAR(3),
RegistrationDate DATE
)

CREATE INDEX idx_Persons_FirstName ON dbo.Persons (FirstName) INCLUDE (Surname)
CREATE INDEX idx_Persons_Surname ON dbo.Persons (Surname) INCLUDE (FirstName)
GO

-- Data population using SQLDataGenerator

SELECT FirstName, Surname
FROM dbo.Persons
WHERE FirstName = 'Daniel' OR Surname = 'Barnes'

SELECT FirstName, Surname
FROM dbo.Persons
WHERE FirstName = 'Daniel'
UNION
SELECT FirstName, Surname
FROM dbo.Persons
WHERE Surname = 'Barnes'

In this case, the OR can be replaced with a UNION and the results are the same. The UNION form is slightly less efficient according to the execution plan’s costings (60% compared to the OR’s 40%), and the two queries have the same general form: two index seeks followed by some form of concatenation and duplicate removal.

[Images: query results and execution plans for the OR and UNION forms]

So in that case it worked fine, although the original form was a little more efficient.
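As a side note on the UNION ALL caveat mentioned earlier: one way to use UNION ALL safely is to make the second branch explicitly exclude rows the first branch already returned, which avoids the duplicate-removal step. A rough sketch (not part of the original tests, and not measured here):

SELECT FirstName, Surname
FROM dbo.Persons
WHERE FirstName = 'Daniel'
UNION ALL
SELECT FirstName, Surname
FROM dbo.Persons
WHERE Surname = 'Barnes'
AND (FirstName <> 'Daniel' OR FirstName IS NULL) -- make the branches mutually exclusive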
(more…)

On Transactions, errors and rollbacks

Do errors encountered within a transaction result in a rollback?

It seems, at first, to be a simple question with an obvious answer. Transactions are supposed to be atomic: either the entire transaction completes or none of it completes.

Maybe too simple…

CREATE TABLE TestingTransactionRollbacks (
 ID INT NOT NULL PRIMARY KEY ,
 SomeDate DATETIME DEFAULT GETDATE()
 ) ;
GO
BEGIN TRANSACTION
-- succeeds
INSERT INTO TestingTransactionRollbacks (ID)
 VALUES (1)
-- Fails. Cannot insert null into a non-null column
INSERT INTO TestingTransactionRollbacks (ID)
 VALUES (NULL)
-- succeeds
INSERT INTO TestingTransactionRollbacks (ID)
 VALUES (2)
-- fails. Duplicate key
INSERT INTO TestingTransactionRollbacks (ID)
 VALUES (2)
-- succeeds
INSERT INTO TestingTransactionRollbacks (ID)
 VALUES (3)
COMMIT TRANSACTION
GO
SELECT ID, SomeDate FROM TestingTransactionRollbacks
GO
DROP TABLE TestingTransactionRollbacks

If a transaction rolled back at the first failure, that final select would return no rows. But it doesn’t; it returns 3 rows. The failures of the individual statements were ignored and the transaction completed and committed. If that had been an important business process rather than a made-up example, that could have had some nasty consequences for the transactional consistency of the data.

What’s really going on here? Aren’t transactions supposed to be atomic? Isn’t SQL supposed to roll them back if they don’t complete successfully?

Well, kinda.
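Many run-time errors, including the constraint violations above, abort only the offending statement, not the batch or the transaction. One way to get genuine all-or-nothing behaviour is SET XACT_ABORT. A minimal sketch, assuming the table has been recreated empty, and not necessarily the approach discussed in the rest of the post:

SET XACT_ABORT ON -- a run-time error now terminates the batch and rolls back the open transaction
BEGIN TRANSACTION
INSERT INTO TestingTransactionRollbacks (ID)
 VALUES (1)
-- fails as before, but this time the whole transaction is rolled back and the batch stops here
INSERT INTO TestingTransactionRollbacks (ID)
 VALUES (NULL)
INSERT INTO TestingTransactionRollbacks (ID)
 VALUES (2)
COMMIT TRANSACTION
GO
-- returns 0 rows
SELECT ID, SomeDate FROM TestingTransactionRollbacks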

(more…)

A Trio of Table Variables

So, it’s the second Tuesday of the month, which means it’s time for T-SQL Tuesday again. This month it’s hosted by Sankar Reddy and the topic is “Misconceptions in SQL Server”.

I thought I’d tackle a trio of table variable myths and partial truths.

Table Variables are memory-only

This one is pervasive and irritating. It typically goes like this:

You should use table variables rather than temp tables because table variables are memory only.

This myth can be broken down into two parts:

  1. That table variables are not part of TempDB
  2. That table variables are not written to disk

The first is easy to prove and has been done repeatedly. I’m not doing it again. I’m going to tackle the second portion only.

See, one could argue that, even though the table variable is created in the TempDB system tables and allocated pages within the TempDB data file, it is still kept entirely and only in memory. Let’s see if that’s true…

DECLARE @LargeTable TABLE (
id INT IDENTITY PRIMARY KEY,
LargeStringColumn1 CHAR(100),
LargeStringColumn2 CHAR(100)
)

INSERT INTO @LargeTable (LargeStringColumn1, LargeStringColumn2)
SELECT TOP (100000) 'Table Variable Test','T-SQL Tuesday!'
FROM master.sys.columns a CROSS JOIN master.sys.columns b

WAITFOR DELAY '00:01:00' -- so that the table var doesn't go out of scope and get deallocated too quickly.

This is not a massively large table. 100000 rows at 204 bytes per row (excluding header). A query of sys.dm_db_index_physical_stats (which does work on temp tables and table variables) reveals a total page count of 2632. That’s a grand total of 20.6 MB. 20 Megabytes. The SQL instance I’m running this on is allowed to use up to 2 GB of memory. No way on earth is this table variable going to cause any form of memory pressure (and I promise there is nothing else running).
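For anyone who wants to see that page count themselves, this is a sketch of the sort of query that can be run from a second connection while the WAITFOR above is still executing (the filtering is deliberately loose; the table variable shows up as just another object in TempDB):

SELECT object_id, index_type_desc, alloc_unit_type_desc, page_count
FROM sys.dm_db_index_physical_stats (DB_ID('tempdb'), NULL, NULL, NULL, 'LIMITED')
ORDER BY page_count DESC
-- the row with roughly 2600 pages is the table variable's clustered index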

So, run that code and, while that waitfor is running, do something that should never be done to a SQL server that you care anything about. (more…)

In, Exists and join – a roundup

Over the last several months I’ve had a look at IN, EXISTS, JOIN and their opposites to see how they perform and whether there’s any truth in the advice that is often seen on forums and blogs advocating replacing one with the other.

Previous parts of this series can be found:

In this roundup post, I’m going to do multiple tests on the 6 query forms, with different numbers of rows, indexes, no indexes and, for the negative forms (NOT IN, NOT EXISTS), nullable and non-nullable join columns.

In the individual tests, I used 250000 rows in the first table and around 3000 rows in the secondary table. In this roundup, I’m going to use 3 different row counts: 1000000 rows, 50000 rows and 2500 rows. That should give a reasonable idea of performance at various table sizes. (Not much point in going smaller than 2500 rows. Everything’s fast on 100 rows)

Some notes on the tests.

  • The version of SQL is SQL Server 2008 SP1 x64 Developer Edition.
  • The tests were run on a laptop. Core-2 Duo, 3 GB memory. SQL limited to 1 processor, so no parallelism possible.
  • Each query will be run 10 times, with reads, CPU and duration measured by Profiler and averaged (a rough single-run alternative using the STATISTICS settings is sketched after this list).
  • Each query will be run once before the tests start to ensure that the data is in cache and the execution plans are generated and cached.
  • Reproduction scripts will be available for download.
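For repeating a single run without a Profiler trace, the same three measurements can be approximated with the STATISTICS session settings. A rough sketch (this is not how the published numbers were gathered, and the table names are simply those used elsewhere in this series):

SET STATISTICS IO ON
SET STATISTICS TIME ON

-- run the query form under test here, for example
SELECT ID, SomeColumn FROM BigTable
WHERE NOT EXISTS (SELECT 1 FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

SET STATISTICS IO OFF
SET STATISTICS TIME OFF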

(more…)

Left outer join vs NOT EXISTS

And to wrap up the miniseries on IN, EXISTS and JOIN, a look at NOT EXISTS and LEFT OUTER JOIN for finding non-matching rows.

For previous parts, see

I’m looking at NOT EXISTS and LEFT OUTER JOIN, as opposed to NOT IN and LEFT OUTER JOIN, because, as shown in the previous part of this series, NOT IN behaves badly in the presence of NULLs. Specifically, if there are any NULLs in the result set, NOT IN returns 0 matches.

The LEFT OUTER JOIN, like the NOT EXISTS, can handle NULLs in the second result set without automatically returning no matches. It behaves the same regardless of whether the join columns are nullable or not. Since NULL does not equal anything, any rows in the second result set that have NULL for the join column are eliminated by the join and have no further effect on the query.

It is important, when using the LEFT OUTER JOIN … IS NULL form, to pick the column used for the IS NULL check carefully. It should be either a non-nullable column (the primary key is the classic choice) or the join column (as NULLs in that column will have been eliminated by the join).
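As a hypothetical illustration of why that matters (using the test tables defined just below, and imagining for the moment that SmallerTable.SomeArbDate could contain NULLs), checking IS NULL against a nullable column that is not the join column can make matched rows look like non-matches:

-- Risky: SomeArbDate is not the join column, so a matched row whose SomeArbDate
-- happened to be NULL would also pass the filter and be reported as a non-match
SELECT BigTable.ID, SomeColumn
	FROM BigTable LEFT OUTER JOIN SmallerTable ON BigTable.SomeColumn = SmallerTable.LookupColumn
	WHERE SmallerTable.SomeArbDate IS NULL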

Onto the tests

The usual test tables…

CREATE TABLE BigTable (
id INT IDENTITY PRIMARY KEY,
SomeColumn char(4) NOT NULL,
Filler CHAR(100)
)

CREATE TABLE SmallerTable (
id INT IDENTITY PRIMARY KEY,
LookupColumn char(4) NOT NULL,
SomeArbDate Datetime default getdate()
)

INSERT INTO BigTable (SomeColumn)
SELECT top 250000
char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) +
char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
from master.sys.columns a cross join master.sys.columns b

INSERT INTO SmallerTable (LookupColumn)
SELECT DISTINCT SomeColumn
FROM BigTable TABLESAMPLE (25 PERCENT)
-- (3918 row(s) affected)

First without indexes

-- Query 1
SELECT BigTable.ID, SomeColumn
	FROM BigTable LEFT OUTER JOIN SmallerTable ON BigTable.SomeColumn = SmallerTable.LookupColumn
	WHERE LookupColumn IS NULL

-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE NOT EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

Let’s take a look at the execution plans

[Image: execution plans for the two queries, without indexes]

(more…)

NOT EXISTS vs NOT IN

Continuing with the mini-series on query operators, I want to have a look at NOT EXISTS and NOT IN.

Previous parts of this miniseries are:

Just one note before diving into that. The examples I’m using are fairly simplistic and that’s intentional. I’m trying to find what, if any, are the performance differences in a benchmark-style setup. I’ll have some comments on more complex examples in a later post.

The most important thing to note about NOT EXISTS and NOT IN is that, unlike EXISTS and IN, they are not equivalent in all cases. Specifically, when NULLs are involved they will return different results. To be totally specific, when the subquery returns even one NULL, NOT IN will not match any rows.

The reason for this can be found by looking at the details of what the NOT IN operation actually means.

Let’s say, for illustration purposes, that there are 4 rows in a table called t, with a column called ID holding the values 1..4 and a column called AVal.

WHERE SomeValue NOT IN (SELECT AVal FROM t)

is equivalent to

WHERE (
SomeValue != (SELECT AVal FROM t WHERE ID=1)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=2)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=3)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=4)
)

Let’s further say that AVal is NULL where ID = 4. Hence that != comparison returns UNKNOWN. The logical truth table for AND states that UNKNOWN AND TRUE is UNKNOWN, and UNKNOWN AND FALSE is FALSE. There is no value that can be ANDed with UNKNOWN to produce the result TRUE.

Hence, if any row of that subquery returns NULL, the entire NOT IN operator will evaluate to either FALSE or UNKNOWN and no records will be returned.

So what about EXISTS?

EXISTS cannot return NULL. It’s checking solely for the presence or absence of a row in the subquery and, hence, it can only return TRUE or FALSE. Since it cannot return NULL, there’s no possibility of a single NULL resulting in the entire expression evaluating to UNKNOWN.

Hence, when the column in the subquery that’s used for comparison with the outer table can have nulls in it, consider carefully which of NOT EXISTS or NOT IN you want to use.
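A tiny throwaway example makes the difference easy to see (the table variables @Outer and @Inner exist purely for this illustration):

DECLARE @Outer TABLE (SomeValue INT)
Insert into @Outer (SomeValue) Values (1)
Insert into @Outer (SomeValue) Values (2)

DECLARE @Inner TABLE (AVal INT)
Insert into @Inner (AVal) Values (1)
Insert into @Inner (AVal) Values (NULL)

-- Returns no rows. The NULL in @Inner makes the NOT IN evaluate to UNKNOWN for every outer row
SELECT SomeValue FROM @Outer
WHERE SomeValue NOT IN (SELECT AVal FROM @Inner)

-- Returns the row with SomeValue = 2, as expected
SELECT SomeValue FROM @Outer o
WHERE NOT EXISTS (SELECT 1 FROM @Inner i WHERE i.AVal = o.SomeValue)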

OK, but say there are no NULLs in the column. How do they compare speed-wise? I’m going to do two tests: one where the columns involved in the comparison are defined as NULL and one where they are defined as NOT NULL. There will be no NULL values in the columns in either case. In both cases, the join columns will be indexed. After all, we all index our join columns, right?

So, first test, non-nullable columns. First some setup

CREATE TABLE BigTable (
id INT IDENTITY PRIMARY KEY,
SomeColumn char(4) NOT NULL,
Filler CHAR(100)
)

CREATE TABLE SmallerTable (
id INT IDENTITY PRIMARY KEY,
LookupColumn char(4) NOT NULL,
SomeArbDate Datetime default getdate()
)

INSERT INTO BigTable (SomeColumn)
SELECT top 250000
char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) +
char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
from master.sys.columns a cross join master.sys.columns b

INSERT INTO SmallerTable (LookupColumn)
SELECT DISTINCT SomeColumn
FROM BigTable TABLESAMPLE (25 PERCENT)
-- (3898 row(s) affected)

CREATE INDEX idx_BigTable_SomeColumn
ON BigTable (SomeColumn)
CREATE INDEX idx_SmallerTable_LookupColumn
ON SmallerTable (LookupColumn)

Then the queries

-- Query 1
SELECT ID, SomeColumn FROM BigTable
WHERE SomeColumn NOT IN (SELECT LookupColumn FROM SmallerTable)

-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE NOT EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

The first thing to note is that the execution plans are identical.

[Image: identical execution plans for the two queries on non-nullable columns]

The execution characteristics are also identical.

Query 1
Table ‘BigTable’. Scan count 1, logical reads 342, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 8, physical reads 0.

SQL Server Execution Times:
CPU time = 156 ms,  elapsed time = 221 ms.

Query 2
Table ‘BigTable’. Scan count 1, logical reads 342, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 8, physical reads 0.

SQL Server Execution Times:
CPU time = 156 ms,  elapsed time = 247 ms.

So, at least for the case where the columns are defined as NOT NULL, these two perform the same.

What about the case where the columns are defined as nullable? I’m going to simply alter the two columns involved without changing anything else, then test out the two queries again.

ALTER TABLE BigTable
 ALTER COLUMN SomeColumn char(4) NULL

ALTER TABLE SmallerTable
 ALTER COLUMN LookupColumn char(4) NULL

And the same two queries

-- Query 1

SELECT ID, SomeColumn FROM BigTable
WHERE SomeColumn NOT IN (SELECT LookupColumn FROM SmallerTable)

-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE NOT EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

And as for their performance…

[Image: execution plans for the two queries on nullable columns]

Query 1
Table ‘SmallerTable’. Scan count 3, logical reads 500011, physical reads 0.
Table ‘BigTable’. Scan count 1, logical reads 437, physical reads 0.

SQL Server Execution Times:
CPU time = 827 ms,  elapsed time = 825 ms.

Query 2
Table ‘BigTable’. Scan count 1, logical reads 437, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 9, physical reads 0.

SQL Server Execution Times:
CPU time = 156 ms,  elapsed time = 228 ms.

Radically different execution plans, radically different performance characteristics. The NOT IN took over 5 times longer to execute and did thousands of times more reads.

Why is that complex execution plan required when there may be NULLs in the column? I can’t answer that one; probably only one of the query optimiser developers can. However, the results are obvious: when the columns allow NULLs but contain none, the NOT IN performs significantly worse than the NOT EXISTS.

So, take-aways from this?

Most importantly, NOT EXISTS and NOT IN do not have the same behaviour when there are NULLs involved. Choose carefully which one you want.

Columns that will never contain NULL values should be defined as NOT NULL so that SQL knows there will never be NULL values in them and doesn’t have to produce complex plans to handle potential NULLs.
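For the test tables used here, that just means reverting the earlier change:

ALTER TABLE BigTable
 ALTER COLUMN SomeColumn char(4) NOT NULL

ALTER TABLE SmallerTable
 ALTER COLUMN LookupColumn char(4) NOT NULL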

On non-nullable columns, the behaviour and performance of NOT IN and NOT EXISTS are the same, so use whichever one works better for the specific situation.

One more to go on this: LEFT OUTER JOIN with the IS NULL check vs NOT IN

IN vs INNER JOIN

Often in forum threads discussing query performance I’ll see people recommending replacing an INNER JOIN with an IN or EXISTS (or recommending replacing an IN or EXISTS with an INNER JOIN) for performance reasons. I’ve previously looked at how IN and EXISTS compare; now I’m going to investigate how IN compares with the join.

One very important thing to note right off is that they are not equivalent in all cases.

An inner join between two tables does a complete join: it checks for matches and returns rows. This means that if there are multiple matching rows in the second table, multiple rows will be returned. Also, when two tables are joined, columns can be returned from either. As a quick example:

DECLARE @BigTable TABLE (
 id INT IDENTITY PRIMARY KEY,
 SomeColumn CHAR(4),
 Filler CHAR(100)
)

Insert into @BigTable(SomeColumn) Values (1)
Insert into @BigTable(SomeColumn) Values (2)
Insert into @BigTable(SomeColumn) Values (3)
Insert into @BigTable(SomeColumn) Values (4)
Insert into @BigTable(SomeColumn) Values (5)


DECLARE @SomeTable TABLE (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
 FROM @BigTable b INNER JOIN @SomeTable  s ON b.SomeColumn = s.IntCol

This returns 7 rows and returns columns from both tables. Because the values 2 and 5 each appear twice in @SomeTable, the BigTable rows that match them are returned twice.

With an IN, what is done is a semi-join: a join that checks for matches but does not return rows from the second table. This means that it doesn’t matter if there are multiple matching rows in the result set used for the IN; only one row from the first table will be returned. Also, because the rows are not returned, columns from the table referenced in the IN cannot be returned. As a quick example:

DECLARE @BigTable TABLE (
 id INT IDENTITY PRIMARY KEY,
 SomeColumn CHAR(4),
 Filler CHAR(100)
)

Insert into @BigTable(SomeColumn) Values (1)
Insert into @BigTable(SomeColumn) Values (2)
Insert into @BigTable(SomeColumn) Values (3)
Insert into @BigTable(SomeColumn) Values (4)
Insert into @BigTable(SomeColumn) Values (5)

DECLARE @SomeTable TABLE (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
 FROM @BigTable
 WHERE SomeColumn IN (Select IntCol FROM @SomeTable)

This returns 5 rows and only columns from BigTable.

So, that said, how does the performance of the two differ for the cases where the results are identical (no duplicates in the second table, no columns needed from the second table)? For that, I’m going to need larger tables to play with. (more…)

EXISTS vs IN

This one’s come up a few times recently, so I’ll take a look at it.

The EXISTS and IN clauses at first glance look fairly similar. They both use a subquery to evaluate rows, but they do it in slightly different ways.

IN does a direct match between the column specified before the IN keyword and the values returned by the subquery. When using IN, only a single column can be specified in the select clause of the subquery.

Let’s have a look at a quick example

DECLARE @SomeTable TABLE (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
FROM BigTable
WHERE SomeColumn IN (Select IntCol FROM @SomeTable)

So this query returns all the rows from BigTable where SomeColumn has any of the values returned by the subquery, that is 1, 2, 3, 4 or 5.

But what if there were duplicate values returned by the subquery?

(more…)

Functions, IO statistics and the Execution plan

It’s no secret that I’m not overly fond of most user-defined functions. This isn’t just a pet hate; I have some good reasons for disliking them. All too often they’re performance bottlenecks, but that can be said about many things in SQL. The bigger problem is that they’re hidden performance bottlenecks that often go overlooked and ignored for too long.

I’m going to start with this fairly simple scalar function, created in the AdventureWorks database

Create function LineItemTotal(@ProductID int)
returns money
as
begin
declare @Total money

select @Total = sum(LineTotal) from sales.SalesOrderDetail where productid = @ProductID

return @Total
end

So, given that function, the following two queries should be equivalent.

SELECT productid, productnumber, dbo.LineItemTotal(productid) as SumTotal
FROM Production.Product p

SELECT productid, productnumber,
(select sum(LineTotal) from sales.SalesOrderDetail where productid = p.productid) AS SumTotal
FROM Production.Product p

No problems so far. They both return 504 rows (in my copy of AW, which has been slightly padded out with more data). Now, let’s look at the execution characteristics by running them again with Statistics IO and Statistics Time on.
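For anyone following along, turning those on is just:

SET STATISTICS IO ON
SET STATISTICS TIME ON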

Query 1, the one with the scalar function:

Table ‘Product’. Scan count 1, logical reads 4, physical reads 0.

SQL Server Execution Times:
CPU time = 47297 ms,  elapsed time = 47541 ms.

Query 2, the one with the correlated subquery:

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘SalesOrderDetail’. Scan count 3, logical reads 22536, physical reads 0.
Table ‘Product’. Scan count 3, logical reads 40, physical reads 0.

SQL Server Execution Times:
CPU time = 1047 ms, elapsed time = 1249 ms.

(more…)

On Counts

Or “What’s the fastest way to count the rows?”

It’s fairly common to need to know the number of rows in a table, the number of rows that match a certain condition, or whether there are any rows at all that match a condition. There are also a number of ways of doing so, some better than others. The problem is that counting is not a cheap operation, especially on big tables. It’s not as bad as a sort, but it can still be expensive.

So, given that, let’s take a look at some of the ways.

Querying the metadata

If all that’s needed is the number of rows in the table, and it’s not 100% important that the value be completely accurate all the time, the system metadata can be queried. In SQL 2000 and below, that info was in sysindexes. In 2005 and higher it’s been moved into sys.partitions.

SELECT OBJECT_NAME(object_id) AS TableName, SUM(rows) AS TotalRows
FROM sys.partitions
WHERE index_id in (0,1)
AND object_id = OBJECT_ID('TableName')
GROUP BY object_id

The advantage of this approach is that it is fast. Since it’s not actually counting anything and, in fact, isn’t even accessing the table that’s being counted, it’s the fastest way to get the count of rows in the table.

The disadvantage is that it can only get the number of rows in the table and cannot consider any criteria at all. It also may not be 100% accurate, depending on how and when the table’s rowcount metadata is updated by the SQL engine. (more…)
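A closely related alternative (a sketch, not from the post itself) on SQL 2005 and higher is sys.dm_db_partition_stats, which carries the same speed advantage and the same accuracy caveat:

SELECT SUM(row_count) AS TotalRows
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('TableName')
AND index_id IN (0,1)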