Performance

and other pointless query rewrites

In a similar vein to last week’s blog post… I heard an interesting comment recently. “Change that Column != 2 to a Column > 2 or Column < 2 combination, it can use indexes better.”

Sounds like something that clearly needs testing!

I’ll start with simple numbers table.

CREATE TABLE dbo.Numbers (
  Number INT NOT NULL
);

ALTER TABLE Numbers ADD CONSTRAINT PK_Numbers
  PRIMARY KEY CLUSTERED (Number);

and put 1 million rows into it

INSERT INTO dbo.Numbers (Number)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT 1))
  FROM msdb.sys.columns c CROSS JOIN msdb.sys.columns c2;

Let’s start, before we get into comparing things, with looking at the execution plan of a query with a != (or <>) operator.

SELECT Number
  FROM Numbers
  WHERE Number <> 12742; -- because 2 is on the first page of the index, and I don’t want any special cases here

InequalityExecutionPlan

That’s kinda complicated for a query with one table and one predicate. Let’s look at in pieces. The easiest place to start is the Clustered Index Seek. The seek predicate on the clustered index seek is

Seek Keys[1]: Start: [Test].[dbo].[Numbers].Number > Scalar Operator([Expr1009]), End: [Test].[dbo].[Numbers].Number < Scalar Operator([Expr1010])

Hmm…Looks like the parser/optimiser has already made our intended change for us. There’s some funky stuff in the top part of the plan, but what it’s essentially doing is generating two rows for the nested loop join, both with just the value that we’re excluding from the query, then the seek runs twice. I suspect that’s once for the less than 12742 and once for the greater than 12742 portions of the original predicate.

But, let’s do the full due diligence, the plan may not tell the whole story.

The performance numbers for the inequality form of the query, gathered via Extended Events and aggregated with Excel are:

Duration 122ms
CPU 105ms
Logical reads: 1619

This is our baseline, the numbers we’re comparing against. If the comment mentioned at the beginning is correct, then the revised query will have a significantly better performance.

The revised query is:

SELECT Number
  FROM Numbers
  WHERE Number > 12742 OR Number < 12742;

Execution plan is much simpler, no constant scans, no joins. Just a single index seek operation that executes once.

InequalityRevised

Is is better though?

Duration: 126ms
CPU: 103ms
Logical reads: 1619

No, it’s not.

Yes, we have a simpler plan, but we do not have a more efficient query. We’re still reading every page in the index and fetching all but one row of the table. The work required is the same, the performance characteristics are the same.

But, maybe, if the numbers aren’t unique and we’re excluding more than just one row it’ll be different.

That needs a slightly different table to test on.

CREATE TABLE MoreNumbers (
SomeNumber INT NOT NULL,
Filler CHAR(100)
);

CREATE CLUSTERED INDEX idx_MoreNumbers ON dbo.MoreNumbers(SomeNumber);
GO

INSERT INTO dbo.MoreNumbers (SomeNumber, Filler)
SELECT TOP (500000) NTILE(50) OVER (ORDER BY (SELECT 1)), ''
FROM msdb.sys.columns c CROSS JOIN msdb.sys.columns c2;

I’m just going to look at the performance characteristics this time. The execution plans are the same as for the earlier query. The two queries are:

SELECT * FROM dbo.MoreNumbers WHERE SomeNumber != 24;
SELECT * FROM dbo.MoreNumbers WHERE SomeNumber < 24 OR SomeNumber > 24;

Query 1:

Duration 97ms
CPU 77ms
Logical Reads 7624

Query 2:

Duration 98ms
CPU 75ms
Logical Reads 7624

Again identical.

Just like with the pointless WHERE clause predicate last week, we have a query change that has had no effect on the query performance. Now, to be honest, there are some query form changes that can improve performance. For example, converting a set of OR predicates to UNION can improve query performance sometimes (and leave it unchanged in others), and so these kinds of rewrites do need to be tested to see if they’re useful.

More importantly though, those of us who are posting on forums and advising others have a responsibility to do these tests before we recommend changes to others, as they may very well not do them. If we don’t, we’re propagating myths and not helping the advancement of our field.

On the addition of useless where clauses

I remember a forum thread from a while back. The question was on how to get rid of the index scan that was in the query plan. Now that’s a poor question in the first place, as the scan might not be a problem, but it’s the first answer that really caught my attention.

Since the primary key is on an identity column, you can add a clause like ID > 0 to the query, then SQL will use an index seek.

Technically that’s correct. If the table has an identity column with the default properties (We’ll call it ID) and the clustered index is on that identity column, then a WHERE clause of the form WHERE ID > 0 AND <any other predicates on that table> can indeed execute with a clustered index seek (although it’s in no way guaranteed to do so). But is it a useful thing to do?

Time for a made up table and a test query.

CREATE TABLE dbo.Orders(
  OrderID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
  OrderDate DATETIME2(7) NULL,
  ShipmentRef CHAR(10) NULL,
  ShipmentDate DATE NULL,
  Status VARCHAR(20) NOT NULL
);

That’ll do the job. And then a few hundred thousand rows via SQL Data Generator and we’re good to go.

And for a query that has a nasty index scan, how about

SELECT  OrderDate,
    ShipmentDate,
    Status
  FROM    dbo.Orders
  WHERE   LTRIM(RTRIM(Status)) = 'Delivered';

ClusteredIndexScan

Now, that’s running as a clustered index scan because the predicate’s not SARGable and besides, there’s no index on that column, but let’s pretend we don’t know that.

If I add a WHERE clause predicate that filters no row out, can I get a query plan with an index seek?

SELECT  OrderDate,
    ShipmentDate,
    Status
  FROM    dbo.Orders
  WHERE   LTRIM(RTRIM(Status)) = 'Delivered'
    AND OrderID > 0;

Why yes, I can.

ClusteredIndexSeek

Op Success? Well…

The goal of performance tuning is to improve the performance of a query, not to change operators in a query plan. The plan is a tool, not a goal.

Have we, by adding a WHERE clause predicate that filters out no rows, improved performance of the query? This needs an extended events session to answer. Nothing fancy, just a sql_statement_completed event will do the trick.

I ran each query 10 times, copied the captured events into Excel and averaged them:

Query with just the LTRIM(RTRIM(Status)) = ‘Delivered’
CPU: 77ms
Duration: 543ms

Query with LTRIM(RTRIM(Status)) = ‘Delivered’ AND OrderID > 0
CPU: 80ms
Duration: 550ms

We haven’t tuned that query. I won’t say we’ve made it slower either, the differences are well within the error range on our measuring, but there’s definitely no meaningful performance gain.

There’s no gain because we haven’t changed how the query executes. A scan, and in this case it will be a scan of the entire index, will likely use the non-leaf levels of the b-tree to locate the logical first page of the leaf level, then will read the entire leaf level. The seek we managed to generate will use the b-tree to find the value 0 in the clustered index key, that’s what makes it a seek. Since the column is an identity starting at 1, that means the first row read will be on the logical first page of the leaf level, then it will read the entire leaf level.

Both will do the same amount of work, and so we haven’t done anything useful to the query by adding a WHERE clause that filters out no rows.

Scans are not always bad. If a query needs to read every row of a table, that’s a scan and effort shouldn’t be expended trying to make it an index seek.

To improve the performance of a query, we need to make changes that reduce the work needed to run the query. That often starts with reducing the amount of data that the query reads, by changing the query so that it can use indexes effectively and/or adding indexes to support the query. Not by adding pointless pieces to a query just to change plan operators from ones that are believed to be bad to ones that are believed to be good. Doing that is just a waste of time and effort.

Natively compiled user-defined functions

One new thing that SQL Server 2016 has added is the ability to natively compile user-defined functions. Previously, native compilation, part of the hekaton feature, was limited to stored procedures.

When I saw that, the first question that came to mind is whether natively compiling a scalar function reduces the overhead when calling that function within another query. I’m not talking about data-accessing scalar UDFs, since natively compiled functions can only access in-memory tables, but functions that do simple manipulation of the parameters passed in. String formatting, for example, or date manipulation.

While not as harmful as data-accessing scalar UDFs, there’s still overhead as these are not inline functions, they’re called for each row in the resultset (as a look at the Stored Procedure Completed XE event would show), and the call to the function takes time. Admittedly not a lot of time, but when it’s on each row of a large resultset the total can be noticeable.

I decided to do a simple test. A query against a table with ~600k rows, one query with a traditional scalar function, one with a natively compiled function and one with the function’s contents in the query as a column.

The two functions:

CREATE FUNCTION dbo.DateOnly (@Input DATETIME)
RETURNS DATETIME
AS
BEGIN
  RETURN DATEADD(dd, DATEDIFF (dd, 0, @Input), 0);
END
GO

and

CREATE FUNCTION dbo.DateOnlyHekaton (@Input DATETIME)
RETURNS DATETIME
WITH NATIVE_COMPILATION, SCHEMABINDING
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'English')
  RETURN DATEADD(dd,DATEDIFF(dd,0,@Input),0);
END
GO

And the query without the function would be a simple

SELECT DATEADD(dd, DATEDIFF(dd, 0, SomeDate), 0)
FROM SomeTable;

Durations and CPU usage were caught with Extended Events. I ran each query 25 times and aggregated the results.

Average CPU (ms) Average Duration (ms)
In-line expression 289 294
Normal function 3555 3814
Natively Compiled Function 3318 3352

Not quite what I hoped. While the overhead of the natively compiled function is lower, it’s lower only by about 10%, which really is not worth it, now when we’re talking about an order of magnitude difference from the query without the function call.

Looks like the guidance is still going to be to not use scalar UDFs within other queries.

Q&A from the DBA Fundamentals Virtual Chapter

A couple of weeks ago I did a presentation to the DBA Fundamentals virtual chapter. The presentation title was “What execution plans can tell you about query performance”

The slides and recording are available at the Virtual Chapter’s home page

I didn’t manage to get all of the questions answered, so here are a couple of slightly more involved questions which didn’t get answered.

Does the order of table matter when doing an inner join?

Short answer: No.

Long answer: Maybe, but it shouldn’t.

The optimiser decides which table is joined in which order. Putting a table first in the join clause does not mean it will be the first one processed. In general (as in, in ~99% of cases), put the tables in the join clause in the order which makes logical sense for the query.

Changing table order can, in some cases, change the plan. This doesn’t mean that SQL uses the order which the tables are specified in to determine the plan, it just means that changing the query resulted in the optimiser searching through the plan search space in a different way and finding a different ‘good enough’ plan. It’s not going to be deterministic and hence shouldn’t be relied on.

Will moving a filter from the WHERE to the INNER JOIN improve performance?

No, but again it can change the plan generated as described above. Personally I prefer joins in the JOIN clause and filters in the WHERE clause, because that’s what’s normal and expected.

Please note that moving filters from/to the WHERE clause from an OUTER JOIN changes the logic of the query and likely the results.

If multiple users are running the same query with different parameter values, will it result in different plans or recompiles?

Neither.

There will be one plan in cache (unless the SET options differ, but let’s ignore that for now). No matter what the parameter values are, when the same query is run, the plan will be fetched from cache and used.

Does index fragmentation have an effect on the join type chosen?

The Query Optimiser has no idea what logical fragmentation is. It doesn’t base its choices on how the pages are laid out in the data file. Logical fragmentation affects large range scans from disk, that’s all. If the pages are in memory, then fragmentation has no further effect.

Converting OR to Union

When I looked at indexing for queries containing predicates combined with OR, it became clear that the are some restrictive requirements for indexes for the optimiser to consider using the indexes for seek operations.

  • Each predicate (or set of predicates) combined with an OR must have a separate index
  • All of those indexes must be covering, or the row count of the concatenated result set low enough to make key lookups an option, as the optimiser does not apparent to consider the possibility of doing key lookups for a subset of the predicates before concatenating the result sets.

So what can be done if it’s not possible to meet those requirements?

The standard trick is to convert the query with ORs into multiple queries combined with UNION. The idea is that since OR predicates are evaluated separately and the result sets concatenated, we can do that manually by writing the queries separately and concatenating them using UNION or UNION ALL. (UNION ALL can only be safely used if the predicates are known to be mutually exclusive)

CREATE TABLE Persons (
PersonID INT IDENTITY PRIMARY KEY,
FirstName    VARCHAR(30),
Surname VARCHAR(30),
Country CHAR(3),
RegistrationDate DATE
)

CREATE INDEX idx_Persons_FirstName ON dbo.Persons (FirstName) INCLUDE (Surname)
CREATE INDEX idx_Persons_Surname ON dbo.Persons (Surname) INCLUDE (FirstName)
GO

-- Data population using SQLDataGenerator

SELECT FirstName, Surname
FROM dbo.Persons
WHERE FirstName = 'Daniel' OR Surname = 'Barnes'

SELECT FirstName, Surname
FROM dbo.Persons
WHERE FirstName = 'Daniel'
UNION
SELECT FirstName, Surname
FROM dbo.Persons
WHERE Surname = 'Barnes'

In this case, the OR can be replaced with a UNION and the results are the same. The Union form is slightly less efficient according to the execution plan’s costings (60% compared to the OR at 40%), and the two queries have the same general form, with two index seeks and some form of concatenation and remove duplicates.

OrResult1
OrExecPlan1

So in that case it worked fine, although the original form was a little more efficient
(more…)

To TOP or not to TOP an EXISTS

Earlier this year I had a look at a query pattern that I often see on forums and in production code, that of the Distinct within an IN subquery. Today I’m going to look at a similar patters, that being the use of TOP 1 within an EXISTS subquery.

Three tests. First a straightforward exists with no correlation (no where clause linking it to an outer query). Second, an exists with a complex query (one with a non-sargable where clause and a group by and having). Third an exists subquery correlated to the outer query.

Table structures are nice and simple, in fact, for ease I’m going to use the same tables as I did back on the exists, in and inner join tests. Code to create and populate the tables it attached to the end of the post.

First up, a simple exists query, in an IF, just to be different.

IF EXISTS (SELECT 1 FROM PrimaryTable_Medium)
PRINT 'Exists'

IF EXISTS (SELECT TOP (1) 1 FROM PrimaryTable_Medium)
PRINT 'Exists too'

For a benchmark, a SELECT 1 FROM PrimaryTable_Medium has the following IO characteristics

Table ‘PrimaryTable_Medium’. Scan count 1, logical reads 89, physical reads 0.

SQL Server Execution Times:
CPU time = 15 ms,  elapsed time = 510 ms.

Ignore the elapsed time, that’s likely mostly from displaying the records. I’m going to focus mostly on the CPU and IO.

Execution plans of the two exists variations are absolutely identical.

TopExists1

The index operators are scans because there is no way they could be anything else, there’s no predicate so a seek is not possible. That said, it’s not a full index scan. The properties of the Index Scan show 1 row only (actual and estimated). So SQL did not read the entire index, just enough to evaluate the EXISTS, and that’s what it did in both cases. IO stats confirm that.

(more…)

Statistics, row estimations and the ascending date column

SQL’s auto-updating statistics go a fair way to making SQL Server a self-tuning database engine and in many cases they do a reasonably good job

However there’s one place where the statistics’ auto-update fails often and badly. That’s on the large table with an ascending column where the common queries are looking for the latest rows.

Let’s have a look at a common scenario.

We have a large table (imaginatively called ‘Transactions’) with a date time column (even more imaginatively called ‘TransactionDate’). This table gets about 80,000 new records a day and currently has around 8,000,000 records in it. So we can say roughly that another 1% is added to the table size daily. No records are updated and there’s a monthly purge of old data so that the total size remains about the same. A reasonably common real-life scenario.

CREATE TABLE Accounts (
AccountID INT IDENTITY PRIMARY KEY,
AccountNumber CHAR(8),
AccountType CHAR(2),
AccountHolder VARCHAR(50),
Filler CHAR(50) -- simulating other columns
)

CREATE TABLE Transactions (
TransactionID INT IDENTITY PRIMARY KEY NONCLUSTERED,
AccountID INT NOT NULL FOREIGN KEY REFERENCES Accounts (AccountID),
TransactionDate DATETIME NOT NULL DEFAULT GETDATE(),
TransactionType CHAR(2),
Amount NUMERIC(18,6),
Filler CHAR(150) -- Simulating other columns
)
GO
CREATE CLUSTERED INDEX idx_Transactions_TransactionDate
ON Transactions (TransactionDate)

CREATE NONCLUSTERED INDEX idx_Transactions_AccountID
ON Transactions (AccountID)

CREATE NONCLUSTERED INDEX idx_Accounts_AccountType
ON Accounts (AccountType)

-- Using RedGate's SQLDataGenerator to generate some data for this.

Accounts Transactions

Day 1 of the month, the indexes have just been rebuilt (after the data purge) and the statistics associated with those have been updated. The latest value in the TransactionDate column is ‘2011/01/31′ and the last value in the statistics histogram is ‘2011/01/31′. Life is good.

Day 2 of the month, there have been 80,000 new records added for the previous day. Only 1% of the table has been updated, so the automatic statistics update would not have triggered. The latest value in the TransactionDate column is ‘2011/02/01′ and the last value in the statistics histogram is ‘2011/01/31′. Doesn’t look like a problem.

Fast forwards another couple of days. Day 5 of the month. By this point 300,000 rows have been added since the beginning of the month. This amounts to around 5% of the table. Hence the statistics auto-update (triggered at 20%) still would not have run. The latest value in the TransactionDate column is ‘2011/02/04′ and the last value in the statistics histogram is ‘2011/01/31′. Starting to look less than ideal.

So, what kind of effect does this have on the queries against that table? (more…)

Are int joins faster than string joins?

This one comes up a lot on the forums, often as advice given…

“You should use integers for the key rather than strings. The joins will be faster.”

It sounds logical. Surely integers are easier to compare than large complex strings. Question is, is it true?

This is going to take lots and lots of rows to get any significant results. I’m going to do two sets of tests. The first comparing query execution speed between string and integer data types on the join column while keeping the size of the join column the same between the two. This is to tell if there’s a difference just because of the data type.

The second test will have both the data type and the size of the join columns differing, while the total size of the table row will be kept the same. This is to answer the ‘string joins are slower because they are larger’ argument.

Test 1: Same key size, no indexes

The two tables have the same size join column – a bigint in one and a char(8) in the other.

SELECT t1.ID, t2.ID, t1.IntForeignKey, t2.SomeArbStatus
FROM dbo.TestingJoinsInt t1
INNER JOIN dbo.LookupTableInt t2 ON t1.IntForeignKey = t2.ID
GO

SELECT t1.ID, t2.ID, t1.StrForeignKey, t2.SomeArbStatus
FROM dbo.TestingJoinsString t1
INNER JOIN dbo.LookupTableString t2 ON t1.StrForeignKey = t2.ID
GO

First up, a check of what Statistics IO and Statistics Time show for the two and whether there’s any difference in execution plan. (Full repo code is available for download, link at the end of the post)

Int joins

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘TestingJoinsInt’. Scan count 1, logical reads 66036, physical reads 0.
Table ‘LookupTableInt’. Scan count 1, logical reads 735, physical reads 0.

SQL Server Execution Times:
CPU time = 2433 ms,  elapsed time = 32574 ms.

IntJoins1

String joins

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘TestingJoinsString’. Scan count 1, logical reads 66036, physical reads 0.
Table ‘LookupTableString’. Scan count 1, logical reads 735, physical reads 0.

SQL Server Execution Times:
CPU time = 3744 ms,  elapsed time = 33947 ms.

StringJoins1

(more…)

Is a clustered index best for range queries?

I see a lot of advice that talks about the clustered index been the best index for use for range queries, that is queries with inequalities filters, queries that retrieve ranges of rows, as opposed to singleton queries, queries that retrieve single rows (including, unfortunately, a Technet article).

I suspect the reasoning behind this advice is the idea that the clustered index stores the data in order of the clustering key (ack) and hence it’s ‘logical’ that such a structure would be best for range scans as SQL can simply start at the beginning of the range and read sequentially to the end.

Question is, is that really the case?

Let’s do some experiments and find out.

CREATE TABLE TestingRangeQueries (
ID INT IDENTITY,
SomeValue NUMERIC(7,2),
Filler CHAR(500) DEFAULT ''
)

-- 1 million rows
INSERT INTO TestingRangeQueries (SomeValue)
SELECT TOP (1000000) RAND(CAST(a.object_id AS BIGINT) + b.column_id*2511)
FROM msdb.sys.columns a CROSS JOIN msdb.sys.columns b

-- One cluster and two nonclustered indexes on the column that will be used for the range filter

CREATE CLUSTERED INDEX idx_RangeQueries_Cluster
ON TestingRangeQueries (ID)

CREATE NONCLUSTERED INDEX idx_RangeQueries_NC1
ON TestingRangeQueries (ID)

CREATE NONCLUSTERED INDEX idx_RangeQueries_NC2
ON TestingRangeQueries (ID)
INCLUDE (SomeValue)
GO

The query that I’ll be testing with will do a sum of the SomeValue column for a large range of ID values. That means that of the three indexes that I’m testing, one is clustered, one is a nonclustered that does not cover the query and the third is a covering nonclustered index.

SELECT SUM(SomeValue)
FROM TestingRangeQueries
WHERE ID BETWEEN 20000 and 200000 -- 180 001 rows, 18% of the table

I’m going to run the same range scan query three times, each with an index hint so that SQL will use the three different indexes, regardless of which one it thinks is best.

First up, the clustered index.

As expected, we get a clustered index seek (the predicate is SARGable) and a stream aggregate.

ClusteredIndex

Table ‘TestingRangeQueries’. Scan count 1, logical reads 12023, physical reads 0.

SQL Server Execution Times:
CPU time = 94 ms,  elapsed time = 110 ms.

(more…)

Distincting an IN subquery

This is going to be a quick one…

I keep seeing forum code (and production code) that includes the DISTINCT in IN or EXISTS subqueries. The rationale is given either as a performance enhancement or as necessary for correct results.

Is it necessary or useful? Only one way to find out.

Let’s check for correct results first, because that can be done with nice small tables.

CREATE TABLE DistinctOuter (
ID INT
);

CREATE TABLE DistinctInner (
ID INT
);

INSERT INTO DistinctOuter
VALUES (1), (2), (3), (4), (5), (6), (7), (8)

INSERT INTO DistinctInner
VALUES (1), (2), (2), (2), (2), (4), (6), (7)

DistinctIN

(more…)