Author Archive: Gail

Running SQL faster

Poor performance is not uncommon in most SQL Server environments. The data sizes are growing, the hardware isn’t and traditional methods of performance tuning are time-consuming and difficult.

So what’s the solution? Well, throwing hardware at the problem is an old favourite. There are few workloads that a nice 256-processor Itanium with a terabyte or two of memory won’t handle, but servers like that are a little on the expensive side and lots of money spent on expensive hardware means less that can be spent on annual bonuses.

There is another option, a hidden, undocumented option that can improve query performance, maybe a little, maybe substantially.

First thing that you need to do to get this one is to enable the hidden options in sp_configure. This is done much the same way as the advanced options.

1
2
exec sp_configure 'show hidden options', 1
RECONFIGURE WITH EXTREME OVERRIDE

Once that’s done, the undocumented option can be enabled.

1
2
exec sp_configure 'run queries faster', 101010
 RECONFIGURE WITH EXTREME OVERRIDE

How much improvement this will give depends on the kind of queries being run. OLTP systems usually see a greater improvement than decision-support, unless there’s full text search or spatial queries, in which case there will likely be substantially less of a gain.

Now, there are a few things to consider.

  1. This is obviously undocumented and that means unsupported.
  2. It may not work on the next version of SQL.
  3. If you call support, disable the option first and don’t tell them you were running it!

.

Happy April Fools’ day.

.

..

….

Seriously now, there’s no options that, when enabled, makes SQL run queries faster. There is no silver bullet for performance problems, there is no one-size-fits-all fix.

Fixing performance problems involves finding the current bottleneck and removing it, then repeating that operation until performance is acceptable. It’s a complex area and there’s a lot to it. Simply throwing hardware at the problem may not produce much, if any, performance gain, especially if the hardware wasn’t the bottleneck.

If you have a query performance problem and don’t know where to start, ask on one of the SQL forums (like SQLServerCentral) if it’s not an urgent problem. If it is, or if there are serious problems, consider getting a consultant in to help out. One of the quickest ways to learn is to learn from someone who knows what they are doing.

Left outer join vs NOT EXISTS

And to wrap up the miniseries on IN, EXISTS and JOIN, a look at NOT EXISTS and LEFT OUTER JOIN for finding non-matching rows.

For previous parts, see

I’m looking at NOT EXISTS and LEFT OUTER JOIN, as opposed to NOT IN and LEFT OUTER JOIN, because, as shown in the previous part of this series, NOT IN behaves badly in the presence of NULLs. Specifically, if there are any NULLs in the result set, NOT IN returns 0 matches.

The LEFT OUTER JOIN, like the NOT EXISTS can handle NULLs in the second result set without automatically returning no matches. It behaves the same regardless of whether the join columns are nullable or not. Seeing as NULL does not equal anything, any rows in the second result set that have NULL for the join column are eliminated by the join and have no further effect on the query.

It is important, when using the LEFT OUTER JOIN … IS NULL, to carefully pick the column used for the IS NULL check. It should either be a non-nullable column (the primary key is a somewhat classical choice) or the join column (as nulls in that will be eliminated by the join)

Onto the tests

The usual test tables…

01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
21
22
CREATE TABLE BigTable (
id INT IDENTITY PRIMARY KEY,
SomeColumn char(4) NOT NULL,
Filler CHAR(100)
)
 
CREATE TABLE SmallerTable (
id INT IDENTITY PRIMARY KEY,
LookupColumn char(4) NOT NULL,
SomeArbDate Datetime default getdate()
)
 
INSERT INTO BigTable (SomeColumn)
SELECT top 250000
char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) +
char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
from master.sys.columns a cross join master.sys.columns b
 
INSERT INTO SmallerTable (LookupColumn)
SELECT DISTINCT SomeColumn
FROM BigTable TABLESAMPLE (25 PERCENT)
-- (3918 row(s) affected)

First without indexes

1
2
3
4
5
6
7
8
-- Query 1
SELECT BigTable.ID, SomeColumn
  FROM BigTable LEFT OUTER JOIN SmallerTable ON BigTable.SomeColumn = SmallerTable.LookupColumn
  WHERE LookupColumn IS NULL
 
-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE NOT EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

Let’s take a look at the execution plans

LeftOuterJoinNotIN_NotIndexed

(more…)

The Root of all Evil

Or “Shot gun query tuning

There have been a fair few forums questions in recent months asking for help in removing index scans, loop joins, sorts or other, presumed, slow query operators. There have been just as many asking how to replace a subquery with a join or a join with a subquery or similar aspects of a query usually for performance reasons.

The first question that I have to ask when looking at requests like that is “Why?”

Why is removing a particular query operator the goal? Why is changing a where clause predicate the goal? If it’s to make the query faster, has the query been examined and has it been confirmed that query operator or predicate really is the problem?

The title of this post refers to a comment I’ve seen again and again in blogs or articles about front-end development. “Premature optimisation is the root of all evils.” It’s true in the database field as well.

(more…)

NOT EXISTS vs NOT IN

Continuing with the mini-series on query operators, I want to have a look at NOT EXISTS and NOT IN.

Previous parts of this miniseries are:

Just one note before diving into that. The examples I’m using are fairly simplistic and that’s intentional. I’m trying to find what, if any, are the performance differences in a benchmark-style setup. I’ll have some comments on more complex examples in a later post.

The most important thing to note about NOT EXISTS and NOT IN is that, unlike EXISTS and IN,  they are not equivalent in all cases. Specifically, when NULLs are involved they will return different results. To be totally specific, when the subquery returns even one null, NOT IN will not match any rows.

The reason for this can be found by looking at the details of what the NOT IN operation actually means.

Let’s say, for illustration purposes that there are 4 rows in the table called t, there’s a column called ID with values 1..4

1
WHERE SomeValue NOT IN (SELECT AVal FROM t)

is equivalent to

1
2
3
4
5
6
7
8
9
WHERE (
SomeValue != (SELECT AVal FROM t WHERE ID=1)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=2)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=3)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=4)
)

Let’s further say that AVal is NULL where ID = 4. Hence that != comparison returns UNKNOWN. The logical truth table for AND states that UNKNOWN and TRUE is UNKNOWN, UNKNOWN and FALSE is FALSE. There is no value that can be AND’d with UNKNOWN to produce the result TRUE

Hence, if any row of that subquery returns NULL, the entire NOT IN operator will evaluate to either FALSE or UNKNOWN and no records will be returned

So what about EXISTS?

Exists cannot return NULL. It’s checking solely for the presence or absence of a row in the subquery and, hence, it can only return true or false. Since it cannot return NULL, there’s no possibility of a single NULL resulting in the entire expression evaluating to UNKNOWN.

Hence, when the column in the subquery that’s used for comparison with the outer table can have nulls in it, consider carefully which of NOT EXISTS or NOT IN you want to use.

Ok, but say there are no nulls in the column. How do they compare speed-wise. I’m going to do two tests, one where the columns involved in the comparison are defined as NULL and one where they are defined as NOT NULL. There will be no NULL values in the columns in either case. In both cases, the join columns will be indexed. After all, we all index our join columns, right?

So, first test, non-nullable columns. First some setup

01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
CREATE TABLE BigTable (
id INT IDENTITY PRIMARY KEY,
SomeColumn char(4) NOT NULL,
Filler CHAR(100)
)
 
CREATE TABLE SmallerTable (
id INT IDENTITY PRIMARY KEY,
LookupColumn char(4) NOT NULL,
SomeArbDate Datetime default getdate()
)
 
INSERT INTO BigTable (SomeColumn)
SELECT top 250000
char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) +
char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
from master.sys.columns a cross join master.sys.columns b
 
INSERT INTO SmallerTable (LookupColumn)
SELECT DISTINCT SomeColumn
FROM BigTable TABLESAMPLE (25 PERCENT)
-- (3898 row(s) affected)
 
CREATE INDEX idx_BigTable_SomeColumn
ON BigTable (SomeColumn)
CREATE INDEX idx_SmallerTable_LookupColumn
ON SmallerTable (LookupColumn)

Then the queries

1
2
3
4
5
6
7
-- Query 1
SELECT ID, SomeColumn FROM BigTable
WHERE SomeColumn NOT IN (SELECT LookupColumn FROM SmallerTable)
 
-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE NOT EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

The first thing to note is that the execution plans are identical.

ExecPlansNOTNULL

The execution characteristics are also identical.

Query 1
Table ‘BigTable’. Scan count 1, logical reads 342, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 8, physical reads 0.

SQL Server Execution Times:
CPU time = 156 ms,  elapsed time = 221 ms.

Query 2
Table ‘BigTable’. Scan count 1, logical reads 342, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 8, physical reads 0.

SQL Server Execution Times:
CPU time = 156 ms,  elapsed time = 247 ms.

So, at least for the case where the columns are defined as NOT NULL, these two perform the same.

What about the case where the columns are defined as nullable? I’m going to simply alter the two columns involved without changing anything else, then test out the two queries again.

1
2
3
4
5
ALTER TABLE BigTable
 ALTER COLUMN SomeColumn char(4) NULL
 
ALTER TABLE SmallerTable
 ALTER COLUMN LookupColumn char(4) NULL

And the same two queries

1
2
3
4
5
6
7
8
-- Query 1
 
SELECT ID, SomeColumn FROM BigTable
WHERE SomeColumn NOT IN (SELECT LookupColumn FROM SmallerTable)
 
-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE NOT EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

And as for their performance…

ExecPlansNull

Query 1
Table ‘SmallerTable’. Scan count 3, logical reads 500011, physical reads 0.
Table ‘BigTable’. Scan count 1, logical reads 437, physical reads 0.

SQL Server Execution Times:
CPU time = 827 ms,  elapsed time = 825 ms.

Query 2
Table ‘BigTable’. Scan count 1, logical reads 437, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 9, physical reads 0.

SQL Server Execution Times:
CPU time = 156 ms,  elapsed time = 228 ms.

Radically different execution plans, radically different performance characteristics. The NOT IN took over 5 times longer to execute and did thousands of times more reads.

Why is that complex execution plan required when there may be nulls in the column? I can’t answer that one, probably only one of the query optimiser developers can, however the results are obvious. When the columns allow nulls but has none, the NOT IN performs significantly worse than NOT EXISTS.

So, take-aways from this?

Most importantly, NOT EXISTS and NOT IN do not have the same behaviour when there are NULLs involved. Chose carefully which you want.

Columns that will never contain NULL values should be defined as NOT NULL so that SQL knows there will never be NULL values in them and so that it doesn’t have to produce complex plans to handle potential nulls.

On non-nullable columns, the behaviour and performance of NOT IN and NOT EXISTS are the same, so use whichever one works better for the specific situation.

One more to go on this: LEFT OUTER JOIN with the IS NULL check vs NOT IN

SQL Server Usergroup – February meeting

The February meeting of the SA SQL Server usergroup will be on the 16th of February 2010. Venue and time are the same as always, 18h30 at the Microsoft offices

This month, Richard Sweetnam, one of Microsoft SA’s Premier Field Engineers will be presenting on Tips and Tricks for Management Studio.

Hope to see you all there.

Genetic Algorithms

Warning. This post has absolutely nothing to do with SQL Server.

What are Genetic Algorithms?

Genetic algorithms are a form of evolutionary computation, a branch of artificial intelligence that focuses on evolving effective or optimal solutions to difficult problems, based on the biological theory of evolution.

Genetic algorithms are, at their core, a search/optimisation technique. They are a way of finding maximum/minimum solutions to problems and, can be effective when there is no algorithmic solution to the problem. An example here would be the ‘Travelling Salesman’ problem.

Genetic algorithms work by taking an initial population of potential solutions (referred to as individuals), selecting a subset of the population that has the highest fitness then using that subset to generate a second generation. From the second generation again a subset with the highest fitness is selected and used to generate a third generation. This repeats until either the ‘fittest’ individual is considered a good enough solution, or until a certain number of generations have passed.

There are advantage to using genetic algorithms to solve problems over more traditional methods like hill climbing.

  • Genetic algorithms can quickly produce good solutions though they may take a lot of time to find the best solution. This is a benefit when the problem is such that the absolute best solution is not necessary, just one that is ‘good enough’
  • They are not susceptible to getting trapped by local maxima.
  • They do not work on the entire search space one potential solution at a time, but rather work on populations of potential solutions, focusing towards more optimal areas of the search space.

A genetic algorithm will almost always find an optimal solution, given enough time. The main downside is that they may take a lot of time to find that optimal solution.

(more…)

Getting here from there

Another month, another blog chain, this time started by Paul Randal. I got tagged by both Grant and Steve, on the same day.

I could easily think of two events that dramatically influenced where I am today, finding a third with as major an impact was difficult. I think the third one qualifies as an important enough event, while it didn’t really affect my career, it did influence my community involvement.

I canna take it anymore

I grew up surrounded by two things, computers and science fiction.

My father was a computer programmer in those days (today he runs a software company) and there were computers around from the earliest I remember. From the Sharp that I played Asteroids and The Valley on, to the NCR with it’s beeping keyboard where I first started programming (in a variant of basic), to the 80286 that my father gave me when he bought himself something faster. I’ve always had computers around that I could use. Despite that, I never had any intention of going into IT as a career.

My mother is a trekkie (classic Star Trek only please) so I grew up watching (and reading) lots of Science Fiction. From Star Trek to Dr Who to Battlestar Galactica to the entire science fiction collection at the local library I watched and read everything I could get my hands on, and it wasn’t long before I started reading Science fact as well as Science fiction. By the time I got to high school my career plans were leaning in the direction of Physics and Astronomy. Placing very high in the national Science Olympiad and almost winning a trip to Space Camp just strengthened those intentions.  I enjoyed playing with computers, but that was more a hobby (and, by that point, a place to play games)

I entered university with the intention to major in Physics, take a related subject as my second major and then get an Honours degree1 in Physics and find a job in astronomy or physics research. I took Computer Science as my second major because it was one of the few subjects that I was interested in that didn’t conflict with the other subjects I had to take (Chemistry 1 and Maths 1) I spend most of my spare time in my first two years in the Physics department library. I reckon that I must have read easily a third of that library in those two years

Just two problems with that intention. Firstly, there’s almost no demand in this country for physicists other than the universities and the national observatory. Secondly, by the time I got to 3rd year physics, I couldn’t handle the Maths involved. It was part way through the course on Quantum Physics (which contained more maths than some of the 3rd year maths courses did) that I realised that if I couldn’t handle the maths at this point, there was no way I’d ever be able to get a post-grad degree in physics.

I finished the Bachelors degree majoring in Physics and Computer Science and then applied for the honours degree in the Computer Science department

(1) In South Africa the Honours degree is a one year post-grad degree that sits between the Bachelors degree and the Masters degree.

(more…)

SQL Pass session evaluations

I finally got the last of my PASS Summit session evals and so, like some other people, I thought I’d make them public.

Lies, damned lies and statistics (DBA-388-S)

This session went very well. I was comfortable with the material, it’s a topic I really like and in general it felt, to me at least, like a good session. The ratings seem to agree with that.

Very Poor Poor Average Good Excellent
How would you rate the usefulness of the session information in your day-to-day environment? 1 7 36
How would you rate the Speaker’s presentation skills? 3 5 36
How would you rate the Speaker’s knowledge of the subject? 4 40
How would you rate the accuracy of the session title, description, and experience level to the actual session? 5 39
How would you rate the amount of time allocated to cover the topic/session? 11 33
How would you rate the quality of the presentation materials? 1 7 36

If I make Very Poor = 1 and Excellent = 5 then, averaging all the scores over all the questions, overall that session rated at 4.82/5

Not bad at all.

Edit: The overall PASS Summit session ratings are out and this session came in at 7th overall (all sessions including pre/post cons, all tracks) and 5th in the DBA track, behind only Buck Woody, Kimberly Tripp and Paul Randal I am extremely surprised to have come in that high at a conference like the PASS Summit.

Insight into Indexes (DBA-315)

This session was a whole different story. It did not go well at all, and I didn’t need the ratings to tell me that.

I wasn’t overly comfortable with the material. This is not to say that I didn’t know it, I did, but I wasn’t comfortable with it. In retrospect, I should have scrapped the entire presentation and done it over from scratch in a different way, even if that meant doing it the night before. Lesson learnt there.

To add to that, I broke my own rules for presentations. Usually I’m at the session room at least 5 minutes before the previous session finishes, with my laptop booted, the presentation loaded, management studio (and profiler if necessary) open and any pre-demo scripts already run. That way, as soon as the speaker who’s presenting in the session before mine finishes, I can get on stage, plug the laptop in, get the projector online and then relax.

In this case, I was late. The previous speaker had already left and my laptop was still switched off. Hence I rushed to get everything loaded and ready, and Windows, sensing the urgency, promptly crashed hard.

Cue 2 minutes of frantically trying to reboot laptop (it was ignoring all shut down requests) and load presentation onto the desktop in case my laptop didn’t reboot. All while the AV guy’s trying to get the audio on and the recording started.

Let’s just say it went downhill from there.

So, ratings for that one.

Very poor Poor Average Good Excellent
How would you rate the usefulness of the session information in your day-to-day environment? 2 1 7 23 51
How would you rate the Speaker’s presentation skills? 5 29 50
How would you rate the Speaker’s knowledge of the subject? 1 11 72
How would you rate the accuracy of the session title, description, and experience level to the actual session? 1 4 31 48
How would you rate the amount of time allocated to cover the topic/session? 6 31 47
How would you rate the quality of the presentation materials? 4 33 47

If I do the same averaging as for the first one, that comes out at 4.55. Not the worst I’ve ever had, though not by much. Lessons learnt.

IN vs INNER JOIN

Often in forum threads discussing query performance I’ll see people recommending replacing an INNER JOIN with an IN or EXISTS (or recommending replacing an IN or EXISTS with an INNER JOIN) for performance reasons. I’ve previously looked at how the IN and EXISTS compared, now I’m going to investigate and see how IN compares with the join.

One very important thing to note right off is that they are not equivalent in all cases.

An inner join between two tables does a complete join, it checks for matches and returns rows. This means, if there are multiple matching rows in the second table, multiple rows will be returned. Also, when two tables are joined, columns can be returned from either.  As a quick example:

01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
DECLARE @BigTable TABLE (
 id INT IDENTITY PRIMARY KEY,
 SomeColumn CHAR(4),
 Filler CHAR(100)
)
 
Insert into @BigTable(SomeColumn) Values (1)
Insert into @BigTable(SomeColumn) Values (2)
Insert into @BigTable(SomeColumn) Values (3)
Insert into @BigTable(SomeColumn) Values (4)
Insert into @BigTable(SomeColumn) Values (5)
 
 
DECLARE @SomeTable TABLE (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)
Insert into @SomeTable (IntCol) Values (5)
 
SELECT *
 FROM @BigTable b INNER JOIN @SomeTable  s ON b.SomeColumn = s.IntCol

This returns 7 rows and returns columns from both tables. Because the values in @SomeTable are duplicated, the matching rows from BigTable are returned twice.

With an IN, what is done is a semi-join, a join that checks for matches but does not return rows. This means if there are multiple matching tables in the resultset used for the IN, it doesn’t matter. Only one row from the first table will be returned. Also, because the rows are not returned, columns from the table referenced in the IN cannot be returned. As a quick example

01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
DECLARE @BigTable TABLE (
 id INT IDENTITY PRIMARY KEY,
 SomeColumn CHAR(4),
 Filler CHAR(100)
)
 
Insert into @BigTable(SomeColumn) Values (1)
Insert into @BigTable(SomeColumn) Values (2)
Insert into @BigTable(SomeColumn) Values (3)
Insert into @BigTable(SomeColumn) Values (4)
Insert into @BigTable(SomeColumn) Values (5)
 
DECLARE @SomeTable TABLE (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)
Insert into @SomeTable (IntCol) Values (5)
 
SELECT *
 FROM @BigTable
 WHERE SomeColumn IN (Select IntCol FROM @SomeTable)

This returns 5 rows and only columns from BigTable.

So, that said, how does the performance of the two differ for the cases where the results are identical (no duplicates in the second table, no columns needed from the second table)? For that, I’m going to need larger tables to play with. (more…)

Look back on 2009 and plans for the new year

Another year gone and a new one just starting. Time to take a look back at the the goals that I set for myself, which ones I’ve achieved and which I haven’t.

I did fairly well with the goals that I set back in October. The experiment design for my thesis is not documented and the WPF book is not finished, but the rest is all done and the exam was easily passed. So not too bad there. Better than I managed in the first half of the year.

And now for next year…

The biggest thing is that I’m going to take a near-complete break from the SQL community for at least the first half of the year so that I can focus on my thesis. No writing articles, no contributing to books, no giving or writing presentations. I’ll still blog, though likely only once a month on SQL-related stuff, though there may well be some AI stuff appearing from time to time. I’ll still be around on the SSC forums, though not as much as I currently am. I’ll also still be involved in the local usergroup. I can’t abandon that.

So, with that out of the way, the goals for the next six months:

  • Read one SQL book
  • Read at least two AI books
  • Get the experiment for my thesis designed and coded.
  • Write at least two chapters of the thesis
  • Get back into computer graphics and get two images done

To add to that, strange as it may seem, I’m going to ensure that I take time to read, relax and exercise away from the computer. This year I’ve spend too much time ‘busy’. I have a stack of books almost a metre high that I’ve bought but not opened. I have a similar stack of movies and games (though not quite as high) and I had a nasty bout of burnout Oct/Nov and I really don’t want  that again.