NOT EXISTS vs NOT IN

Continuing with the mini-series on query operators, I want to have a look at NOT EXISTS and NOT IN.

Just one note before diving into that. The examples I’m using are fairly simplistic and that’s intentional. I’m trying to find what, if any, are the performance differences in a benchmark-style setup. I’ll have some comments on more complex examples in a later post.

The most important thing to note about NOT EXISTS and NOT IN is that, unlike EXISTS and IN,  they are not equivalent in all cases. Specifically, when NULLs are involved they will return different results. To be totally specific, when the subquery returns even one null, NOT IN will not match any rows.

The reason for this can be found by looking at the details of what the NOT IN operation actually means.

Let’s say, for illustration purposes that there are 4 rows in the table called t, there’s a column called ID with values 1..4

WHERE SomeValue NOT IN (SELECT AVal FROM t)
is equivalent to
WHERE (
SomeValue != (SELECT AVal FROM t WHERE ID=1)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=2)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=3)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=4)
)

Let’s further say that AVal is NULL where ID = 4. Hence that != comparison returns UNKNOWN. The logical truth table for AND states that UNKNOWN and TRUE is UNKNOWN, UNKNOWN and FALSE is FALSE. There is no value that can be AND’d with UNKNOWN to produce the result TRUE

Hence, if any row of that subquery returns NULL, the entire NOT IN operator will evaluate to either FALSE or NULL and no records will be returned

So what about EXISTS?

Continue reading ‘NOT EXISTS vs NOT IN’

SQL Server Usergroup – February meeting

The February meeting of the SA SQL Server usergroup will be on the 16th of February 2010. Venue and time are the same as always, 18h30 at the Microsoft offices

This month, Richard Sweetnam, one of Microsoft SA’s Premier Field Engineers will be presenting on Tips and Tricks for Management Studio.

Hope to see you all there.

Genetic Algorithms

Warning. This post has absolutely nothing to do with SQL Server.

What are Genetic Algorithms?

Genetic algorithms are a form of evolutionary computation, a branch of artificial intelligence that focuses on evolving effective or optimal solutions to difficult problems, based on the biological theory of evolution.

Genetic algorithms are, at their core, a search/optimisation technique. They are a way of finding maximum/minimum solutions to problems and, can be effective when there is no algorithmic solution to the problem. An example here would be the ‘Travelling Salesman’ problem.

Genetic algorithms work by taking an initial population of potential solutions (referred to as individuals), selecting a subset of the population that has the highest fitness then using that subset to generate a second generation. From the second generation again a subset with the highest fitness is selected and used to generate a third generation. This repeats until either the ‘fittest’ individual is considered a good enough solution, or until a certain number of generations have passed.

There are advantage to using genetic algorithms to solve problems over more traditional methods like hill climbing.

  • Genetic algorithms can quickly produce good solutions though they may take a lot of time to find the best solution. This is a benefit when the problem is such that the absolute best solution is not necessary, just one that is ‘good enough’
  • They are not susceptible to getting trapped by local maxima.
  • They do not work on the entire search space one potential solution at a time, but rather work on populations of potential solutions, focusing towards more optimal areas of the search space.

A genetic algorithm will almost always find an optimal solution, given enough time. The main downside is that they may take a lot of time to find that optimal solution.

Continue reading ‘Genetic Algorithms’

Getting here from there

Another month, another blog chain, this time started by Paul Randal. I got tagged by both Grant and Steve, on the same day.

I could easily think of two events that dramatically influenced where I am today, finding a third with as major an impact was difficult. I think the third one qualifies as an important enough event, while it didn’t really affect my career, it did influence my community involvement.

I canna take it anymore

I grew up surrounded by two things, computers and science fiction.

My father was a computer programmer in those days (today to runs a software company) and there were computers around from the earliest I remember. From the Sharp that I played Asteroids and The Valley on, to the NCR with it’s beeping keyboard where I first started programming (in a variant of basic), to the 80286 that my father gave me when he bought himself something faster. I’ve always had computers around that I could use. Despite that, I never had any intention of going into IT as a career.

My mother is a trekkie (classic Star Trek only please) so I grew up watching (and reading) lots of Science Fiction. From Star Trek to Dr Who to Battlestar Galactica to the entire science fiction collection at the local library I watched and read everything I could get my hands on, and it wasn’t long before I started reading Science fact as well as Science fiction. By the time I got to high school my career plans were leaning in the direction of Physics and Astronomy. Placing very high in the national Science Olympiad and almost winning a trip to Space Camp just strengthened those intentions.  I enjoyed playing with computers, but that was more a hobby (and, by that point, a place to play games)

I entered university with the intention to major in Physics, take a related subject as my second major and then get an Honours degree1 in Physics and find a job in astronomy or physics research. I took Computer Science as my second major because it was one of the few subjects that I was interested in that didn’t conflict with the other subjects I had to take (Chemistry 1 and Maths 1) I spend most of my spare time in my first two years in the Physics department library. I reckon that I must have read easily a third of that library in those two years

Just two problems with that intention. Firstly, there’s almost no demand in this country for physicists other than the universities and the national observatory. Secondly, by the time I got to 3rd year physics, I couldn’t handle the Maths involved. It was part way through the course on Quantum Physics (which contained more maths than some of the 3rd year maths courses did) that I realised that if I couldn’t handle the maths at this point, there was no way I’d ever be able to get a post-grad degree in physics.

I finished the Bachelors degree majoring in Physics and Computer Science and then applied for the honours degree in the Computer Science department

(1) In South Africa the Honours degree is a one year post-grad degree that sits between the Bachelors degree and the Masters degree.

Continue reading ‘Getting here from there’

SQL Pass session evaluations

I finally got the last of my PASS Summit session evals and so, like some other people, I thought I’d make them public.

Lies, damned lies and statistics (DBA-388-S)

This session went very well. I was comfortable with the material, it’s a topic I really like and in general it felt, to me at least, like a good session. The ratings seem to agree with that.

Very Poor Poor Average Good Excellent
How would you rate the usefulness of the session information in your day-to-day environment? 1 7 36
How would you rate the Speaker’s presentation skills? 3 5 36
How would you rate the Speaker’s knowledge of the subject? 4 40
How would you rate the accuracy of the session title, description, and experience level to the actual session? 5 39
How would you rate the amount of time allocated to cover the topic/session? 11 33
How would you rate the quality of the presentation materials? 1 7 36

If I make Very Poor = 1 and Excellent = 5 then, averaging all the scores over all the questions, overall that session rated at 4.82/5

Not bad at all.

Edit: The overall PASS Summit session ratings are out and this session came in at 7th overall (all sessions including pre/post cons, all tracks) and 5th in the DBA track, behind only Buck Woody, Kimberly Tripp and Paul Randal I am extremely surprised to have come in that high at a conference like the PASS Summit.

Insight into Indexes (DBA-315)

This session was a whole different story. It did not go well at all, and I didn’t need the ratings to tell me that.

I wasn’t overly comfortable with the material. This is not to say that I didn’t know it, I did, but I wasn’t comfortable with it. In retrospect, I should have scrapped the entire presentation and done it over from scratch in a different way, even if that meant doing it the night before. Lesson learnt there.

To add to that, I broke my own rules for presentations. Usually I’m at the session room at least 5 minutes before the previous session finishes, with my laptop booted, the presentation loaded, management studio (and profiler if necessary) open and any pre-demo scripts already run. That way, as soon as the speaker who’s presenting in the session before mine finishes, I can get on stage, plug the laptop in, get the projector online and then relax.

In this case, I was late. The previous speaker had already left and my laptop was still switched off. Hence I rushed to get everything loaded and ready, and Windows, sensing the urgency, promptly crashed hard.

Cue 2 minutes of frantically trying to reboot laptop (it was ignoring all shut down requests) and load presentation onto the desktop in case my laptop didn’t reboot. All while the AV guy’s trying to get the audio on and the recording started.

Let’s just say it went downhill from there.

So, ratings for that one.

Very poor Poor Average Good Excellent
How would you rate the usefulness of the session information in your day-to-day environment? 2 1 7 23 51
How would you rate the Speaker’s presentation skills? 5 29 50
How would you rate the Speaker’s knowledge of the subject? 1 11 72
How would you rate the accuracy of the session title, description, and experience level to the actual session? 1 4 31 48
How would you rate the amount of time allocated to cover the topic/session? 6 31 47
How would you rate the quality of the presentation materials? 4 33 47

If I do the same averaging as for the first one, that comes out at 4.55. Not the worst I’ve ever had, though not by much. Lessons learnt.

IN vs INNER JOIN

Often in forum threads discussing query performance I’ll see people recommending replacing an INNER JOIN with an IN (or recommending replacing an IN with an INNER JOIN) for performance reasons. Hence it seems to be a good idea to investigate and see what the performance differences (if any) really are.

One very important thing to note right off is that they are not equivalent in all cases.

An inner join between two tables does a complete join, it checks for matches and returns rows. This means, if there are multiple matching rows in the second table, multiple rows will be returned. Also, when two tables are joined, columns can be returned from either.  As a quick example (definition of BigTable towards the end of the post)

DECLARE @SomeTable (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
 FROM BigTable b INNER JOIN @SomeTable  s ON b.SomeColumn IN s.IntCol

This returns 7 rows and returns columns from both tables. Because the values in @SomeTable are duplicated, the matching rows from BigTable are returned twice.

With an IN, what is done is a semi-join, a join that checks for matches but does not return rows. This means if there are multiple matching tables in the resultset used for the IN, it doesn’t matter. Only one row from the first table will be returned. Also, because the rows are not returned, columns from the table referenced in the IN cannot be returned. As a quick example

DECLARE @SomeTable (IntCol int)
 Insert into @SomeTable (IntCol) Values (1)
 Insert into @SomeTable (IntCol) Values (2)
 Insert into @SomeTable (IntCol) Values (2)
 Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)
 Insert into @SomeTable (IntCol) Values (5)

SELECT *
 FROM BigTable
 WHERE SomeColumn IN (Select IntCol FROM @SomeTable)

This returns 5 rows and only columns from BigTable.

So, that said, how does the performance of the two differ for the cases where the results are identical (no duplicates in the second table, no columns needed from the second table)? For that, I’m going to need larger tables to play with. Continue reading ‘IN vs INNER JOIN’

Look back on 2009 and plans for the new year

Another year gone and a new one just starting. Time to take a look back at the the goals that I set for myself, which ones I’ve achieved and which I haven’t.

I did fairly well with the goals that I set back in October. The experiment design for my thesis is not documented and the WPF book is not finished, but the rest is all done and the exam was easily passed. So not too bad there. Better than I managed in the first half of the year.

And now for next year…

The biggest thing is that I’m going to take a near-complete break from the SQL community for at least the first half of the year so that I can focus on my thesis. No writing articles, no contributing to books, no giving or writing presentations. I’ll still blog, though likely only once a month on SQL-related stuff, though there may well be some AI stuff appearing from time to time. I’ll still be around on the SSC forums, though not as much as I currently am. I’ll also still be involved in the local usergroup. I can’t abandon that.

So, with that out of the way, the goals for the next six months:

  • Read one SQL book
  • Read at least two AI books
  • Get the experiment for my thesis designed and coded.
  • Write at least two chapters of the thesis
  • Get back into computer graphics and get two images done

To add to that, strange as it may seem, I’m going to ensure that I take time to read, relax and exercise away from the computer. This year I’ve spend too much time ‘busy’. I have a stack of books almost a metre high that I’ve bought but not opened. I have a similar stack of movies and games (though not quite as high) and I had a nasty bout of burnout Oct/Nov and I really don’t want  that again.

Greatest weakness

Someone went off and started asking people what their greatest weaknesses are, then someone else decided to pass the question my way. Perhaps those someones’ weaknesses are curiosity… Still, since everyone else is revealing their darkest secrets, I’ll give it a go as well.

Mine would have to be two very closely related things. Procrastination and short attention span. (why do you think this post is 4 days late?)

I tend to put stuff off until the absolute last minute (and sometimes beyond then) and then work frantically to get it finished. And because I’m also a bit of a perfectionist, I’m not satisfied with half-done work or a hack job so I’ll be working late into the night (or over a weekend) to get it done properly.

To add to that, unless I’m doing something I really enjoy, my attention span tends not to be very long. 15-20 minutes is good, it’s usually less. After that it’s either fight to stay focused or take a short break. In and of itself, that’s not so bad, the problem is that the breaks are often not so short, if I get caught up in whatever else I’m doing, or distracted by something else (and something else and something else).

So what am I doing about this?

Continue reading ‘Greatest weakness’

Are trivial plans cached?

It is sometimes said that trivial execution plans are not cached and queries that have such plans are compiled on every execution. So is that true? To effectively answer this question, we must first establish what a trivial plan is.

A trivial plan is essentially a plan for a query where a specific plan will always be the most optimal way of executing it. If we consider something like SELECT * FROM SomeTable then there’s only one real way to execute it, a scan of the cluster/heap.

The trivial plan is somewhat of a query optimiser optimisation. If the query qualifies for a trivial plan (and there are lots of restrictions) then the full optimisation process doesn’t need to be started and so the query’s execution plan can be generated quicker and with less overhead. The fact that a query has a trivial plan at one point doesn’t necessarily mean that it will always have a trivial plan, indexes may be added that make the selection of plan less of a sure thing and so the query must go for full optimisation, rather than getting a trivial plan

Nice theory, but how does one tell if a particular query has a trivial execution plan? The information is found within the execution plan, the properties of the highest-level operator has an entry ‘Optimisation level’ For a trivial plan this will read ‘TRIVIAL’

Trivial plan

Continue reading ‘Are trivial plans cached?’

The most optimal join type

What’s the best join type for a query? Should we aspire to seeing nested loop joins in all our queries? Should we tremble with horror at the sight of a hash join?

Well, it depends. :-)

There’s no single join type that’s best in every scenario and there’s no join type that’s bad to have in every scenario. If one of the join types, say the much maligned hash join, was very much a sub-optimal join type in every single scenario, then there would be no reason for it to be in the product and no reason for the optimiser to ever select it for a plan. Since there are three join types, and the optimiser can and does use all three, we must assume that they are all useful under some circumstances.

I took a look at the joins a while back, but it’s worth revisiting.

The nested loop join

A nested loop join is an optimal join type when two conditions are true.

  1. One of the resultsets contains quite a small number of rows.
  2. The other table has an index on the join column(s).

When both of these are true, SQL can do a very efficient nested loop. The smaller resultset becomes the outer table of the join, a loop runs across all the rows in that resultset and index seeks are done to look up the matching rows in the inner table. It’s important to note that the number of seeks against the inner table will not be less than the number of rows in the outer table, at the point the join occurs

If the one resultset has a small number of rows but there is no index on the other table on the join column, then a loop join can still be done, but is less optimal as the entire of the inner table (or a subset based on another filter condition) must be read on each iteration of the loop.

If both resultsets have large numbers of rows but there is an index on the join columns in one of the tables then the nested loop can still read through one of the resultsets and do index seeks to locate matching rows, but the number of rows in the outer table will mean lots and lots of seek operations, which may result in a sub-optimal plan.

Continue reading ‘The most optimal join type’