Syndication

Getting here from there

Another month, another blog chain, this time started by Paul Randal. I got tagged by both Grant and Steve, on the same day.

I could easily think of two events that dramatically influenced where I am today; finding a third with as major an impact was harder. I think the third one still qualifies as important enough: while it didn’t really affect my career, it did influence my community involvement.

I canna take it anymore

I grew up surrounded by two things, computers and science fiction.

My father was a computer programmer in those days (today he runs a software company) and there were computers around for as long as I can remember: from the Sharp that I played Asteroids and The Valley on, to the NCR with its beeping keyboard where I first started programming (in a variant of BASIC), to the 80286 that my father gave me when he bought himself something faster. I’ve always had computers around that I could use. Despite that, I never had any intention of going into IT as a career.

My mother is a Trekkie (classic Star Trek only, please), so I grew up watching (and reading) lots of science fiction. From Star Trek to Dr Who to Battlestar Galactica to the entire science fiction collection at the local library, I watched and read everything I could get my hands on, and it wasn’t long before I started reading science fact as well as science fiction. By the time I got to high school my career plans were leaning in the direction of Physics and Astronomy. Placing very high in the national Science Olympiad and almost winning a trip to Space Camp just strengthened those intentions. I enjoyed playing with computers, but that was more a hobby (and, by that point, a place to play games).

I entered university with the intention of majoring in Physics, taking a related subject as my second major, then getting an Honours degree (1) in Physics and finding a job in astronomy or physics research. I took Computer Science as my second major because it was one of the few subjects I was interested in that didn’t conflict with the other subjects I had to take (Chemistry 1 and Maths 1). I spent most of my spare time in my first two years in the Physics department library; I reckon I must have read easily a third of that library in those two years.

Just two problems with that intention. Firstly, there’s almost no demand in this country for physicists outside of the universities and the national observatory. Secondly, by the time I got to 3rd-year physics, I couldn’t handle the maths involved. It was partway through the course on Quantum Physics (which contained more maths than some of the 3rd-year maths courses did) that I realised that if I couldn’t handle the maths at this point, there was no way I’d ever be able to get a post-grad degree in physics.

I finished the Bachelors degree majoring in Physics and Computer Science and then applied for the Honours degree in the Computer Science department.

(1) In South Africa the Honours degree is a one-year post-grad degree that sits between the Bachelors degree and the Masters degree.

(more…)

SQL Pass session evaluations

I finally got the last of my PASS Summit session evals and so, like some other people, I thought I’d make them public.

Lies, damned lies and statistics (DBA-388-S)

This session went very well. I was comfortable with the material, it’s a topic I really like and in general it felt, to me at least, like a good session. The ratings seem to agree with that.

| | Very Poor | Poor | Average | Good | Excellent |
|---|---|---|---|---|---|
| How would you rate the usefulness of the session information in your day-to-day environment? | – | – | 1 | 7 | 36 |
| How would you rate the Speaker’s presentation skills? | – | – | 3 | 5 | 36 |
| How would you rate the Speaker’s knowledge of the subject? | – | – | – | 4 | 40 |
| How would you rate the accuracy of the session title, description, and experience level to the actual session? | – | – | – | 5 | 39 |
| How would you rate the amount of time allocated to cover the topic/session? | – | – | – | 11 | 33 |
| How would you rate the quality of the presentation materials? | – | – | 1 | 7 | 36 |

If I make Very Poor = 1 and Excellent = 5 then, averaging all the scores over all the questions, overall that session rated at 4.82/5
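As a worked example from the table above, the first question comes to (1×3 + 7×4 + 36×5) / 44 ≈ 4.80; the six per-question averages taken together give the 4.82.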

Not bad at all.

Edit: The overall PASS Summit session ratings are out and this session came in at 7th overall (all sessions including pre/post cons, all tracks) and 5th in the DBA track, behind only Buck Woody, Kimberly Tripp and Paul Randal. I am extremely surprised to have come in that high at a conference like the PASS Summit.

Insight into Indexes (DBA-315)

This session was a whole different story. It did not go well at all, and I didn’t need the ratings to tell me that.

I wasn’t overly comfortable with the material. This is not to say that I didn’t know it (I did), but I wasn’t comfortable with it. In retrospect, I should have scrapped the entire presentation and redone it from scratch in a different way, even if that meant doing it the night before. Lesson learnt there.

To add to that, I broke my own rules for presentations. Usually I’m at the session room at least 5 minutes before the previous session finishes, with my laptop booted, the presentation loaded, Management Studio (and Profiler if necessary) open and any pre-demo scripts already run. That way, as soon as the previous speaker finishes, I can get on stage, plug the laptop in, get the projector online and then relax.

In this case, I was late. The previous speaker had already left and my laptop was still switched off. Hence I rushed to get everything loaded and ready, and Windows, sensing the urgency, promptly crashed hard.

Cue two minutes of frantically trying to reboot the laptop (it was ignoring all shutdown requests) and load the presentation onto the desktop machine in case my laptop didn’t come back, all while the AV guy was trying to get the audio on and the recording started.

Let’s just say it went downhill from there.

So, ratings for that one.

| | Very Poor | Poor | Average | Good | Excellent |
|---|---|---|---|---|---|
| How would you rate the usefulness of the session information in your day-to-day environment? | 2 | 1 | 7 | 23 | 51 |
| How would you rate the Speaker’s presentation skills? | – | – | 5 | 29 | 50 |
| How would you rate the Speaker’s knowledge of the subject? | – | – | 1 | 11 | 72 |
| How would you rate the accuracy of the session title, description, and experience level to the actual session? | – | 1 | 4 | 31 | 48 |
| How would you rate the amount of time allocated to cover the topic/session? | – | – | 6 | 31 | 47 |
| How would you rate the quality of the presentation materials? | – | – | 4 | 33 | 47 |

If I do the same averaging as for the first one, that comes out at 4.55. Not the worst I’ve ever had, though not by much. Lessons learnt.

IN vs INNER JOIN

Often in forum threads discussing query performance I’ll see people recommending replacing an INNER JOIN with an IN or EXISTS (or recommending replacing an IN or EXISTS with an INNER JOIN) for performance reasons. I’ve previously looked at how IN and EXISTS compare; now I’m going to investigate how IN compares with the join.

One very important thing to note right off is that they are not equivalent in all cases.

An inner join between two tables does a complete join: it checks for matches and returns the matching rows. This means that if there are multiple matching rows in the second table, multiple rows will be returned. Also, when two tables are joined, columns can be returned from either. As a quick example:

DECLARE @BigTable TABLE (
 id INT IDENTITY PRIMARY KEY,
 SomeColumn CHAR(4),
 Filler CHAR(100)
)

INSERT INTO @BigTable (SomeColumn) VALUES (1)
INSERT INTO @BigTable (SomeColumn) VALUES (2)
INSERT INTO @BigTable (SomeColumn) VALUES (3)
INSERT INTO @BigTable (SomeColumn) VALUES (4)
INSERT INTO @BigTable (SomeColumn) VALUES (5)

DECLARE @SomeTable TABLE (IntCol INT)
INSERT INTO @SomeTable (IntCol) VALUES (1)
INSERT INTO @SomeTable (IntCol) VALUES (2)
INSERT INTO @SomeTable (IntCol) VALUES (2)
INSERT INTO @SomeTable (IntCol) VALUES (3)
INSERT INTO @SomeTable (IntCol) VALUES (4)
INSERT INTO @SomeTable (IntCol) VALUES (5)
INSERT INTO @SomeTable (IntCol) VALUES (5)

SELECT *
 FROM @BigTable b INNER JOIN @SomeTable s ON b.SomeColumn = s.IntCol

This returns 7 rows, with columns from both tables. Because some values in @SomeTable are duplicated (2 and 5), the rows from @BigTable that match those values are returned twice.

With an IN, what is done is a semi-join: a join that checks for matches but does not return the matched rows. This means that if there are duplicate matching rows in the resultset used for the IN, it doesn’t matter; each qualifying row from the first table will be returned only once. Also, because the matched rows are not returned, columns from the table referenced in the IN cannot be returned. As a quick example:

DECLARE @BigTable TABLE (
 id INT IDENTITY PRIMARY KEY,
 SomeColumn CHAR(4),
 Filler CHAR(100)
)

INSERT INTO @BigTable (SomeColumn) VALUES (1)
INSERT INTO @BigTable (SomeColumn) VALUES (2)
INSERT INTO @BigTable (SomeColumn) VALUES (3)
INSERT INTO @BigTable (SomeColumn) VALUES (4)
INSERT INTO @BigTable (SomeColumn) VALUES (5)

DECLARE @SomeTable TABLE (IntCol INT)
INSERT INTO @SomeTable (IntCol) VALUES (1)
INSERT INTO @SomeTable (IntCol) VALUES (2)
INSERT INTO @SomeTable (IntCol) VALUES (2)
INSERT INTO @SomeTable (IntCol) VALUES (3)
INSERT INTO @SomeTable (IntCol) VALUES (4)
INSERT INTO @SomeTable (IntCol) VALUES (5)
INSERT INTO @SomeTable (IntCol) VALUES (5)

SELECT *
 FROM @BigTable
 WHERE SomeColumn IN (SELECT IntCol FROM @SomeTable)

This returns 5 rows, and only columns from @BigTable.
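For comparison, here’s a minimal sketch of the same semi-join written with EXISTS (mentioned at the start). It has to run in the same batch as the table variable declarations above, and it returns the same 5 rows as the IN form:

-- Semi-join written with EXISTS; returns the same 5 rows as the IN version
SELECT *
 FROM @BigTable b
 WHERE EXISTS (SELECT 1 FROM @SomeTable s WHERE s.IntCol = b.SomeColumn)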

So, that said, how does the performance of the two differ for the cases where the results are identical (no duplicates in the second table, no columns needed from the second table)? For that, I’m going to need larger tables to play with. (more…)

Are trivial plans cached?

It is sometimes said that trivial execution plans are not cached and queries that have such plans are compiled on every execution. So is that true? To effectively answer this question, we must first establish what a trivial plan is.

A trivial plan is essentially a plan for a query where a specific plan will always be the most optimal way of executing it. If we consider something like SELECT * FROM SomeTable then there’s only one real way to execute it: a scan of the clustered index or heap.

The trivial plan is somewhat of a query optimiser optimisation. If the query qualifies for a trivial plan (and there are lots of restrictions), then the full optimisation process doesn’t need to be started and so the query’s execution plan can be generated faster and with less overhead. The fact that a query gets a trivial plan at one point doesn’t necessarily mean that it always will; indexes may be added that make the choice of plan less of a sure thing, and then the query must go through full optimisation rather than getting a trivial plan.

Nice theory, but how does one tell whether a particular query has a trivial execution plan? The information is found within the execution plan: the properties of the highest-level operator have an entry ‘Optimisation level’. For a trivial plan this will read ‘TRIVIAL’.

Trivial plan
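As an aside, the same information is recorded in the plan XML as the StatementOptmLevel attribute, so whether a cached plan was compiled at the trivial level can also be checked with a query along these lines (a rough sketch; the LIKE filter on ‘SomeTable’ is just a placeholder for the query of interest):

-- Rough sketch: list cached plans and the optimisation level recorded in the plan XML
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT st.text,
 qp.query_plan.value('(//StmtSimple/@StatementOptmLevel)[1]', 'VARCHAR(25)') AS OptimisationLevel
 FROM sys.dm_exec_cached_plans cp
 CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
 CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
 WHERE st.text LIKE '%SomeTable%'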

(more…)

The most optimal join type

What’s the best join type for a query? Should we aspire to seeing nested loop joins in all our queries? Should we tremble with horror at the sight of a hash join?

Well, it depends. 🙂

There’s no single join type that’s best in every scenario and there’s no join type that’s bad to have in every scenario. If one of the join types, say the much maligned hash join, was very much a sub-optimal join type in every single scenario, then there would be no reason for it to be in the product and no reason for the optimiser to ever select it for a plan. Since there are three join types, and the optimiser can and does use all three, we must assume that they are all useful under some circumstances.

I took a look at the joins a while back, but it’s worth revisiting.

The nested loop join

A nested loop join is an optimal join type when two conditions are true.

  1. One of the resultsets contains quite a small number of rows.
  2. The other table has an index on the join column(s).

When both of these are true, SQL can do a very efficient nested loop. The smaller resultset becomes the outer table of the join; a loop runs across all the rows in that resultset, and index seeks are done to look up the matching rows in the inner table. It’s important to note that the number of seeks against the inner table will not be less than the number of rows in the outer table at the point the join occurs.
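As a rough illustration (the tables and index here are hypothetical, not from any particular database), this is the shape of query where the optimiser will typically choose a nested loop: a small, filtered outer resultset, with each outer row driving an index seek into the inner table.

-- Hypothetical tables: Customers filtered to a single row becomes the outer input;
-- an assumed index on Orders.CustomerID lets each outer row be resolved with a seek
SELECT c.CustomerID, o.OrderID, o.OrderDate
 FROM dbo.Customers c
 INNER JOIN dbo.Orders o ON o.CustomerID = c.CustomerID
 WHERE c.CustomerID = 42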

If the one resultset has a small number of rows but there is no index on the other table’s join column(s), then a loop join can still be done, but it is less optimal, as the entire inner table (or a subset based on another filter condition) must be read on each iteration of the loop.

If both resultsets have large numbers of rows but there is an index on the join columns in one of the tables then the nested loop can still read through one of the resultsets and do index seeks to locate matching rows, but the number of rows in the outer table will mean lots and lots of seek operations, which may result in a sub-optimal plan.

(more…)

TechEd Online Interview

Back in August at TechEd Africa I did a TechEd Online interview with Frikkie Bosch. Frikkie’s the marketing manager for the Server products down here in South Africa. We discussed some common mistakes that I’ve seen regarding SQL Server performance. I’m not talking specifics of query/table design here, but rather mistakes at a higher level.

The interview is available on the TechEd Online site. I’m interested in what people think, am I on the mark or completely in the wrong ballpark?

More on the MVP SQL Server book

If you buy from Manning directly today or tomorrow, there’s a 50% discount. Use the code pop0928 at the checkout. Hurry, hurry, hurry…

For those who want a copy and are going to the PASS Summit in November, why not pick up your copy of the book there? The PASS bookshop will be selling them (though no word yet on how many) and many of the authors will be at the Summit if you want to get the book autographed.

For more details, see http://sqlblog.com/blogs/paul_nielsen/archive/2009/09/29/53-mvps-warchild-org-and-sqlservermvpdeepdives-com.aspx

MVP book is available for preorder

Last year a large number of MVPs (mostly SQL Server) got together to write a book with all proceeds going to charity. The book consists of over 50 chapters, each on a different SQL Server-related topic, from database architecture through database administration, database development, performance tuning and even Business Intelligence. I contributed a chapter on deadlock graphs: how to dissect the deadlock graph and see what locks were taken and requested, and by what statements.

The book’s been delayed a couple of times for various reasons, but it is now available for pre-order from a number of places:

Why is CheckDB rolling back transactions?

I saw this one on a forum recently and it’s a fun question to look into.

When running CheckDB on an in-use database, sometimes messages like the following will appear in the SQL error log:

1 transactions rolled back in database ‘Testing’ (8). This is an informational message only. No user action is required.
DBCC CHECKDB (Testing) WITH no_infomsgs executed by theAdmin found 0 errors and repaired 0 errors. Elapsed time: 0 hours 0 minutes 31 seconds.

What’s going on here? Why is CheckDB rolling transactions back?

Well, the truth is, it’s not rolling anything back.

When CheckDB runs, it has to be able to get a consistent view of the database: one without half-done transactions or data modifications in progress. On SQL 2000 (and earlier), CheckDB used the transaction log to get that consistent view. It would read through the transaction log to see what had been done to the various structures in the database. In SQL 2005 this was changed, and CheckDB now uses the database snapshot feature to get its consistent, point-in-time view of the database.

Database Snapshots

A database snapshot is a read-only, point-in-time copy of a database. It uses a combination of sparse files and copy-on-write technology to reduce space usage and, as a result, it’s usually much smaller than the database it is based on.
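A minimal sketch of creating one manually (the snapshot name, file path and logical file name here are assumptions; the logical name must match the source database’s data file):

-- Assumes the Testing database has a single data file with the logical name 'Testing';
-- the snapshot name and the .ss file path are arbitrary
CREATE DATABASE Testing_Snapshot
 ON (NAME = Testing, FILENAME = 'D:\Snapshots\Testing_Snapshot.ss')
 AS SNAPSHOT OF Testing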

When a snapshot is created, any open transactions are rolled back within the context of the database snapshot to generate the consistent, point-in-time copy. This generates messages in the SQL error log stating how many transactions were rolled back.

CheckDB

The first thing that CheckDB has to do is to create an internal, hidden database snapshot to get the consistent view of the DB that it needs. It will do this unless it’s run with the TABLOCK option. When the snapshot is created, any open transactions will be rolled back, but only in the context of the snapshot, not in the context of the actual database. CheckDB will then run against the snapshot and will drop it once complete.
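If the internal snapshot (and the rollback messages that come with it) is not wanted, CheckDB can be run with the TABLOCK option instead, at the cost of reduced concurrency, since it takes locks rather than using a snapshot:

-- Skip the internal snapshot; CheckDB takes locks instead, so concurrent
-- data modifications in Testing may be blocked for the duration
DBCC CHECKDB (N'Testing') WITH NO_INFOMSGS, TABLOCK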

So the answer to the question of why checkDB is rolling back transactions is a simple one. It isn’t.

Estimated rows, actual rows and execution count

It’s often said that a major discrepancy between estimated and actual row counts in a query’s execution plan is a sign of inaccurate statistics or a poor cardinality estimate, and hence a sign of a problem. This is generally true; however, there are places where the estimated and actual rows will differ, often quite dramatically, without it being a problem. The reason for this is that these two values show slightly different things.

Let’s take a look at an example. (table creation code at the end of the post)

SELECT bt.id, bt.SomeColumn, st.SomeArbDate
 FROM dbo.BigTable bt
 INNER JOIN dbo.SmallerTable st ON bt.SomeColumn = st.LookupColumn
 WHERE bt.id BETWEEN 5000 AND 5100

Estimated vs actual discrepancy

Estimated rows = 1, actual rows = 101. That’s a large discrepancy, but what caused it? It’s not out-of-date statistics (a usual cause), because the table has only just been created, so why is the estimate so far from the actual?
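One way to see both figures, along with the execution count mentioned in the title, in text form rather than in the graphical plan, is SET STATISTICS PROFILE, which returns actual Rows and Executes columns alongside EstimateRows and EstimateExecutions for every operator:

-- Returns the plan as a resultset, with actual Rows/Executes next to
-- EstimateRows/EstimateExecutions for each operator
SET STATISTICS PROFILE ON

SELECT bt.id, bt.SomeColumn, st.SomeArbDate
 FROM dbo.BigTable bt
 INNER JOIN dbo.SmallerTable st ON bt.SomeColumn = st.LookupColumn
 WHERE bt.id BETWEEN 5000 AND 5100

SET STATISTICS PROFILE OFF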

(more…)