Latest Posts

All indexes are unique

Well, that’s a rather contentious title. There are probably several people shaking their heads at this point. Let me explain.

I was listening to a podcast with Kimberly Tripp this morning, and she mentioned this briefly. I thought it would be a good discussion to end a short series on indexes and selectivity.

The Clustered Index

A clustered index has to be unique, because the clustering key acts as the row’s location in the table. If the index is not defined as unique, SQL will make it unique by adding a uniquifier, a 4-byte integer that’s hidden behind the scenes and is added when necessary to make the clustered index unique.

It’s not documented anywhere clearly, but it is mentioned in a couple of places. From msdn:

If the clustered index is not a unique index, SQL Server makes any duplicate keys unique by adding an internally generated value called a uniqueifier. This four-byte value is not visible to users. It is only added when required to make the clustered key unique for use in nonclustered indexes. SQL Server retrieves the data row by searching the clustered index using the clustered index key stored in the leaf row of the nonclustered index.

So all clustered indexes are unique.

(more…)

SQL Server usergroup – Feb meeting

The Feb meeting wll be held Tuesday the 17th Feb at 18h30 at the Microsoft offices in Bryanston.

Inus du Preez will be presenting a session “6 reasons why SQL Server 2008 must run on Windows Server 2008”

Inus will be covering improvements in Server 2008 such as network throughput, memory swapping, drive addressing and stability and why they make Server 2008 such a compelling platform for running SQL 2008 on.

He’s also going to look at running SQL under Hyper-V and under the Core installation.

As always, please RSVP by replying to this post, or replying to the thread on SADeveloper.

Index columns, selectivity and inequality predicates

So, following on from my post last week, I’m going to take a look at how selectivity and index column order affect inequality predicates.

One thing to note straight off is that the selectivity of a column is much less important for inequality predicates than it was for equality. For equality predicates, the selectivity alone can give a reasonable idea of the number of rows a particular predicate will return. That’s not the case with inequalities. Also, with inequality predicates, the order of columns in the index becomes very important.

One of the most important considerations with inequality predicates is the number of rows that the predicate will return. An identity column may be highly selective, but if the filter is for all rows > 0 and the identity values start t one, then an index on that column is not going to be very useful.

The other consideration when there are inequality predicates is that only that column and columns to the left of it in the index key can be used for index seeks. Any columns to the right of the column with the inequality is no longer eligible for seeking.

To explain with an example, consider our hypothetical table from the previous post (with one small change):

1
2
3
4
5
CREATE TABLE ConsideringIndexOrder (
ID INT,
SomeString VARCHAR (100),
SomeDate DATETIME DEFAULT GETDATE()
); 

The same as previously, there’s a single nonclustered index on all three columns, in the order ID, SomeDate, SomeString.

If there’s an inequality predicate, then then the index is only fully seekable for the following queries
…  WHERE ID = @ID AND SomeDate = @dt AND SomeString > @str
…  WHERE ID = @ID AND SomeDate > @dt
…  WHERE ID > @ID

(more…)

A Bookmark lookup, by any other name…

I think I may have confused some people by talking about bookmark lookups. I’ll attempt to clarify things.

The operator that I’m talking about is the one that fetches extra columns from the clustered index when the nonclustered index that’s used to retrieve the rows doesn’t have all of the columns required.

In SQL 2000, that operator appeared in the execution plan as a bookmark lookup and it appeared as follows:

In SQL 2005, the name was changed, and the bookmark lookup no longer appeared in the execution plan. In it’s place appeared a clustered index seek, joined back to the original index seek by a nested loop join. It appeared as follows (the highlighted operator is the ‘bookmark lookup’)

This change made it harder to see what was going on as clustered index seeks are usually considered ‘good’. The trick to see if it is actually a ‘bookmark lookup’ is to look at the objects involved. When the nonclustered index seek and the clustered index seek are both on the same table, then it’s a ‘bookmark lookup’

I don’t recall what this appeared as when the base table was a heap, not a cluster.

In SQL 2005 SP2, the name of the operator was changed again, now appearing as a key lookup (when the base table has a clustered index) or a RID lookup (when the base table is a heap). It now looks like this:

The thing to note is that it’s not the version of the server that’s important. The format of the XML for the exec plan has not changed since SQL 2005 RTM (I can and have created a .sqlplan file from SQL 2008 and opened that file in SQL 2005’s management studio).

It’s the version of management studio that affects how the execution plans are displayed. If the server is SQL 2005 SP3, but the client tools are still RTM, the bookmark lookup will appear as a clustered index seek. Another reason to patch the client as well as the server

I hope that clears up some of the confusion around the naming. So, in future, what should I refer to this as? A bookmark lookup? A Key lookup?

Bad Advice

It’s no secret that I’ve been fairly active on a couple of SQL forums for a couple of years. In that time I’ve seen all sorts of behaviour on forums, good and bad. By this point, most of that just doesn’t bother me any longer. There is one thing however that still gets me angry when I see it. Blatently bad advice.

Now, I’m not talking about first-attempt solutions that solve half of the query problem, I’m not talking about honest mistakes and I’m not talking about attempted answers to questions so vague they’re near-impossible to understand. I’m talking about advice that’s so bad it’s dangerous. I’m talking about things like this:

Q: My transaction log’s very large. What can I do to fix this?
A: Stop SQL, find the ldf file, delete it and then start SQL

Q: Dropping a clustered index on a large table takes a long time using drop index. Is there a faster way?
A: Run sp_configure “allow updates, 1 and then run delete from sysindexes where id = OBJECT_ID(‘MyTableName’) AND indid = 1

Ouch!

There’s two main problems with bad advice.

Firstly, the person asking possibly doesn’t understand enought to realise the advice is risky, and if they follow it without testing they could end up in a much worse situation than they were. Dependiong on the circumstances they may end up in trouble with their boss, they may even end up getting fired. Whether they realise the advice is bad before or after trying it out, it’ll erode their faith both in the forum and in the person who provided the information. That leads to the the second problem.

The second problem is damaged credibility and reputation. A good reputation is so hard to get in this industry and so easy to lose. Bad advice damages the forum’s reputation and the reputation of the person posting the advice. Brent Ozar puts it very well: “Being an expert means having credibility. It doesn’t matter how much you know if people don’t trust your answers.”

Then, of course, some other person on the forum has to come along and correct the bad advice and explain why it’s bad so that people who find the thread via google won’t think it’s useful

Bottom line, bad advice helps no one and hurts a number of people.

</soapbox>

Index columns, selectivity and equality predicates

Or “Which column goes first?

There’s a common piece of advice given about columns in an index key that says that the most selective column should go first. I’m not going to say that’s incorrect, because it’s not. The problem is that it’s often given without any explanation as to why the most selective column should go first, nor are the other considerations for index key order mentioned.

This can lead to misunderstandings like, in the extreme case, where one person after hearing that advice went and added the primary key column as the leading column of every single nonclustered index (because it’s highly selective), and then wondered why his database performance decreased dramatically.

The comment about selectivity is because of the way SQL keeps statistics on indexes (see my post on statistics for more info on what they are). SQL only keeps the histogram for the first column of the index. That means that it only knows the actual distribution of values of the first column. If the first column is not selective, the index may not be used. However, that’s not the whole story.

SQL also, in addition to the histogram, keeps density values for all of the left-based subsets of the index keys. So, for a 3 column index key, SQL knows the density of the first column, of the first and second and of all three. The density is, in a nutshell, a value that shows how unique the set of columns is. It’s 1/(distinct values). The value can be seen for any index using DBCC Show_Statistics with the DENSITY_VECTOR option.

This means, while SQL only knows the actual data distribution of the first column, it does know, on average, how many rows will be returned by an equality match on any left-based subset of the index keys

So, what’s my rule for the order of columns in an index key? Put the most selective columns first, when all other considerations are equal.

(more…)

SQL Server usergroup – Jan meeting

The January meeting of the SA SQL user group will be on Tuesday 20th of Jan 2009 at 18h30 at the Microsoft Offices in Bryanston.

We’re still trying to finalise a speaker, there’ll be an update on that in a day or two. We definitely will have someone, that’s not in doubt.

Please let me know if you’ll be attending. You can RSVP by mail, by replying to this post, or by sending me a PM either from SADeveloper.net or SQLServerCentral.com (Username – GilaMonster)

Update:

The speaker we were hoping to line up isn’t available, so I’ll be presenting again.

I’ll be doing the session that I presented at the PASS conference in Seattle last year – The dirty dozen… 12 Ways to write badly performing T-SQL.

Seek or scan?

One very common question that I see on the forums is on index seeks and index scans. A query is resulting in a table/clustered index scan, even though there’s an index on one or more of the columns been searched on.

One of the more common reasons that this happens is because the index in question is not covering, and SQL has determined that the cost of doing the lookups to fetch the extra columns is higher than the cost of scanning the entire table.

If an index does not cover a query, then bookmark lookups are required to get the additional columns, bookmark lookups are run one row at a time, and are seeks on the clustered index. Hence it’s clear that bookmark lookups on a large number of rows are exceedingly expensive and that is why SQL will switch to a clustered index/table scan when lookups are required on a significant percentage of the rows in the table.

So, what constitutes a significant percentage of the rows in the table? 50%? 20%? 10%?

(more…)