Index columns, selectivity and equality predicates

Or “Which column goes first?”

There’s a common piece of advice given about columns in an index key that says that the most selective column should go first. I’m not going to say that’s incorrect, because it’s not. The problem is that it’s often given without any explanation as to why the most selective column should go first, nor are the other considerations for index key order mentioned.

This can lead to misunderstandings like, in the extreme case, where one person after hearing that advice went and added the primary key column as the leading column of every single nonclustered index (because it’s highly selective), and then wondered why his database performance decreased dramatically.

The comment about selectivity is because of the way SQL keeps statistics on indexes (see my post on statistics for more info on what they are). SQL only keeps the histogram for the first column of the index. That means that it only knows the actual distribution of values of the first column. If the first column is not selective, the index may not be used. However, that’s not the whole story.

SQL also, in addition to the histogram, keeps density values for all of the left-based subsets of the index keys. So, for a 3 column index key, SQL knows the density of the first column, of the first and second and of all three. The density is, in a nutshell, a value that shows how unique the set of columns is. It’s 1/(distinct values). The value can be seen for any index using DBCC Show_Statistics with the DENSITY_VECTOR option.

This means, while SQL only knows the actual data distribution of the first column, it does know, on average, how many rows will be returned by an equality match on any left-based subset of the index keys

So, what’s my rule for the order of columns in an index key? Put the most selective columns first, when all other considerations are equal.

What does that mean? Well, let’s consider a hypothetical table:

CREATE TABLE ConsideringIndexOrder (
ID INT IDENTITY,
SomeString VARCHAR (100),
SomeDate DATETIME DEFAULT GETDATE()
);

Say this table has 10000 rows. It’s a heap, no clustered index. Let’s further say that there are 100 different values for Somestring, and 5000 different values for SomeDate. ID, since it’s an identity, is unique.

There’s a single nonclustered index on all three columns, in the order ID, SomeDate, SomeString.

Considering equalities first, this index can only used to seek for queries of the following forms:
… WHERE ID = @ID AND SomeDate = @dt AND SomeString = @str
… WHERE ID = @ID AND SomeDate = @dt
… WHERE ID = @ID

In other words, it’s only useful when the columns in the where/join are a left-based subset of the index key columns.

It cannot be used to seek on by a query that only filters on SomeDate, only on SomeString or the combination of the two. It’s like using the phonebook (which is an index ordered by surname, first initial) to find all the people whose firstname starts with M. It’s possible, but only by reading the entire book, i.e. by doing a scan.

It’s useless to have the most selective column of the index on the left if very few queries filter on it. Queries that don’t filter on it, but do filter on the other columns of the index will have to scan, and scans are expensive.

If it turns out that there’s more than one column that can be first, based on the queries that run against the table, or if all queries do equality matches on two or more columns, then put the most selective one first so that SQL has more chance to know that the index is useful. Otherwise, put the columns that are more often filtered on earlier in the index, so that the index can be used by more queries.

Let’s take a look at some query scenarios based on the hypothetical table above to see how that index will be used.

Scenario 1: Equality match on the ID column

This one’s the simplest. Since it’s a direct equality match on the leading column of an index, it’s a single seek operation to find the row.

Scenario 2: Equality match on the ID and Date columns

This one’s also simple. Since it’s a direct equality match on a left-based subset of the index columns, it’s a single seek operation to find the row.

Scenario 3: Equality match on the ID and String columns

This one’s a little harder. Only the filter on ID can be done as part of the index seek because the string column is not the second column in the index. The date is, and this query’s not filtering on the date. Hence this will require a seek operation on ID and then the string column checked to see if it matches. It’s no longer a single seek operation, even though the second predicate is done by the seek operator. SQL has to seek to find the matching ID and then compare the value of the string column to see if it matches.

Scenario 4: Equality match on the Date and String Columns

In this case, SQL can’t do a seek at all. The leading column of the index is not used in the where clause and, as such, the only way to satisfy this query is to scan. In fact, SQL decides to do a table scan and evaluate each of the rows in the table against the values specified for the date and string columns

I think that’s pretty much enough on indexes and equality predicates. I know I’ve probably muddied the water more than clarifying it, but I hope that I’ve given at least someone a better idea of how indexes work and are used for equality matches. Either later this week or next week (work permitting) I’ll take a look at some more complicated scenarios, that of inequality predicates

19 Comments

Pingback: Choosing the order of columns in an index « Home of the Scary DBA
Michelle Ufford 21 January 2009 at 01:46

I was actually just wondering about this topic the other day, so imagine my delight when I saw your latest blog post. Thanks, Gail, for another excellent article!

Reply ↓
Pingback: SQL Server and Cloud Links for the Week | Brent Ozar - SQL Server DBA
Lian 21 April 2009 at 12:50

As always, time spent reading your posts is time spent well.

Reply ↓
Gail 21 April 2009 at 13:34

Glad you find them useful.

Reply ↓
Degraft 29 May 2009 at 16:16

Gail you are great. I’m just an index beginner and your article has given me enough insight. Thanks so much.

Reply ↓
Gail 31 May 2009 at 15:19

Glad you liked it.

Reply ↓
Bruce W Cassidy 9 July 2009 at 23:35

Gail, as always, a good discussion on the topic.

Reply ↓
Glenn 4 February 2010 at 00:57

I’ve been using SQL Server for about 5 years, and have found that every resource I’ve encountered on indexing always sort of glosses over how to actually do it. I was still left asking “but what are the rules?”. Your 24 hours of Pass talk on Effective Indexing was the first time I took a page of notes that actually summarised some key rules – technicques and rules of thumb for equalities vs inequalities, left based sets, group by and order by etc. Thanks for laying out some really useful indexing guidlines!

Reply ↓
Gail 4 February 2010 at 11:11

Pleasure.

There’s an introductory series on Indexing on SQLServerCentral that you may find useful. http://www.sqlservercentral.com/articles/Indexing/68439/

Reply ↓
frank 9 January 2011 at 09:57

great article . you clarified most of my doubts about order by columns in indexes.

scenario 4 should be a index scan rather than table scan

Reply ↓
Divine Flame 6 September 2012 at 13:13

Very good article & very well explained. Thanks for sharing such excellent information reagrding indexes, Gail.

Reply ↓
Vimal Singh Rathore 18 December 2012 at 09:12

Nice article concept explained with ease.

Reply ↓
RR 1 July 2013 at 12:43

HI Gail,

I have created created partitioning with non clustered index on date time column. If verify partitioning by using object explorer
(storage in particular table properties ), it showing , table is not partitioned. How is the best way to find partitioned or not.
thanks.
RR

Reply ↓
Gail (Post author)2 July 2013 at 16:31

Please post your question in one of the SQL forums. This post has nothing to do with partitioning and a blog’s comments are not the appropriate place for random questions

Reply ↓
OA 6 May 2015 at 13:28

I was wondering if you could answer a few queries I have, please? You said in this post:

There’s a single nonclustered index on all three columns, in the order ID, SomeDate, SomeString.

Considering equalities first, this index can only used to seek for queries of the following forms:
… WHERE ID = @ID AND SomeDate = @dt AND SomeString = @str
… WHERE ID = @ID AND SomeDate = @dt
… WHERE ID = @ID

In other words, it’s only useful when the columns in the where/join are a left-based subset of the index key columns.

If you had a WHERE predicate that includes the first and last column of the index, like this,

… WHERE ID = @ID AND SomeString = @str

Would the optimiser use a nonclustered index seek (assuming that each column is highly selective and the rows returned are highly selective)?

Also, if you have a nonclustered index containing, say, 7 highly selective columns and have a WHERE predicate for the 2nd and the 3rd columns, will nonclustered index seek be used?

Thanks.

Reply ↓
Gail (Post author)7 May 2015 at 16:48

Your first example, yes that will be a seek on the ID column only, there will be a secondary predicate to apply the filter on SomeStr.

Again, not selectivity, it’s about the order of columns in the index.

Your second example, no, all you’ll get there is an index scan. And again nothing to do with the selectivity, just the order of columns in the index.

Reply ↓
OA 7 May 2015 at 17:02

Thanks Gail, that was really helpful.

Reply ↓
sateesh 29 July 2015 at 14:07

Thanks Gail

Reply ↓

Index columns, selectivity and equality predicates

Share this:

19 Comments

Leave a Comment Cancel reply