Index columns, selectivity and inequality predicates

So, following on from my post last week, I’m going to take a look at how selectivity and index column order affect inequality predicates.

One thing to note straight off is that the selectivity of a column is much less important for inequality predicates than it was for equality. For equality predicates, the selectivity alone can give a reasonable idea of the number of rows a particular predicate will return. That’s not the case with inequalities. Also, with inequality predicates, the order of columns in the index becomes very important.

One of the most important considerations with inequality predicates is the number of rows that the predicate will return. An identity column may be highly selective, but if the filter is for all rows > 0 and the identity values start t one, then an index on that column is not going to be very useful.

The other consideration when there are inequality predicates is that only that column and columns to the left of it in the index key can be used for index seeks. Any columns to the right of the column with the inequality is no longer eligible for seeking.

To explain with an example, consider our hypothetical table from the previous post (with one small change):

CREATE TABLE ConsideringIndexOrder (
SomeString VARCHAR (100),

The same as previously, there’s a single nonclustered index on all three columns, in the order ID, SomeDate, SomeString.

If there’s an inequality predicate, then then the index is only fully seekable for the following queries
…  WHERE ID = @ID AND SomeDate = @dt AND SomeString > @str
…  WHERE ID = @ID AND SomeDate > @dt


Index columns, selectivity and equality predicates

Or “Which column goes first?

There’s a common piece of advice given about columns in an index key that says that the most selective column should go first. I’m not going to say that’s incorrect, because it’s not. The problem is that it’s often given without any explanation as to why the most selective column should go first, nor are the other considerations for index key order mentioned.

This can lead to misunderstandings like, in the extreme case, where one person after hearing that advice went and added the primary key column as the leading column of every single nonclustered index (because it’s highly selective), and then wondered why his database performance decreased dramatically.

The comment about selectivity is because of the way SQL keeps statistics on indexes (see my post on statistics for more info on what they are). SQL only keeps the histogram for the first column of the index. That means that it only knows the actual distribution of values of the first column. If the first column is not selective, the index may not be used. However, that’s not the whole story.

SQL also, in addition to the histogram, keeps density values for all of the left-based subsets of the index keys. So, for a 3 column index key, SQL knows the density of the first column, of the first and second and of all three. The density is, in a nutshell, a value that shows how unique the set of columns is. It’s 1/(distinct values). The value can be seen for any index using DBCC Show_Statistics with the DENSITY_VECTOR option.

This means, while SQL only knows the actual data distribution of the first column, it does know, on average, how many rows will be returned by an equality match on any left-based subset of the index keys

So, what’s my rule for the order of columns in an index key? Put the most selective columns first, when all other considerations are equal.


Seek or scan?

One very common question that I see on the forums is on index seeks and index scans. A query is resulting in a table/clustered index scan, even though there’s an index on one or more of the columns been searched on.

One of the more common reasons that this happens is because the index in question is not covering, and SQL has determined that the cost of doing the lookups to fetch the extra columns is higher than the cost of scanning the entire table.

If an index does not cover a query, then bookmark lookups are required to get the additional columns, bookmark lookups are run one row at a time, and are seeks on the clustered index. Hence it’s clear that bookmark lookups on a large number of rows are exceedingly expensive and that is why SQL will switch to a clustered index/table scan when lookups are required on a significant percentage of the rows in the table.

So, what constitutes a significant percentage of the rows in the table? 50%? 20%? 10%?


What is fragmentation?

It’s common knowledge that SQL indexes become fragmented over time and should be rebuilt to remove that fragmentation. But what exactly is fragmentation?

First, a little bit on index structure.

The index key defines the logical order of the index leaf pages. Each page has two pointers, one to the previous page and one to the next page in the index.

A hypothetical index on an integer column

In an ideal situation, the logical order of the pages would correspond to the physical order, and a page with higher index key values  would have a higher page number (corresponding to physical position in the file) than a page with lower index key values.

If that is the case for all pages in the index leaf level, then the index has no fragmentation. Unfortunatly, that is not normally the case. Inserts and updates to the index can cause page splits. A page split moves half the rows on one of the index leaf pages to a new page and then linkes that page into the chain. It’s unlikely that a free page will be in the correct physical logation in the file and hence after the page split, the physical and logical ordering of the pages will no longer be the same.


Do wide indexes slow queries down?

I got asked this question during one of my TechEd sessions. It was a bit complex to answer there and then, so I answered by mail a few days later, and I thought it would make a good blog post if expanded a bit more.

Assume a table with 5 columns (imaginatively called A, B, C, D and Z). A and B are ints, C is a datetime and D is a char(50). Let’s assume that Z is an int, identity, primary key and has the clustered index

If I put an index on A and B, the size of the index key (ignoring headers and other overhead) is 8 bytes. Add in the clustering key for row location and each index leaf record in 12 bytes wide. That means that (at 100% fill factor) there are around 660 index rows per leaf page. Since I’m ignoring headers, this rough calc will not exactly match reality, but it’s close enough.

If the table has 100000 rows, the the leaf level of the index consists of 152 pages.

To calculate the number of pages above, one should know that for each leaf page, the next level higher has one index row in – the first value on the page. Hence, for 152 leaf pages, the level above will have 152 index rows as 12 bytes each totalling 1824 bytes. This will fit on one page, so the level above the leaf is the index root, and this index has only two levels.

To seek a single row from this index hence requires 2 page reads.


Indexes for aggregates

It’s well known that indexes on columns used in where clause and for joins is a good thing in SQL, but what about other places. How about on aggregates?

Consider a simple table with an amount and a customerID. It’s a common requirement to calculate the total amount that each customer has paid. No conditions are enforced, so this would seem like a place where an index won’t help. Well, let’s see. (sample code at end)

The clustered index (and hence the physical order of the rows) is on the identity column.Take the following query.
SELECT CustomerID, SUM(Amount) FROM Payments group by customerID