It’s well known that indexes on columns used in where clause and for joins is a good thing in SQL, but what about other places. How about on aggregates?
Consider a simple table with an amount and a customerID. It’s a common requirement to calculate the total amount that each customer has paid. No conditions are enforced, so this would seem like a place where an index won’t help. Well, let’s see. (sample code at end)
The clustered index (and hence the physical order of the rows) is on the identity column.Take the following query.
SELECT CustomerID, SUM(Amount) FROM Payments group by customerID
Without any additional indexes, SQL will execute that as a hash match (aggregate) which comprises 63% of the query’s cost. That is because the data is not ordered by the grouped column, and hence cannot be simply be summed.
Now add an index on CustomerID, Amount and run the same query. Now, the aggregate can be satisfied just with the non clustred index and, more importantly, the data is in order of customerID, so just a stream aggregate is required and the data does not have to be resorted or run through a hashing algorithm. Moreover, the stream aggregate is only 15% of the query’s cost
CREATE TABLE Payments ( PaymentKey INT IDENTITY PRIMARY KEY, CustomerID int, InvoiceID int, PaymentDate datetime, Amount Numeric(17,5) ) ;WITH DataPopulate (RowNo, CustomerID ,InvoiceID ,ADate, Amount) AS ( SELECT 1 AS RowNo, FLOOR(RAND()*40) as CustomerID, FLOOR(RAND()*200) as InvoiceID, DATEADD(dd, FLOOR(RAND()*75454),'1753/01/01'), rand()*856542 AS Amount UNION ALL SELECT rowNo+1, FLOOR(RAND(RowNo*85455)*40) AS CustomerID, FLOOR(RAND(RowNo*85455)*500) AS InvoiceID, DATEADD(dd, FLOOR(RAND(RowNo*96322)*85454),'1753/01/01'), rand(RowNo*25411)*86542 AS Amount FROM DataPopulate WHERE RowNo<10000 ) INSERT INTO Payments (CustomerId, InvoiceID, PaymentDate, Amount) SELECT CustomerID ,InvoiceID ,ADate, Amount FROM DataPopulate OPTION (MAXRECURSION 10000)