Or “What’s the fastest way to count the rows?”
It’s a fairly common to need to know the number of rows in a table, the number of rows that match a certain condition or whether or not there are any rows that match a condition. There’s also a number of ways of doing so, some better than others. The problem being that counting is not a cheap operation, especially on big tables. It’s not as bad as a sort, but it still can be expensive.
So, given that, let’s take a look at some of the ways.
Querying the metadata
If all that’s needed is the number of rows in the table, and it’s not 100% important that the value be completely accurate all the time, the system metadata can be queried. In SQL 2000 and below, that info was in sysindexes. In 2005 and higher it’s been moved into sys.partitions.
SELECT OBJECT_NAME(object_id) AS TableName, SUM(rows) AS TotalRows FROM sys.partitions WHERE index_id in (0,1) AND object_id = OBJECT_ID('TableName') GROUP BY object_id
The advantage of this approach is that it is fast. Since it’s not actually counting anything and, in fact, isn’t even accessing the table that’s being counted, it’s the fastest way to get the count of rows in the table.
The disadvantage is it can only get the number of rows in the table and cannot consider any criteria at all. It also may not be 100% accurate, depending how and when the table’s rowcount metadata is updated by the SQL engine.
Checking for existence of rows
If all that’s needed is to know whether or not a row exists for a certain criteria, use EXISTS rather than counting the entire number of rows and checking that it’s greater than 0. It’s (slightly) quicker and it uses fewer IOs.
CREATE TABLE TestingCounts ( ID INT IDENTITY PRIMARY KEY, AChar CHAR(1), SomePaddingColumn CHAR(400) ) INSERT INTO TestingCounts (SomePaddingColumn, AChar) SELECT TOP (100000) a.name, LEFT(b.name,1) FROM sys.columns a CROSS JOIN sys.columns b GO SET STATISTICS IO ON SET STATISTICS TIME ON GO DECLARE @RowCount INT SELECT @RowCount = COUNT(*) FROM TestingCounts WHERE ID>50000 IF (@RowCount > 0) PRINT 'Count - Rows Exist' IF EXISTS (SELECT 1 FROM TestingCounts WHERE ID>50000) PRINT 'Exists - Rows Exist'
Count:
Table ‘TestingCounts’. Scan count 1, logical reads 2645, physical reads 0.
SQL Server Execution Times:
CPU time = 15 ms, elapsed time = 13 ms.
Exists:
Table ‘TestingCounts’. Scan count 1, logical reads 3
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Count(*) vs Count(ColumnName)
The difference between these two is not always understood, and I’ve often seen hints and tips that say that COUNT(ColumnName) is better than COUNT(*). That advice is wrong and it’s possible that it comes from the (correct) advice that SELECT <Column List> is better than SELECT *.
In truth, COUNT(*) means to count the number of rows in the resultset. COUNT(ColumnName) means to count the number of rows in the resultset where that column is not null. Hence, COUNT(ColumnName) will return a different result to COUNT(*) if the column that’s used is nullable and contains any nulls.
So what does that mean for the performance of the two?
Because COUNT(*) just means count the rows then, assuming that there are no other criteria in the query, to satisfy it SQL will find the smallest index that exists on the table and scan the leaf pages to count the rows.
COUNT(ColumnName) requires that the column specified be checked to see if there are any null values. If the column that’s specified is defined as NOT NULLable, them SQL treats it just like a COUNT(*). If the column is nullable then, regardless or whether or not it contains any NULLs, it has to be checked. That means that to evaluate a COUNT(ColumnName), SQL must either scan an index that has that column in it or it must do a full table scan.
This means that, at best, a COUNT(Column) can be as fast as a COUNT(*), but it cannot be faster. It also means that COUNT(*) may return a different value to COUNT(Column)
-- using the same table created above CREATE INDEX idx_AChar ON TestingCounts (Achar) GO SELECT COUNT(*) FROM TestingCounts WHERE ID%3 = 0 SELECT COUNT(SomePaddingColumn) FROM TestingCounts WHERE ID%3 = 0
Count(*):
Table ‘TestingCounts’. Scan count 1, logical reads 138.
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 15 ms.
COUNT(SomePaddingColumn):
Table ‘TestingCounts’. Scan count 1, logical reads 5285
SQL Server Execution Times:
CPU time = 47 ms, elapsed time = 47 ms.
Of course, if there are other conditions in the query or a group by, the situation becomes rather more complicated as to what indexes will be used. The main point though remains true. Because COUNT(ColumnName) has to check that column for NULL values and COUNT(*) does not, COUNT(ColumnName) cannot be the faster option.
Nice summary, Gail. Especially the Count(*) vs. Count(ColumnName) section. I think I knew it, but not as well as you expressed it.
Well put. One thing I would change is the ‘(slightly)’ in “Checking for Existence”. On large tables, this will likely make a big difference. If the column is not indexed, ‘if exists’ requires only a scan until it reaches the first record with the value, instead of scanning the whole table. If it the column is indexed but there are a lot of duplicates, ‘if exists’ avoids having to count all the leaf level nodes of the index with the given value.
After seeing the below query and comment
SELECT OBJECT_NAME(object_id) AS TableName, SUM(rows) AS TotalRows
FROM sys.partitions
WHERE index_id in (0,1)
AND object_id = OBJECT_ID(’TableName’)
GROUP BY object_id
a doubt came to mind. Since you said “It also may not be 100% accurate, depending how and when the table’s rowcount metadata is updated by the SQL engine”.My doubt is that if we run the above query immediately after updating IO STATISICS, would it not be accurate ?
IO Statistics? You mean after updating column statistics?
The data in sys.partitions is not dependent on column statistics and is not affected by column statistics.
Pingback: SQL Server and Cloud Links for the Week | Brent Ozar - SQL Server DBA
Instead of using exists, why not just select top 1 * into a CTE and count that? May not be as good performance wise, would need to test it 🙂
“If all that’s needed is to know whether or not a row exists for a certain criteria, use EXISTS rather than counting the entire number of rows and checking that it’s greater than 0.”
The example uses:
IF EXISTS (SELECT 1 FROM TestingCounts WHERE ID>50000)
PRINT ‘Exists – Rows Exist’
If you ever needed to implement this in a query, you could do the same thing. This is also known as a semi-join.
e.g.
SELECT o.OrgUnitName
FROM OrgUnit AS o
WHERE o.OrgUnitID IN
(
SELECT p.OrgUnitID FROM Person AS p
WHERE p.OrgUnitID = o.OrgUnitID
)
Oops… my above example query is correct, but with the wrong syntax!
Try:
SELECT o.OrgUnitID
FROM OrgUnit AS o
WHERE EXISTS
(
SELECT p.OrgUnitID FROM Person AS p
WHERE p.OrgUnitID = o.OrgUnitID
)
Pikes: Why go to all the trouble of declaring a CTE and then counting that. You’re not gaining anything by doing so.
Thats what I love about powershell, the number of rows is essentially a property of a table. Instantiate the database as an object, grab the tables collection and select the name of the table and the rowcount. Want the dataspace and indexspace usage for each table? Add these to the select list in the pipeline. Want to sort by rowcount or dataspaceused? Add a sort-object to the pipeline. Done.
Yes you are right! I think your solution is the most elegant, by far.
David, how does powershell get the row count? By actually counting the rows in the table or by using the metadata?
Do you know why a SELECT COUNT(column1) FROM table1 takes about twice as much time to complete than a SELECT COUNT(column1) FROM table1 WHERE column1 IS NOT NULL? I looked at the execution plans but couldn’t find out why there is such a big difference. In my case 97% of the plan is due to an Index Scan in both cases, with the same values in the details.
Excelente información para los que somos movatos
great article I loved this one and was brought back it today, when trying to help someone with a problem at SSC:
http://www.sqlservercentral.com/Forums/Topic714996-338-3.aspx
Great article!
I think that COUNT(1) will also be treated in the same way as COUNT(*). Is it correct?
It should be.
Gail,
Instead of “Select 1 from Tablename” IF we use “Select Top 1 ” will it decrease performance?
Sm
Why don’t you test it out and tell me?
ok, let me test it ‘n Come back to you Gail..again..see you…then
Sm
Gail, Now just I have tested but all are same. may be its my dev box. but can you tell me how to calculate CPU time = 16 ms, elapsed time = 15 ms.. And also I tried with mine query Top 1 statements but no difference i found. Can u tell me y so?
Sm.
Gail,
Thanks a Lot for the Wonderful Explanation..
There is a wrong assumption that sys.partitions uses statistics between many DBAs I talked to.
I’m happy to read that you have stated that it is not based on statistics.