Earlier this year I had a look at a query pattern that I often see on forums and in production code, that of the Distinct within an IN subquery. Today I’m going to look at a similar patters, that being the use of TOP 1 within an EXISTS subquery.
Three tests. First a straightforward exists with no correlation (no where clause linking it to an outer query). Second, an exists with a complex query (one with a non-sargable where clause and a group by and having). Third an exists subquery correlated to the outer query.
Table structures are nice and simple, in fact, for ease I’m going to use the same tables as I did back on the exists, in and inner join tests. Code to create and populate the tables it attached to the end of the post.
First up, a simple exists query, in an IF, just to be different.
IF EXISTS (SELECT 1 FROM PrimaryTable_Medium) PRINT 'Exists' IF EXISTS (SELECT TOP (1) 1 FROM PrimaryTable_Medium) PRINT 'Exists too'
For a benchmark, a SELECT 1 FROM PrimaryTable_Medium has the following IO characteristics
Table ‘PrimaryTable_Medium’. Scan count 1, logical reads 89, physical reads 0.
SQL Server Execution Times:
CPU time = 15 ms, elapsed time = 510 ms.
Ignore the elapsed time, that’s likely mostly from displaying the records. I’m going to focus mostly on the CPU and IO.
Execution plans of the two exists variations are absolutely identical.
The index operators are scans because there is no way they could be anything else, there’s no predicate so a seek is not possible. That said, it’s not a full index scan. The properties of the Index Scan show 1 row only (actual and estimated). So SQL did not read the entire index, just enough to evaluate the EXISTS, and that’s what it did in both cases. IO stats confirm that.
Table ‘PrimaryTable_Medium’. Scan count 1, logical reads 2, physical reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
ExistsTable ‘PrimaryTable_Medium’. Scan count 1, logical reads 2, physical reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Exists too
Two reads in each case and a CPU time so low it’s immeasurable. A full scan of the index takes 89 reads (as shown earlier) so it should be clear that SQL read a minimal amount of data, both when the TOP was specified and when it wasn’t.
On to a more complex test. Again, using EXISTS within an IF
IF EXISTS ( SELECT 1 FROM PrimaryTable_Medium WHERE RIGHT(SomeColumn,2) > 'HH' GROUP BY LEFT(SomeColumn,1) HAVING COUNT(*) > 1 ) PRINT 'Exists Again' IF EXISTS ( SELECT TOP (1) 1 FROM PrimaryTable_Medium WHERE RIGHT(SomeColumn,2) > 'HH' GROUP BY LEFT(SomeColumn,1) HAVING COUNT(*) > 1 ) PRINT 'Still Exists'
If I run just the SELECT 1 alone, 10 rows are returned.
Execution plans are a lot more complex, pretty much to be expected. They’re still identical, as are the IOs and CPU time.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘PrimaryTable_Medium’. Scan count 1, logical reads 89, physical reads 0..SQL Server Execution Times:
CPU time = 31 ms, elapsed time = 27 ms.
Exists AgainTable ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘PrimaryTable_Medium’. Scan count 1, logical reads 89, physical reads 0.SQL Server Execution Times:
CPU time = 31 ms, elapsed time = 45 ms.
Still Exists
This time the index scan was a scan of the entire index (89 pages). Because of the aggregation and the having, SQL couldn’t abort the scan once it had what it needed. All rows needed to be returned so that the aggregation and subsequent filter could be done.
One last test, with an EXISTS subquery.
I’m going to create a secondary table that has one 20% of the values for SomeColumn in PrimaryTable_Medium, but has each one repeated 500 times for a total of 615000 rows.
SELECT ID , SomeColumn FROM dbo.PrimaryTable_Medium pm WHERE EXISTS (SELECT 1 FROM dbo.Secondary s WHERE pm.SomeColumn = s.SomeColumn) SELECT ID , SomeColumn FROM dbo.PrimaryTable_Medium pm WHERE EXISTS (SELECT TOP(1) 1 FROM dbo.Secondary s WHERE pm.SomeColumn = s.SomeColumn)
Again, the execution plans are absolutely identical
So, for that matter, are the execution statistics
Table ‘PrimaryTable_Medium’. Scan count 1, logical reads 22, physical reads 0.
Table ‘Secondary’. Scan count 1, logical reads 1605, physical reads 0.SQL Server Execution Times:
CPU time = 171 ms, elapsed time = 443 ms.Table ‘PrimaryTable_Medium’. Scan count 1, logical reads 22, physical reads 0.
Table ‘Secondary’. Scan count 1, logical reads 1605, physical reads 0.SQL Server Execution Times:
CPU time = 172 ms, elapsed time = 387 ms.
So, in conclusion, is there any point in adding a TOP to an exists subquery? Does it persuade SQL to return only the minimum information needed to satisfy the Exists?
No to both. The Exists operator itself tries to retrieve jus the absolute minimum of information, so the addition of TOP 1 does nothing except add 5 characters to the query size.
Good Stuff, I’ve been trying to educate our developers on this very topic over the past few months. I think they believed me anyway, but if I continue to see the pattern I’ll send them here…
It’s for this reason that I use EXISTS(SELECT 1/0 FROM …) in my demo/debugging scripts. Shows very quickly that that portion of the statement never even gets evaluated.
EXISTS queries are interesting… Like Brandon in his comment, I did some goofy things with them, showing that whatever information you SELECT in an EXISTS subquery makes no difference at all (because all the optimizer cares about are ROWS and not individual columns):
http://bradsruminations.blogspot.com/2009/09/age-old-select-vs-select-1-debate.html
–Brad
erm… it adds six characters – you missed the tailing (or leading) whitespace…
I suppose this on SQL 2005 or 2008 because on old versions there was a diff. and this is a habit and I personally do not see anything wrong with it. but this is just me …
I don’t have a 2000 instance any longer. I don’t even have a 2005 instance any longer.
Other than a waste of typing and potentially misleading people who maintain the code later there’s nothing wrong, but there’s nothing useful either.
These sort of uses find their root in small anomalies that these tests cannot expose.
For example, I once has a query where a top 1 min(x) worked way faster, in the orders of tens of times then without the top. It got me so confused, I had to made sure the top 1 worked as I thought it should and not affect the final outcome at all.
It worked at it should have, but still the speedup was insane. I tried using the construct in other cases, just joping it would make those faster too, to no avail.
It problably all boils down to the way the optimizer works, allowing just a limited series op plan evaluations before setteling on the one it deems good enaugh. Adding some functional needless complexity can increase the time the optimizer allocated to itself for finding a good enaugh plan and thus result in a better plan.
The same goes for hiding loginc in inline table functions. Sometimes they slow things down, other times they improve things quite a bit.
Pingback: Something for the Weekend – SQL Server Links 15/04/11 | John Sansom - SQL Server DBA in the UK
Top (1) used in the Exists query makes difference when there is no where clause, it does not have any impact with joins (i.e. join will also make the difference when compare to without Top (1)). But when the where clause comes into the picture SQL server has to scan the rows in order to check for the matching rows and after that only it will return one row out of the matching rows. Hence Top (1) does not improve the performance in that case.
My first test in this post was without a where clause. No difference there either.