This one’s come up a few times recently, so I’ll take a look at it.
The EXISTS and IN clauses at first glance look fairly similar. They both use a subquery to evaluate rows, but they do it in a slightly different way
IN does a direct match between the column specified before the IN keyword and the values returned by the subquery. When using IN there can only be a single column specified in the select clause of the subquery
Let’s have a look at a quick example
DECLARE @SomeTable (IntCol int) Insert into @SomeTable (IntCol) Values (1) Insert into @SomeTable (IntCol) Values (2) Insert into @SomeTable (IntCol) Values (3) Insert into @SomeTable (IntCol) Values (4) Insert into @SomeTable (IntCol) Values (5) SELECT * FROM BigTable WHERE SomeColumn IN (Select IntCol FROM @SomeTable)
So this query returns all the rows from BigTable where SomeColumn has any of the values returned by the subquery, that is 1,2,3,4 or 5
But what if there were duplicate values returned by the subquery?
Well, it actually doesn’t matter. All that SQL is looking for is what values the subquery returns to process the filter. It’s not joining the two resultsets together so it makes no difference to the results if there are duplicate values returned by the subquery.
To put it more technically, SQL’s doing a semi-join, a join that can only eliminate or qualify rows from the first table, but cannot duplicate them.
So that’s IN. What about EXISTS
Exists doesn’t check for a match, it doesn’t care in the slightest what values are been returned from the expression, it just checks for whether a row exists or not. Because of that, if there’s no predicate in the WHERE clause of the subquery that compares rows in the subquery with rows in the outer query, EXISTS will either return true for all the rows in the outer query or it will return false for all the rows in the outer query
DECLARE @SomeTable (IntCol int) Insert into @SomeTable (IntCol) Values (1) Insert into @SomeTable (IntCol) Values (2) Insert into @SomeTable (IntCol) Values (3) Insert into @SomeTable (IntCol) Values (4) Insert into @SomeTable (IntCol) Values (5) SELECT * FROM BigTable WHERE EXISTS (Select IntCol FROM @SomeTable)
This will also return every single row in BigTable, because Select IntCol FROM @SomeTable returns 5 rows and hence the EXISTS predicate is always true.
Hence, to use EXISTS to do the same kind of thing as IN, there must be a correlation predicate within the subquery
DECLARE @SomeTable (IntCol int) Insert into @SomeTable (IntCol) Values (1) Insert into @SomeTable (IntCol) Values (2) Insert into @SomeTable (IntCol) Values (3) Insert into @SomeTable (IntCol) Values (4) Insert into @SomeTable (IntCol) Values (5) SELECT * FROM BigTable bt WHERE EXISTS (Select IntCol FROM @SomeTable st WHERE bt.SomeColumn = st.IntCol)
Now this will behave like the IN because it’s checking for matching rows and only returning true when there is a match. This will return all the rows from BigTable where SomeColumn has values 1,2,3,4 or 5 because those are the
Exists is better for when comparisons are needed on two or more columns. For eg, this cannot be done easily with an IN
DECLARE @SomeTable (IntCol int, charCol char(1)) Insert into @SomeTable (IntCol, charCol) Values (1, 'a') Insert into @SomeTable (IntCol, charCol) Values (2, 'a') Insert into @SomeTable (IntCol, charCol) Values (3, 'a') Insert into @SomeTable (IntCol, charCol) Values (4, 'b') Insert into @SomeTable (IntCol, charCol) Values (5, 'b') SELECT * FROM BigTable bt WHERE EXISTS (Select IntCol FROM @SomeTable st WHERE bt.SomeColumn = st.IntCol AND bt.SomeOtherColumn = st.charCol)
So that covers how they work, but how do they perform in comparison with each other? To answer that question, first I need some fairly large tables.
Create Table BigTable ( id int identity primary key, SomeColumn char(4), Filler char(100) ) Create Table SmallerTable ( id int identity primary key, LookupColumn char(4), SomeArbDate Datetime default getdate() ) INSERT INTO BigTable (SomeColumn) SELECT top 250000 char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) + char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8)) from master.sys.columns a cross join master.sys.columns b INSERT INTO SmallerTable (LookupColumn) SELECT DISTINCT SomeColumn FROM BigTable TABLESAMPLE (25 PERCENT) -- (3955 row(s) affected)
Let’s first try without indexes and see how EXISTS and IN compare.
-- Query 1 SELECT ID, SomeColumn FROM BigTable WHERE SomeColumn IN (SELECT LookupColumn FROM SmallerTable) -- Query 2 SELECT ID, SomeColumn FROM BigTable WHERE EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)
The first thing to note is that the execution plans are identical. Two clustered index scans and a hash join (right semi-join).
The IOs are also identical.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘BigTable’. Scan count 1, logical reads 3639, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 15, physical reads 0.
So these two queries are executed by SQL in exactly the same way. No performance differences here.
Now, let me add indexes to both tables, on that join column and see what changes.
CREATE INDEX idx_BigTable_SomeColumn ON BigTable (SomeColumn) CREATE INDEX idx_SmallerTable_LookupColumn ON SmallerTable (LookupColumn)
With those created, I’m going to run the above two queries again. Again the execution plans of the two are identical, though the hash join and clustered index scans are gone, replaced by index scans, a stream aggregate and a merge join (inner join)
The IOs are again identical and the execution times very close.
Table ‘BigTable’. Scan count 1, logical reads 343, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 8, physical reads 0.
So IN and EXISTS appear to perform identically both when there are no indexes on the matching columns and when there are, and this is true regardless of whether of not there are nulls in either the subquery or in the outer table.
Next up, a look at how IN compares to Inner Join for the purposes of finding matching rows
Always to review the basics. Thanx!
Nice. Thanks.
great mail… thanks
Thanks, When the value is IN another table I always use EXISTS.
Pingback: SQL Server Central
Great series very helpful
Pingback: WHERE NOT EXISTS in the Twelve Days of SQL | Noel NOT NULL;
Thank you so much, you saved me before entering tomorrow’s exam & I had a great confusion in this part. Thanks again.
Pingback: Exists versus In – SQL | michaelmholt.com
Worth remembering the potential issue with NULLs:
http://www.tech-recipes.com/rx/57894/sql-server-three-valued-logic-not-clause-null-values/
Which I discussed in the post on NOT IN: https://www.sqlinthewild.co.za/index.php/2010/02/18/not-exists-vs-not-in/