IN vs INNER JOIN
Often in forum threads discussing query performance I’ll see people recommending replacing an INNER JOIN with an IN or EXISTS (or recommending replacing an IN or EXISTS with an INNER JOIN) for performance reasons. I’ve previously looked at how the IN and EXISTS compared, now I’m going to investigate and see how IN compares with the join.
One very important thing to note right off is that they are not equivalent in all cases.
An inner join between two tables does a complete join, it checks for matches and returns rows. This means, if there are multiple matching rows in the second table, multiple rows will be returned. Also, when two tables are joined, columns can be returned from either. As a quick example:
DECLARE @BigTable TABLE ( id INT IDENTITY PRIMARY KEY, SomeColumn CHAR(4), Filler CHAR(100) ) Insert into @BigTable(SomeColumn) Values (1) Insert into @BigTable(SomeColumn) Values (2) Insert into @BigTable(SomeColumn) Values (3) Insert into @BigTable(SomeColumn) Values (4) Insert into @BigTable(SomeColumn) Values (5) DECLARE @SomeTable TABLE (IntCol int) Insert into @SomeTable (IntCol) Values (1) Insert into @SomeTable (IntCol) Values (2) Insert into @SomeTable (IntCol) Values (2) Insert into @SomeTable (IntCol) Values (3) Insert into @SomeTable (IntCol) Values (4) Insert into @SomeTable (IntCol) Values (5) Insert into @SomeTable (IntCol) Values (5) SELECT * FROM @BigTable b INNER JOIN @SomeTable s ON b.SomeColumn = s.IntCol
This returns 7 rows and returns columns from both tables. Because the values in @SomeTable are duplicated, the matching rows from BigTable are returned twice.
With an IN, what is done is a semi-join, a join that checks for matches but does not return rows. This means if there are multiple matching tables in the resultset used for the IN, it doesn’t matter. Only one row from the first table will be returned. Also, because the rows are not returned, columns from the table referenced in the IN cannot be returned. As a quick example
DECLARE @BigTable TABLE ( id INT IDENTITY PRIMARY KEY, SomeColumn CHAR(4), Filler CHAR(100) ) Insert into @BigTable(SomeColumn) Values (1) Insert into @BigTable(SomeColumn) Values (2) Insert into @BigTable(SomeColumn) Values (3) Insert into @BigTable(SomeColumn) Values (4) Insert into @BigTable(SomeColumn) Values (5) DECLARE @SomeTable TABLE (IntCol int) Insert into @SomeTable (IntCol) Values (1) Insert into @SomeTable (IntCol) Values (2) Insert into @SomeTable (IntCol) Values (2) Insert into @SomeTable (IntCol) Values (3) Insert into @SomeTable (IntCol) Values (4) Insert into @SomeTable (IntCol) Values (5) Insert into @SomeTable (IntCol) Values (5) SELECT * FROM @BigTable WHERE SomeColumn IN (Select IntCol FROM @SomeTable)
This returns 5 rows and only columns from BigTable.
So, that said, how does the performance of the two differ for the cases where the results are identical (no duplicates in the second table, no columns needed from the second table)? For that, I’m going to need larger tables to play with. (more…)