IN vs INNER JOIN

Often in forum threads discussing query performance I’ll see people recommending replacing an INNER JOIN with an IN or EXISTS (or recommending replacing an IN or EXISTS with an INNER JOIN) for performance reasons. I’ve previously looked at how the IN and EXISTS compared, now I’m going to investigate and see how IN compares with the join.

One very important thing to note right off is that they are not equivalent in all cases.

An inner join between two tables does a complete join, it checks for matches and returns rows. This means, if there are multiple matching rows in the second table, multiple rows will be returned. Also, when two tables are joined, columns can be returned from either.  As a quick example:

DECLARE @BigTable TABLE (
 id INT IDENTITY PRIMARY KEY,
 SomeColumn CHAR(4),
 Filler CHAR(100)
)

Insert into @BigTable(SomeColumn) Values (1)
Insert into @BigTable(SomeColumn) Values (2)
Insert into @BigTable(SomeColumn) Values (3)
Insert into @BigTable(SomeColumn) Values (4)
Insert into @BigTable(SomeColumn) Values (5)


DECLARE @SomeTable TABLE (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
 FROM @BigTable b INNER JOIN @SomeTable  s ON b.SomeColumn = s.IntCol

This returns 7 rows and returns columns from both tables. Because the values in @SomeTable are duplicated, the matching rows from BigTable are returned twice.

With an IN, what is done is a semi-join, a join that checks for matches but does not return rows. This means if there are multiple matching tables in the resultset used for the IN, it doesn’t matter. Only one row from the first table will be returned. Also, because the rows are not returned, columns from the table referenced in the IN cannot be returned. As a quick example

DECLARE @BigTable TABLE (
 id INT IDENTITY PRIMARY KEY,
 SomeColumn CHAR(4),
 Filler CHAR(100)
)

Insert into @BigTable(SomeColumn) Values (1)
Insert into @BigTable(SomeColumn) Values (2)
Insert into @BigTable(SomeColumn) Values (3)
Insert into @BigTable(SomeColumn) Values (4)
Insert into @BigTable(SomeColumn) Values (5)

DECLARE @SomeTable TABLE (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
 FROM @BigTable
 WHERE SomeColumn IN (Select IntCol FROM @SomeTable)

This returns 5 rows and only columns from BigTable.

So, that said, how does the performance of the two differ for the cases where the results are identical (no duplicates in the second table, no columns needed from the second table)? For that, I’m going to need larger tables to play with.

CREATE TABLE BigTable (
id INT IDENTITY PRIMARY KEY,
SomeColumn CHAR(4),
Filler CHAR(100)
)

CREATE TABLE SmallerTable (
id INT IDENTITY PRIMARY KEY,
LookupColumn CHAR(4),
SomeArbDate DATETIME DEFAULT GETDATE()
)

INSERT INTO BigTable (SomeColumn)
SELECT top 250000 char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) + char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
FROM master.sys.columns a CROSS JOIN master.sys.columns b

INSERT INTO SmallerTable (LookupColumn)
SELECT DISTINCT SomeColumn
FROM BigTable TABLESAMPLE (25 PERCENT)
-- (3819 row(s) affected)

That’s the setup done, now for the two test cases. Let’s first try without indexes and see how the INNER JOIN and IN compare. I’m selecting from just the first table to ensure that the two queries are logically identical. The DISTINCT used to populate the smaller table ensures that there are no duplicate rows in the smaller table.

SELECT BigTable.ID, SomeColumn
FROM BigTable
WHERE SomeColumn IN (SELECT LookupColumn FROM dbo.SmallerTable)

SELECT BigTable.ID, SomeColumn
FROM BigTable
INNER JOIN SmallerTable ON dbo.BigTable.SomeColumn = dbo.SmallerTable.LookupColumn

Something of interest straight away, the execution plans are almost identical. Not completely identical, but the only difference is that the hash join for the IN shows a Hash Match (Right Semi Join) and the hash join for the INNER JOIN shows a Hash Match (Inner Join)

In Vs Select 1

The IOs are the same and the durations are extremely similar. Here’s the IO results and durations for five tests.

IN

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘BigTable’. Scan count 1, logical reads 3639, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 14, physical reads 0.

SQL Server Execution Times:
CPU time = 156 ms,  elapsed time = 2502 ms.
CPU time = 157 ms,  elapsed time = 2323 ms.
CPU time = 156 ms,  elapsed time = 2555 ms.
CPU time = 188 ms,  elapsed time = 2381 ms.
CPU time = 203 ms,  elapsed time = 2312 ms.

INNER JOIN

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘BigTable’. Scan count 1, logical reads 3639, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 14, physical reads 0.

SQL Server Execution Times:
CPU time = 125 ms,  elapsed time = 2922 ms.
CPU time = 140 ms,  elapsed time = 2372 ms.
CPU time = 188 ms,  elapsed time = 2530 ms.
CPU time = 203 ms,  elapsed time = 2323 ms.
CPU time = 187 ms,  elapsed time = 2512 ms.

Now let’s try with some indexes on the join columns.

CREATE INDEX idx_BigTable_SomeColumn ON BigTable (SomeColumn)
CREATE INDEX idx_SmallerTable_LookupColumn ON SmallerTable (LookupColumn)

Now when I run the two queries, the execution plans are different, and the costs of the two are no longer 50%. Both do a single index scan on each table, but the IN has a Merge Join (Inner Join) and the INNER JOIN has a Hash Match (Inner Join)

InVsSelect 2

The IOs are still identical, other than the WorkTable that only appears for the Hash Join.

IN

Table ‘BigTable’. Scan count 1, logical reads 3639, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 14, physical reads 0.

INNER JOIN

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘BigTable’. Scan count 1, logical reads 3639, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 14, physical reads 0.

So what about the durations? Honestly it’s hard to say anything completely conclusive, the durations of both queries are quite small and they are very close. To see if there is any measurable different, I’m going to run each one 100 times, use Profiler to log the duration and CPU and then average the results over the 100 executions. While running this, I’m also going to close/disable everything else I can on the computer, to try and get reasonably accurate times.

IN

Average CPU: 130.
Avg duration: 2.78 seconds

INNER JOIN

Average CPU: 161.
Avg duration: 2.93 seconds

Now is that enough to be significant? I’m not sure. However, looking at those results along with the IO and execution plans, I do have a recommendation for In vs Inner Join

If all you need is to check for matching rows in the other table but don’t need any columns from that table, use IN. If you do need columns from the second table, use Inner Join.

I still intend to go over NOT IN and NOT EXISTS and, after this one, I also want to take a look at the LEFT JOIN with IS NULL check vs NOT EXISTS for when you want rows from Table1 that don’t have a match in Table 2.

28 Responses to “IN vs INNER JOIN”

  1. Gail,

    Interesting results as I would have thought that the INNER JOIN would have been better.

  2. Hey Gail

    Nice entry! :-)

    However, IN is a slightly dangerous when using another table within the sub-query.

    As long as there are no criteria put from outside in, it works like an EXISTS() and creates a fine semi-join.

    If there are any parameters put from outside into the sub-query, IN() ends up in a row based execution (in my tests). Using EXISTS() seems to stay with a semi-join, with or without providing parameters to the sub-query.

    Did my investigations after a post of Paul White about EXISTS() here:
    http://www.sqlservercentral.com/Forums/Topic815335-338-1.aspx
    … and further while a discussion on a (German) developer board here:
    http://www.mycsharp.de/wbb2/thread.php?threadid=79251

    Thanks for sharing
    Flo

  3. Curious you didn’t also check the performance of EXISTS and a sub-query, which I’ve found can often out-perform both IN and EXISTS.

  4. Sorry Aaron, my bad. However I did compare IN and EXISTS a while back and found them absolutely identical in plan and performance.

    http://sqlinthewild.co.za/index.php/2009/08/17/exists-vs-in/

  5. Jack, has to do with the joins used. I’m not sure why the inner join went for the hash join when both results sets were in order. I would have expected a merge join. Gut feel, if it had been a merge join, performance would have been identical. That’s why I’m hesitant to conclude that IN is faster.

    Might be an interesting experiment to try this again, when there’s a foreign key constraint in place. I suspect it’ll make a difference.

  6. Gail, I was going to disagree with you until the very end of the article and read:

    “If all you need is to check for matching rows in the other table but don’t need any columns from that table, use IN. If you do need columns from the second table, use Inner Join.”

    I just had to rewrite a query for a developer where the IN took 4 min to return the same columns that my JOIN did in sub 1 sec. I think that is the key. When you have to do more than return matching rows the JOIN should start to outperform IN.

    You did a great job explaining everything. I am looking forward to the articles you have planned.

  7. Florian, I think the key is (as always) to keep it as simple as possible. IN is for single column comparisons against a sub query. When there are multiple columns that need to be compared between the outer and the subquery, that’s when EXISTS should probably be used.

  8. I wrote about the same topic some time ago, used quite large datasets for my tests. Read it here http://baldy.co.za/post/2009/04/23/sql-in-vs-join/#more-13

  9. Gail, I think IN vs. EXISTS yield the same performance in specific cases in more recent versions of the engine. I’m not sure if it came about in a 2005 SP or in 2008, but essentially the optimizer will treat an IN() like an EXISTS() if it can, and so you will see an identical plan. If you go back to an older version (or if the criteria makes this re-write impossible), the optimizer won’t do this.

  10. All my tests lately are SQL 2008. I don’t have anything earlier than 2005 SP3 installed any longer.

    I know I’m testing the simplistic cases, but I think they’re the more common scenario in real code.

  11. I think the following would be logically equivalent to the IN construction, but wonder if there is any performance gain/loss from it:
    SELECT ID FROM BigTable
    INNER JOIN
    (SELECT DISTINCT ID FROM SomeTable) SomeTable
    on BigTable.ID = SomeTable.ID

  12. Try it and aee?

    The execution plans are absolutely identical and the cpu times are extremely close.

  13. I suppose at some point the issue will turn to the human factor versus the performance factor. Is the code more legible, easier to read and understand? More portable to different platforms? All other things being equal, the EXISTS construction seems the more self-explanatory expression; the verb conveys more meaning to humans.

  14. I prefer EXISTS, for the readability, and for other reasons as well.

    I don’t buy the portability issue. Unless you’re an ISV writing an app for multiple database platforms, the likelyhood of having to move a app from one database platform to another is low, mainly because the cost of doing so it high (testing at the very least) and the gains are usually very low.

  15. One of the main reasons that I started this series is that I keep seeing people, on forums or discussion groups, saying things like ‘replace the IN with a JOIN, it performs better’ or ‘replace your subqueries with joins, they’re faster’ or ‘EXISTS uses a correlated subquery, hence it’s slower than IN’

    I wanted to get some tested performance stats so I could debunk or refute those claims that are made with no ground and no proof.

  16. Nice Article

  17. Thanks ! Wonderful explanation. This helped resolve an issue!

  18. Hello,

    I simple explanation about your tests. IN was better than INNER JOIN ONLY because the result set you had used in your example had few registers. If you work with a large number of registers, and your table have unique indexs, INNER JOIN is probably better.
    Try this out ;)

  19. You mean a large number of matches, so each row in the one table has lots and lots of rows in the second?

    In that case, you’d need to inner join and distinct to get rid of the duplicates that appear because of the full join. Adding a sort or hash to the execution plan is unlikely to make it faster.

  20. Thanks for sharing the info.
    Theres another form of in query, where the in clause actually has the values which are comma separated. In such cases I have found that as the number of records in the in-clause increase the innner join query proves to be exponentially better than the in query. The difference becomes obvious when the number of records cross 100.

  21. In with value list is not really equivalent to an inner join, unless you take the time to populate the values into a table. If you do that, then the IN with subquery will likely still be faster (a little) than the inner join.

  22. I agree, with the fact that if you populate the vaules in a table the subquery with IN will be slightly faster then an inner join. I was speaking from my experience where ‘a’ programmer, constructed the values which would form the part of the IN query by stringing together a comma seperated values (this method in itself is very inefficient). In such cases, if the programmer takes an approach of inserting such values in a table and doing an IN query, the performance will increase, as shown by you in this article.

  23. I think that this statement is wreong:
    —-quote
    DECLARE @SomeTable (IntCol int)
    Insert into @SomeTable (IntCol) Values (1)
    Insert into @SomeTable (IntCol) Values (2)
    Insert into @SomeTable (IntCol) Values (2)
    Insert into @SomeTable (IntCol) Values (3)
    Insert into @SomeTable (IntCol) Values (4)
    Insert into @SomeTable (IntCol) Values (5)
    Insert into @SomeTable (IntCol) Values (5)

    SELECT *
    FROM BigTable
    WHERE SomeColumn IN (Select IntCol FROM @SomeTable)

    This returns 5 rows and only columns from BigTable.
    —-end quote
    There’s no telling how many rows will be returned from BigTable – BigTable might have a 1000 rows where SomeColumn is 1, in which case the query will return at least 1000 rows; or it might have no rows with SomColumn less than 25, in which case it will return 0 rows. SOme people might be think you qactually meant what you wrote there, and be misled.

  24. I’ve added the missing definition for BigTable.

  25. [...] operadores en ese sistema gestor de base de datos. Sobre Sql Server existe una buena comparativa en esta página, la cual da como ganador a in versus inner join. Pero…y sobre Mysql? No he podido encontrar [...]

  26. Great article.
    A side issue, but may be useful to some. If you create views out of the two queries, EXISTS is still updatable, whereas LEFT OUTER JOIN
    give the error ‘…is not updatable because the modification affects multiple base tables’

    – Query 1
    CREATE VIEW JoinView AS
    SELECT BigTable.ID, SomeColumn
    FROM BigTable LEFT OUTER JOIN SmallerTable ON BigTable.SomeColumn = SmallerTable.LookupColumn
    WHERE LookupColumn IS NULL
    GO
    – Query 2
    CREATE VIEW ExistsView AS
    SELECT ID, SomeColumn FROM BigTable
    WHERE NOT EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)
    GO
    DELETE FROM ExistsView WHERE ID=123
    GO
    DELETE FROM JoinView WHERE ID=123

  27. What should I use between two type of code:
    *** B is A’s parent and I want to delete some rows of A

    DELETE A FROM A INNER JOIN B ON A.ID = B.ID
    WHERE B.ReFID IN
    (SELECT ID FROM C WHERE C.DATE >= CONVERT(DATETIME,@DATE,103)
    UNION
    SELECT ID FROM D WHERE D.DATE >= CONVERT(DATETIME,@DATE,103)
    UNION
    SELECT ID FROM E WHERE E.DATE >= CONVERT(DATETIME,@DATE,103)) AS CDE

    OR
    DELETE A FROM A INNER JOIN B ON A.ID = B.ID
    INNER JOIN
    (SELECT ID FROM C WHERE C.DATE >= CONVERT(DATETIME,@DATE,103)
    UNION
    SELECT ID FROM D WHERE D.DATE >= CONVERT(DATETIME,@DATE,103)
    UNION
    SELECT ID FROM E WHERE E.DATE >= CONVERT(DATETIME,@DATE,103)) AS CDE ON B.RefID = CDE.ID

  28. Either should work.

Leave a Reply