<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>SQL in the Wild &#187; T-SQL</title>
	<atom:link href="http://sqlinthewild.co.za/index.php/category/sql-server/t-sql/feed/" rel="self" type="application/rss+xml" />
	<link>http://sqlinthewild.co.za</link>
	<description>A discussion on SQL Server</description>
	<lastBuildDate>Tue, 31 Aug 2010 17:26:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>In, Exists and join &#8211; a roundup</title>
		<link>http://sqlinthewild.co.za/index.php/2010/04/27/in-exists-and-join-a-roundup/</link>
		<comments>http://sqlinthewild.co.za/index.php/2010/04/27/in-exists-and-join-a-roundup/#comments</comments>
		<pubDate>Tue, 27 Apr 2010 16:30:47 +0000</pubDate>
		<dc:creator>Gail</dc:creator>
				<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[Syndication]]></category>
		<category><![CDATA[T-SQL]]></category>

		<guid isPermaLink="false">http://sqlinthewild.co.za/?p=586</guid>
		<description><![CDATA[Over the last several months I&#8217;ve had a look at IN, Exists, Join and their opposites to see how they perform and whether there&#8217;s any truth in the advice that is often seen on forums and blogs advocating replacing one with the other. Previous parts of this series can be found: Exists vs In In [...]]]></description>
			<content:encoded><![CDATA[<p>Over the last several months I&#8217;ve had a look at IN, Exists, Join and their opposites to see how they perform and whether there&#8217;s any truth in the advice that is often seen on forums and blogs advocating replacing one with the other.</p>
<p>Previous parts of this series can be found:</p>
<ul>
<li><a href="http://sqlinthewild.co.za/index.php/2009/08/17/exists-vs-in/">Exists vs In</a></li>
<li><a href="http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/">In vs Inner Join</a></li>
<li><a href="http://sqlinthewild.co.za/index.php/2010/02/18/not-exists-vs-not-in/">Not Exists vs Not In</a></li>
<li><a href="http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/">Not Exists vs Left Outer Join … Is Null</a></li>
</ul>
<p>In this roundup post, I&#8217;m going to do multiple tests on the 6 query forms, with different numbers of rows, indexes, no indexes and, for the negative forms (NOT IN, NOT EXISTS), nullable and non-nullable join columns.</p>
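<p>For reference, the six query forms look roughly like the sketch below. It uses two small table variables (hypothetical miniature stand-ins for the BigTable and SmallerTable used in the earlier posts), so it can be run on its own; the exact statements used for the timings are in the reproduction scripts and may differ in detail.</p>
<pre class="brush: sql;">-- Hypothetical miniature versions of the test tables
DECLARE @BigTable TABLE (ID INT IDENTITY PRIMARY KEY, SomeColumn CHAR(4) NOT NULL)
DECLARE @SmallerTable TABLE (ID INT IDENTITY PRIMARY KEY, LookupColumn CHAR(4) NOT NULL)

INSERT INTO @BigTable (SomeColumn) VALUES ('AAAA'), ('BBBB'), ('CCCC')
INSERT INTO @SmallerTable (LookupColumn) VALUES ('AAAA'), ('CCCC')

-- Forms 1 to 3 look for matching rows (AAAA and CCCC here)
-- 1. IN
SELECT b.ID, b.SomeColumn FROM @BigTable b
WHERE b.SomeColumn IN (SELECT s.LookupColumn FROM @SmallerTable s)

-- 2. EXISTS
SELECT b.ID, b.SomeColumn FROM @BigTable b
WHERE EXISTS (SELECT 1 FROM @SmallerTable s WHERE s.LookupColumn = b.SomeColumn)

-- 3. INNER JOIN (same rows only because LookupColumn has no duplicates)
SELECT b.ID, b.SomeColumn
	FROM @BigTable b INNER JOIN @SmallerTable s ON b.SomeColumn = s.LookupColumn

-- Forms 4 to 6 look for non-matching rows (BBBB here)
-- 4. NOT IN
SELECT b.ID, b.SomeColumn FROM @BigTable b
WHERE b.SomeColumn NOT IN (SELECT s.LookupColumn FROM @SmallerTable s)

-- 5. NOT EXISTS
SELECT b.ID, b.SomeColumn FROM @BigTable b
WHERE NOT EXISTS (SELECT 1 FROM @SmallerTable s WHERE s.LookupColumn = b.SomeColumn)

-- 6. LEFT OUTER JOIN ... IS NULL
SELECT b.ID, b.SomeColumn
	FROM @BigTable b LEFT OUTER JOIN @SmallerTable s ON b.SomeColumn = s.LookupColumn
	WHERE s.LookupColumn IS NULL</pre>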
<p>In the individual tests, I used 250000 rows in the first table and around 3000 rows in the secondary table. In this roundup, I&#8217;m going to use 3 different row counts, 1000000 rows, 50000 rows and 2500 rows. That should give a reasonable idea for performance at various table sizes. (Not much point in going smaller than 2500 rows. Everything&#8217;s fast on 100 rows)</p>
<p>Some notes on the tests.</p>
<ul>
<li>The version of SQL is SQL Server 2008 SP1 x64 Developer Edition.</li>
<li>The tests were run on a laptop. Core-2 Duo, 3 GB memory. SQL limited to 1 processor, so no parallelism possible.</li>
<li>Each query will be run 10 times, with reads, CPU and duration measured by Profiler and averaged.</li>
<li>Each query will be run once before the tests start to ensure that the data is in cache and the execution plans are generated and cached.</li>
<li>Reproduction scripts will be available for download.</li>
</ul>
<h3><span id="more-586"></span>Exists vs. In vs. Inner Join</h3>
<p>First, no indexes on the join columns</p>
<table border="1" cellspacing="0" cellpadding="2">
<tbody>
<tr>
<td valign="top"><strong>Table Size</strong></td>
<td valign="top"><strong>Operator</strong></td>
<td valign="top"><strong>CPU</strong></td>
<td valign="top"><strong>Reads </strong></td>
<td valign="top"><strong>Duration</strong></td>
</tr>
<tr>
<td rowspan="3">Large</td>
<td valign="top">IN</td>
<td valign="top">1293</td>
<td valign="top">14585</td>
<td valign="top">9649</td>
</tr>
<tr>
<td valign="top">Exists</td>
<td valign="top">1260</td>
<td valign="top">14585</td>
<td valign="top">9573</td>
</tr>
<tr>
<td valign="top">Inner Join</td>
<td valign="top">1302</td>
<td valign="top">14585</td>
<td valign="top">9716</td>
</tr>
<tr>
<td rowspan="3">Medium</td>
<td valign="top">IN</td>
<td valign="top">59</td>
<td valign="top">747</td>
<td valign="top">538</td>
</tr>
<tr>
<td valign="top">Exists</td>
<td valign="top">78</td>
<td valign="top">747</td>
<td valign="top">574</td>
</tr>
<tr>
<td valign="top">Inner Join</td>
<td valign="top">69</td>
<td valign="top">747</td>
<td valign="top">523</td>
</tr>
<tr>
<td rowspan="3">Small</td>
<td valign="top">IN</td>
<td valign="top">7</td>
<td valign="top">41</td>
<td valign="top">65</td>
</tr>
<tr>
<td valign="top">Exists</td>
<td valign="top">3</td>
<td valign="top">41</td>
<td valign="top">91</td>
</tr>
<tr>
<td valign="top">Inner Join</td>
<td valign="top">4</td>
<td valign="top">41</td>
<td valign="top">65</td>
</tr>
</tbody>
</table>
<p>Now with indexes on the join columns</p>
<table border="1" cellspacing="0" cellpadding="2">
<tbody>
<tr>
<td valign="top"><strong>Table Size</strong></td>
<td valign="top"><strong>Operator</strong></td>
<td valign="top"><strong>CPU</strong></td>
<td valign="top"><strong>Reads </strong></td>
<td valign="top"><strong>Duration </strong></td>
</tr>
<tr>
<td rowspan="3">Large</td>
<td valign="top">IN</td>
<td valign="top">973</td>
<td valign="top">1760</td>
<td valign="top">9707</td>
</tr>
<tr>
<td valign="top">Exists</td>
<td valign="top">956</td>
<td valign="top">1760</td>
<td valign="top">9483</td>
</tr>
<tr>
<td valign="top">Inner Join</td>
<td valign="top">1173</td>
<td valign="top">1760</td>
<td valign="top">9539</td>
</tr>
<tr>
<td rowspan="3">Medium</td>
<td valign="top">IN</td>
<td valign="top">43</td>
<td valign="top">100</td>
<td valign="top">516</td>
</tr>
<tr>
<td valign="top">Exists</td>
<td valign="top">53</td>
<td valign="top">100</td>
<td valign="top">548</td>
</tr>
<tr>
<td valign="top">Inner Join</td>
<td valign="top">59</td>
<td valign="top">100</td>
<td valign="top">498</td>
</tr>
<tr>
<td rowspan="3">Small</td>
<td valign="top">IN</td>
<td valign="top">3</td>
<td valign="top">9</td>
<td valign="top">64</td>
</tr>
<tr>
<td valign="top">Exists</td>
<td valign="top">1</td>
<td valign="top">9</td>
<td valign="top">80</td>
</tr>
<tr>
<td valign="top">Inner Join</td>
<td valign="top">4</td>
<td valign="top">9</td>
<td valign="top">67</td>
</tr>
</tbody>
</table>
<h3>Not Exists vs. Not In vs. Left Outer Join &#8230; Is Null</h3>
<p>First test with the join columns nullable, no indexes</p>
<table border="1" cellspacing="0" cellpadding="2">
<tbody>
<tr>
<td valign="top"><strong>Table Size</strong></td>
<td valign="top"><strong>Operator</strong></td>
<td valign="top"><strong>CPU</strong></td>
<td valign="top"><strong>Reads </strong></td>
<td valign="top"><strong>Duration </strong></td>
</tr>
<tr>
<td rowspan="3">Large</td>
<td valign="top">NOT IN</td>
<td valign="top">3194</td>
<td valign="top">2014622</td>
<td valign="top">3251</td>
</tr>
<tr>
<td valign="top">NOT Exists</td>
<td valign="top">820</td>
<td valign="top">14585</td>
<td valign="top">837</td>
</tr>
<tr>
<td valign="top">Outer Join</td>
<td valign="top">962</td>
<td valign="top">14585</td>
<td valign="top">1025</td>
</tr>
<tr>
<td rowspan="3">Medium</td>
<td valign="top">NOT IN</td>
<td valign="top">174</td>
<td valign="top">100765</td>
<td valign="top">217</td>
</tr>
<tr>
<td valign="top">NOT Exists</td>
<td valign="top">54</td>
<td valign="top">747</td>
<td valign="top">121</td>
</tr>
<tr>
<td valign="top">Outer Join</td>
<td valign="top">53</td>
<td valign="top">747</td>
<td valign="top">79</td>
</tr>
<tr>
<td rowspan="3">Small</td>
<td valign="top">NOT IN</td>
<td valign="top">12</td>
<td valign="top">5043</td>
<td valign="top">13</td>
</tr>
<tr>
<td valign="top">NOT Exists</td>
<td valign="top">4</td>
<td valign="top">41</td>
<td valign="top">6</td>
</tr>
<tr>
<td valign="top">Outer Join</td>
<td valign="top">3</td>
<td valign="top">41</td>
<td valign="top">5</td>
</tr>
</tbody>
</table>
<p>Then with join columns nullable with indexes</p>
<table border="1" cellspacing="0" cellpadding="2">
<tbody>
<tr>
<td valign="top"><strong>Table Size</strong></td>
<td valign="top"><strong>Operator</strong></td>
<td valign="top"><strong>CPU</strong></td>
<td valign="top"><strong>Reads </strong></td>
<td valign="top"><strong>Duration </strong></td>
</tr>
<tr>
<td rowspan="3">Large</td>
<td valign="top">NOT IN</td>
<td valign="top">2677</td>
<td valign="top">2001762</td>
<td valign="top">2726</td>
</tr>
<tr>
<td valign="top">NOT Exists</td>
<td valign="top">569</td>
<td valign="top">1760</td>
<td valign="top">586</td>
</tr>
<tr>
<td valign="top">Outer Join</td>
<td valign="top">949</td>
<td valign="top">1760</td>
<td valign="top">1029</td>
</tr>
<tr>
<td rowspan="3">Medium</td>
<td valign="top">NOT IN</td>
<td valign="top">137</td>
<td valign="top">100102</td>
<td valign="top">164</td>
</tr>
<tr>
<td valign="top">NOT Exists</td>
<td valign="top">40</td>
<td valign="top">100</td>
<td valign="top">104</td>
</tr>
<tr>
<td valign="top">Outer Join</td>
<td valign="top">48</td>
<td valign="top">100</td>
<td valign="top">69</td>
</tr>
<tr>
<td rowspan="3">Small</td>
<td valign="top">NOT IN</td>
<td valign="top">11</td>
<td valign="top">5011</td>
<td valign="top">12</td>
</tr>
<tr>
<td valign="top">NOT Exists</td>
<td valign="top">3</td>
<td valign="top">9</td>
<td valign="top">4</td>
</tr>
<tr>
<td valign="top">Outer Join</td>
<td valign="top">6</td>
<td valign="top">9</td>
<td valign="top">6</td>
</tr>
</tbody>
</table>
<p>Now, let&#8217;s make the join columns not nullable. Again, no indexes to start with.</p>
<table border="1" cellspacing="0" cellpadding="2">
<tbody>
<tr>
<td valign="top"><strong>Table Size</strong></td>
<td valign="top"><strong>Operator</strong></td>
<td valign="top"><strong>CPU</strong></td>
<td valign="top"><strong>Reads </strong></td>
<td valign="top"><strong>Duration </strong></td>
</tr>
<tr>
<td rowspan="3">Large</td>
<td valign="top">NOT IN</td>
<td valign="top">741</td>
<td valign="top">14585</td>
<td valign="top">753</td>
</tr>
<tr>
<td valign="top">NOT Exists</td>
<td valign="top">784</td>
<td valign="top">14585</td>
<td valign="top">790</td>
</tr>
<tr>
<td valign="top">Outer Join</td>
<td valign="top">884</td>
<td valign="top">14585</td>
<td valign="top">937</td>
</tr>
<tr>
<td rowspan="3">Medium</td>
<td valign="top">NOT IN</td>
<td valign="top">43</td>
<td valign="top">747</td>
<td valign="top">103</td>
</tr>
<tr>
<td valign="top">NOT Exists</td>
<td valign="top">49</td>
<td valign="top">747</td>
<td valign="top">120</td>
</tr>
<tr>
<td valign="top">Outer Join</td>
<td valign="top">53</td>
<td valign="top">747</td>
<td valign="top">74</td>
</tr>
<tr>
<td rowspan="3">Small</td>
<td valign="top">NOT IN</td>
<td valign="top">4</td>
<td valign="top">41</td>
<td valign="top">4</td>
</tr>
<tr>
<td valign="top">NOT Exists</td>
<td valign="top">1</td>
<td valign="top">41</td>
<td valign="top">5</td>
</tr>
<tr>
<td valign="top">Outer Join</td>
<td valign="top">1</td>
<td valign="top">41</td>
<td valign="top">5</td>
</tr>
</tbody>
</table>
<p>and finally, join columns not nullable, with indexes</p>
<table border="1" cellspacing="0" cellpadding="2">
<tbody>
<tr>
<td valign="top"><strong>Table Size</strong></td>
<td valign="top"><strong>Operator</strong></td>
<td valign="top"><strong>CPU</strong></td>
<td valign="top"><strong>Reads </strong></td>
<td valign="top"><strong>Duration </strong></td>
</tr>
<tr>
<td rowspan="3">Large</td>
<td valign="top">NOT IN</td>
<td valign="top">578</td>
<td valign="top">1382</td>
<td valign="top">588</td>
</tr>
<tr>
<td valign="top">NOT Exists</td>
<td valign="top">585</td>
<td valign="top">1382</td>
<td valign="top">597</td>
</tr>
<tr>
<td valign="top">Outer Join</td>
<td valign="top">953</td>
<td valign="top">1382</td>
<td valign="top">1006</td>
</tr>
<tr>
<td rowspan="3">Medium</td>
<td valign="top">NOT IN</td>
<td valign="top">37</td>
<td valign="top">80</td>
<td valign="top">79</td>
</tr>
<tr>
<td valign="top">NOT Exists</td>
<td valign="top">34</td>
<td valign="top">80</td>
<td valign="top">79</td>
</tr>
<tr>
<td valign="top">Outer Join</td>
<td valign="top">39</td>
<td valign="top">80</td>
<td valign="top">84</td>
</tr>
<tr>
<td rowspan="3">Small</td>
<td valign="top">NOT IN</td>
<td valign="top">3</td>
<td valign="top">8</td>
<td valign="top">4</td>
</tr>
<tr>
<td valign="top">NOT Exists</td>
<td valign="top">1</td>
<td valign="top">8</td>
<td valign="top">5</td>
</tr>
<tr>
<td valign="top">Outer Join</td>
<td valign="top">4</td>
<td valign="top">8</td>
<td valign="top">5</td>
</tr>
</tbody>
</table>
<p>These results seem to pretty much confirm the earlier conclusions.</p>
<p>Exists and IN perform much the same, whether there are indexes on the join column or not. When there are indexes on the join columns, the INNER JOIN is slightly (very slightly) slower, which is more noticeable on the large tables, much less on the medium or small ones. (Note I&#8217;m mostly looking at CPU time, as the duration is also affected by sending of results to client, in this case, lots and lots of results)</p>
<p>When it comes to NOT In and NOT Exists they perform much the same when the columns involved are not nullable. If the columns are nullable, Not In is significantly slower because it has a different behaviour when nulls are present.</p>
<p>The join is slightly slower than Not Exists (or Not In on non-nullable columns), again only noticeable on the large table, probably because the optimiser has to do a full join with a secondary filter rather than the anti-semi join that it can use for Not Exists and Not In.</p>
<p>My conclusion from earlier posts stands. If all you are doing is looking for matching or non-matching rows and you don&#8217;t need any columns from the second table, use IN or Exists (or their negations), as appropriate for the situation. Only when you need columns from the second table should Join be used.</p>
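<p>As a small illustration of that guideline, here is a sketch using two hypothetical table variables (not the test tables themselves). The second table deliberately holds a duplicate lookup value, to show that EXISTS only tests for a match while a join actually brings the rows in.</p>
<pre class="brush: sql;">-- Hypothetical miniature tables, for illustration only
DECLARE @BigTable TABLE (ID INT IDENTITY PRIMARY KEY, SomeColumn CHAR(4) NOT NULL)
DECLARE @SmallerTable TABLE (ID INT IDENTITY PRIMARY KEY, LookupColumn CHAR(4) NOT NULL, SomeArbDate DATETIME DEFAULT GETDATE())

INSERT INTO @BigTable (SomeColumn) VALUES ('AAAA'), ('BBBB')
INSERT INTO @SmallerTable (LookupColumn) VALUES ('AAAA'), ('AAAA')  -- duplicate on purpose

-- Only checking for a match: EXISTS returns the AAAA row once
SELECT b.ID, b.SomeColumn FROM @BigTable b
WHERE EXISTS (SELECT 1 FROM @SmallerTable s WHERE s.LookupColumn = b.SomeColumn)

-- A column from the second table is needed, so a join is the right tool.
-- Note that the AAAA row now comes back twice, once per matching row.
SELECT b.ID, b.SomeColumn, s.SomeArbDate
	FROM @BigTable b INNER JOIN @SmallerTable s ON b.SomeColumn = s.LookupColumn</pre>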
<p>I think (and hope) that this adequately concludes the discussion on the Exists and In and joins, both behaviour and performance.</p>
<p><a href="http://sqlinthewild.co.za/wp-content/uploads/2010/04/Reproduction-scripts.zip">Reproduction scripts</a></p>
]]></content:encoded>
			<wfw:commentRss>http://sqlinthewild.co.za/index.php/2010/04/27/in-exists-and-join-a-roundup/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Left outer join vs NOT EXISTS</title>
		<link>http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/</link>
		<comments>http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/#comments</comments>
		<pubDate>Tue, 23 Mar 2010 14:00:58 +0000</pubDate>
		<dc:creator>Gail</dc:creator>
				<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[Syndication]]></category>
		<category><![CDATA[T-SQL]]></category>

		<guid isPermaLink="false">http://sqlinthewild.co.za/?p=575</guid>
		<description><![CDATA[And to wrap up the miniseries on IN, EXISTS and JOIN, a look at NOT EXISTS and LEFT OUTER JOIN for finding non-matching rows. For previous parts, see In vs Exists In vs Inner Join Not in vs Not Exists I&#8217;m looking at NOT EXISTS and LEFT OUTER JOIN, as opposed to NOT IN and [...]]]></description>
			<content:encoded><![CDATA[<p>And to wrap up the miniseries on IN, EXISTS and JOIN, a look at NOT EXISTS and LEFT OUTER JOIN for finding non-matching rows.</p>
<p>For previous parts, see</p>
<ul>
<li><a href="http://sqlinthewild.co.za/index.php/2009/08/17/exists-vs-in/">In vs Exists</a></li>
<li><a href="http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/">In vs Inner Join</a></li>
<li><a href="http://sqlinthewild.co.za/index.php/2010/02/18/not-exists-vs-not-in/">Not in vs Not Exists</a></li>
</ul>
<p>I&#8217;m looking at NOT EXISTS and LEFT OUTER JOIN, as opposed to NOT IN and LEFT OUTER JOIN, because, as shown in the <a href="http://sqlinthewild.co.za/index.php/2010/02/18/not-exists-vs-not-in/">previous part</a> of this series, NOT IN behaves badly in the presence of NULLs. Specifically, if there are any NULLs in the subquery&#8217;s result set, NOT IN returns 0 matches.</p>
<p>The LEFT OUTER JOIN, like the NOT EXISTS, can handle NULLs in the second result set without automatically returning no matches. It behaves the same regardless of whether the join columns are nullable or not. Seeing as NULL does not equal anything, any rows in the second result set that have NULL for the join column are eliminated by the join and have no further effect on the query.</p>
<p>It is important, when using the LEFT OUTER JOIN … IS NULL, to carefully pick the column used for the IS NULL check. It should either be a non-nullable column (the primary key is a somewhat classical choice) or the join column (as nulls in that will be eliminated by the join).</p>
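<p>A minimal sketch of why the choice of column matters, using hypothetical table variables rather than the test tables. The one matching row in the second table has a NULL in a nullable, non-join column, so checking that column for IS NULL wrongly reports the matched row as unmatched.</p>
<pre class="brush: sql;">-- Hypothetical miniature tables, for illustration only
DECLARE @BigTable TABLE (ID INT IDENTITY PRIMARY KEY, SomeColumn CHAR(4) NOT NULL)
DECLARE @SmallerTable TABLE (ID INT IDENTITY PRIMARY KEY, LookupColumn CHAR(4) NULL, SomeArbDate DATETIME NULL)

INSERT INTO @BigTable (SomeColumn) VALUES ('AAAA'), ('BBBB')
-- AAAA has a match, but the matching row has no date
INSERT INTO @SmallerTable (LookupColumn, SomeArbDate) VALUES ('AAAA', NULL)

-- Wrong column: SomeArbDate is nullable, so the matched AAAA row also passes the check.
-- Returns AAAA and BBBB.
SELECT b.ID, b.SomeColumn
	FROM @BigTable b LEFT OUTER JOIN @SmallerTable s ON b.SomeColumn = s.LookupColumn
	WHERE s.SomeArbDate IS NULL

-- Join column (the primary key s.ID would work too): NULL only when there was no match.
-- Returns BBBB only.
SELECT b.ID, b.SomeColumn
	FROM @BigTable b LEFT OUTER JOIN @SmallerTable s ON b.SomeColumn = s.LookupColumn
	WHERE s.LookupColumn IS NULL</pre>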
<p>Onto the tests</p>
<p>The usual test tables…</p>
<pre class="brush: sql;">
CREATE TABLE BigTable (
id INT IDENTITY PRIMARY KEY,
SomeColumn char(4) NOT NULL,
Filler CHAR(100)
)

CREATE TABLE SmallerTable (
id INT IDENTITY PRIMARY KEY,
LookupColumn char(4) NOT NULL,
SomeArbDate Datetime default getdate()
)

INSERT INTO BigTable (SomeColumn)
SELECT top 250000
char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) +
char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
from master.sys.columns a cross join master.sys.columns b

INSERT INTO SmallerTable (LookupColumn)
SELECT DISTINCT SomeColumn
FROM BigTable TABLESAMPLE (25 PERCENT)
-- (3918 row(s) affected)
</pre>
<p>First without indexes</p>
<pre class="brush: sql;">-- Query 1
SELECT BigTable.ID, SomeColumn
	FROM BigTable LEFT OUTER JOIN SmallerTable ON BigTable.SomeColumn = SmallerTable.LookupColumn
	WHERE LookupColumn IS NULL

-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE NOT EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)
</pre>
<p>Let&#8217;s take a look at the execution plans</p>
<p><a href="http://sqlinthewild.co.za/wp-content/uploads/2010/03/LeftOuterJoinNotIN_NotIndexed.png"><img style="border-bottom: 0px; border-left: 0px; display: inline; border-top: 0px; border-right: 0px" title="LeftOuterJoinNotIN_NotIndexed" src="http://sqlinthewild.co.za/wp-content/uploads/2010/03/LeftOuterJoinNotIN_NotIndexed_thumb.png" border="0" alt="LeftOuterJoinNotIN_NotIndexed" width="244" height="143" /></a></p>
<p><span id="more-575"></span></p>
<p>The plans are almost the same. There&#8217;s an extra filter in the JOIN and the logical join types are different. Why the different joins?</p>
<p>If we look at the execution plan for the NOT EXISTS, the join type is Right Anti-Semi join (a bit of a mouthful). This is a special join type used by the NOT EXISTS and NOT IN and it&#8217;s the opposite of the semi-join that I discussed back when I looked at <a href="http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/">IN and INNER JOIN</a>.</p>
<p>An anti-semi join is a partial join. It does not actually join rows in from the second table, it simply checks for, in this case, the absence of matches. That&#8217;s why it&#8217;s an <strong>anti</strong>-semi join. A semi-join checks for matches, an anti-semi join does the opposite and checks for the absence of matches.</p>
<p>The extra filter in the LEFT OUTER JOIN query is because the join in that execution plan is a complete right join, i.e. it&#8217;s returned matching rows (and possibly duplicates) from the second table. The filter operator is doing the IS NULL filter.</p>
<p>That&#8217;s the major difference between these two. When using the LEFT OUTER JOIN … IS NULL technique, SQL can&#8217;t tell that you&#8217;re only doing a check for nonexistence. Optimiser&#8217;s not smart enough (yet). Hence it does the complete join and then filters. The NOT EXISTS filters as part of the join.</p>
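<p>For anyone who prefers text plans to the graphical ones, the logical join types can also be inspected with SET SHOWPLAN_TEXT. This is only a sketch of one way to look at the plans, and it assumes the BigTable and SmallerTable created above; GO is the batch separator used by Management Studio and sqlcmd, not a T-SQL statement.</p>
<pre class="brush: sql;">-- SET SHOWPLAN_TEXT has to be the only statement in its batch
SET SHOWPLAN_TEXT ON
GO

-- While the setting is on, the queries are compiled but not executed;
-- the estimated plan comes back as text, one line per operator,
-- with the logical join type named in the join operator's line.

-- Query 1
SELECT BigTable.ID, SomeColumn
	FROM BigTable LEFT OUTER JOIN SmallerTable ON BigTable.SomeColumn = SmallerTable.LookupColumn
	WHERE LookupColumn IS NULL

-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE NOT EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)
GO

SET SHOWPLAN_TEXT OFF
GO</pre>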
<p>Technical discussion done, now how did they actually perform?</p>
<blockquote><p>-- Query 1: LEFT OUTER JOIN<br />
Table &#8216;Worktable&#8217;. Scan count 0, logical reads 0, physical reads 0.<br />
Table &#8216;BigTable&#8217;. Scan count 1, logical reads 3639, physical reads 0.<br />
Table &#8216;SmallerTable&#8217;. Scan count 1, logical reads 15, physical reads 0.</p>
<p>SQL Server Execution Times:<br />
CPU time = 157 ms,  elapsed time = 486 ms.</p>
<p>-- Query 2: NOT EXISTS<br />
Table &#8216;Worktable&#8217;. Scan count 0, logical reads 0, physical reads 0.<br />
Table &#8216;BigTable&#8217;. Scan count 1, logical reads 3639, physical reads 0.<br />
Table &#8216;SmallerTable&#8217;. Scan count 1, logical reads 15, physical reads 0.</p>
<p>SQL Server Execution Times:<br />
CPU time = 156 ms,  elapsed time = 358 ms.</p></blockquote>
<p>Can&#8217;t make a big deal out of that.</p>
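<p>Output in the format shown above is what SET STATISTICS IO and SET STATISTICS TIME write to the Messages tab. A sketch for collecting the same figures, assuming the BigTable and SmallerTable created earlier in this post:</p>
<pre class="brush: sql;">-- Report logical/physical reads per table and CPU/elapsed time per statement
SET STATISTICS IO ON
SET STATISTICS TIME ON

-- Query 1
SELECT BigTable.ID, SomeColumn
	FROM BigTable LEFT OUTER JOIN SmallerTable ON BigTable.SomeColumn = SmallerTable.LookupColumn
	WHERE LookupColumn IS NULL

-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE NOT EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

-- Switch the reporting off again for the rest of the session
SET STATISTICS IO OFF
SET STATISTICS TIME OFF</pre>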
<p>Now, index on the join columns</p>
<pre class="brush: sql;">CREATE INDEX idx_BigTable_SomeColumn
ON BigTable (SomeColumn)

CREATE INDEX idx_SmallerTable_LookupColumn
ON SmallerTable (LookupColumn)</pre>
<p>and the same queries</p>
<p><a href="http://sqlinthewild.co.za/wp-content/uploads/2010/03/LeftOuterJoinNotIN_Indexed.png"><img style="border-bottom: 0px; border-left: 0px; display: inline; border-top: 0px; border-right: 0px" title="LeftOuterJoinNotIN_Indexed" src="http://sqlinthewild.co.za/wp-content/uploads/2010/03/LeftOuterJoinNotIN_Indexed_thumb.png" border="0" alt="LeftOuterJoinNotIN_Indexed" width="244" height="130" /></a></p>
<p>With indexes added, the execution plans are even more different. The LEFT OUTER JOIN is still doing the complete outer join with a filter afterwards. It&#8217;s interesting to note that it&#8217;s still a hash join, even though both inputs are sorted in the order of the join keys.</p>
<p>The Not Exists now has a stream aggregate (because duplicate values are irrelevant for an EXISTS/NOT EXISTS) and an anti-semi join. The join here is no longer hash, it&#8217;s now a merge join.</p>
<p>This echoes what I found when looking at <a href="http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/">IN vs Inner join</a>. When the columns were indexed, the inner join still went for a hash join but the IN changed to a merge join. At the time, I thought it to be a fluke, I&#8217;m not so sure any longer. More tests on this are required…</p>
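<p>One possible way to run such a test, offered as a suggestion and not something measured in this post, is to force the physical join type with a query hint and compare the statistics against the plan the optimiser picks by itself. The sketch assumes the indexed BigTable and SmallerTable from above.</p>
<pre class="brush: sql;">SET STATISTICS IO ON
SET STATISTICS TIME ON

-- Optimiser's own choice of physical join
SELECT BigTable.ID, SomeColumn
	FROM BigTable LEFT OUTER JOIN SmallerTable ON BigTable.SomeColumn = SmallerTable.LookupColumn
	WHERE LookupColumn IS NULL

-- Same query, with every join in it forced to be a merge join.
-- A hint overrides the optimiser, so this is for experiments only.
SELECT BigTable.ID, SomeColumn
	FROM BigTable LEFT OUTER JOIN SmallerTable ON BigTable.SomeColumn = SmallerTable.LookupColumn
	WHERE LookupColumn IS NULL
	OPTION (MERGE JOIN)

SET STATISTICS IO OFF
SET STATISTICS TIME OFF</pre>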
<p>The costing of the plans indicates that the optimiser believes that the LEFT OUTER JOIN form is more expensive. Do the execution stats carry the same conclusion?</p>
<blockquote><p>-- Query 1: LEFT OUTER JOIN<br />
Table &#8216;Worktable&#8217;. Scan count 0, logical reads 0, physical reads 0.<br />
Table &#8216;BigTable&#8217;. Scan count 1, logical reads 342, physical reads 0.<br />
Table &#8216;SmallerTable&#8217;. Scan count 1, logical reads 8, physical reads 0.</p>
<p>SQL Server Execution Times:<br />
CPU time = 172 ms,  elapsed time = 686 ms.</p>
<p>-- Query 2: NOT EXISTS<br />
Table &#8216;BigTable&#8217;. Scan count 1, logical reads 342, physical reads 0.<br />
Table &#8216;SmallerTable&#8217;. Scan count 1, logical reads 8, physical reads 0.</p>
<p>SQL Server Execution Times:<br />
CPU time = 78 ms,  elapsed time = 388 ms.</p></blockquote>
<p>Well, yes, they do.</p>
<p>The reads (ignoring the existence of the worktable for the hash join) are the same. That&#8217;s to be expected, both queries executed with a single scan of each index.</p>
<p>The CPU time figures are not. The CPU time of the LEFT OUTER JOIN form is more than twice that of the NOT EXISTS (172 ms against 78 ms).</p>
<h3>In conclusion…</h3>
<p>If you need to find rows that don&#8217;t have a match in a second table, and the columns are nullable, use NOT EXISTS. If you need to find rows that don&#8217;t have a match in a second table, and the columns are not nullable, use NOT EXISTS or NOT IN.</p>
<p>The LEFT OUTER JOIN … IS NULL method is slower when the columns are indexed and it&#8217;s perhaps not as clear what&#8217;s happening. It&#8217;s reasonably clear what a NOT EXISTS predicate does; with LEFT OUTER JOIN it&#8217;s not immediately clear that it&#8217;s a check for non-matching rows, especially if there are several where clause predicates.</p>
<p>I think that&#8217;s about that for this series. I&#8217;m going to do one more post summarising all the findings, probably in a week or two.</p>
]]></content:encoded>
			<wfw:commentRss>http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>NOT EXISTS vs NOT IN</title>
		<link>http://sqlinthewild.co.za/index.php/2010/02/18/not-exists-vs-not-in/</link>
		<comments>http://sqlinthewild.co.za/index.php/2010/02/18/not-exists-vs-not-in/#comments</comments>
		<pubDate>Thu, 18 Feb 2010 14:00:32 +0000</pubDate>
		<dc:creator>Gail</dc:creator>
				<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[Syndication]]></category>
		<category><![CDATA[T-SQL]]></category>

		<guid isPermaLink="false">http://sqlinthewild.co.za/?p=553</guid>
		<description><![CDATA[Continuing with the mini-series on query operators, I want to have a look at NOT EXISTS and NOT IN. Just one note before diving into that. The examples I’m using are fairly simplistic and that’s intentional. I’m trying to find what, if any, are the performance differences in a benchmark-style setup. I’ll have some comments [...]]]></description>
			<content:encoded><![CDATA[<p>Continuing with the mini-series on query operators, I want to have a look at NOT EXISTS and NOT IN.</p>
<p>Just one note before diving into that. The examples I’m using are fairly simplistic and that’s intentional. I’m trying to find what, if any, are the performance differences in a benchmark-style setup. I’ll have some comments on more complex examples in a later post.</p>
<p>The most important thing to note about NOT EXISTS and NOT IN is that, unlike EXISTS and IN,  they are not equivalent in all cases. Specifically, when NULLs are involved they will return different results. To be totally specific, when the subquery returns even one null, NOT IN will not match any rows.</p>
<p>The reason for this can be found by looking at the details of what the NOT IN operation actually means.</p>
<p>Let’s say, for illustration purposes, that there are 4 rows in a table called t, with a column called ID holding the values 1..4 and a column called AVal holding the values being compared.</p>
<pre class="brush: sql;">WHERE SomeValue NOT IN (SELECT AVal FROM t)
-- is equivalent to
WHERE (
SomeValue != (SELECT AVal FROM t WHERE ID=1)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=2)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=3)
AND
SomeValue != (SELECT AVal FROM t WHERE ID=4)
)</pre>
<p>Let’s further say that AVal is NULL where ID = 4. Hence that != comparison returns UNKNOWN. The logical truth table for AND states that UNKNOWN and TRUE is UNKNOWN, UNKNOWN and FALSE is FALSE. There is no value that can be AND’d with UNKNOWN to produce the result TRUE.</p>
<p>Hence, if any row of that subquery returns NULL, the entire NOT IN operator will evaluate to either FALSE or UNKNOWN and no records will be returned.</p>
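<p>The two relevant rows of that truth table can be checked directly. The sketch below uses hypothetical variables in place of the column values, and sets ANSI_NULLS ON explicitly (the usual default) because that setting governs how comparisons with NULL behave.</p>
<pre class="brush: sql;">SET ANSI_NULLS ON

DECLARE @SomeValue INT, @AVal INT
SET @SomeValue = 1
SET @AVal = NULL   -- plays the part of the row where ID = 4

SELECT
	-- TRUE AND UNKNOWN: neither the condition nor its negation is true
	CASE WHEN (@SomeValue != 2 AND @SomeValue != @AVal) THEN 'TRUE'
	     WHEN NOT (@SomeValue != 2 AND @SomeValue != @AVal) THEN 'FALSE'
	     ELSE 'UNKNOWN' END AS TrueAndUnknown,    -- UNKNOWN
	-- FALSE AND UNKNOWN: the FALSE side decides the result
	CASE WHEN (@SomeValue != 1 AND @SomeValue != @AVal) THEN 'TRUE'
	     WHEN NOT (@SomeValue != 1 AND @SomeValue != @AVal) THEN 'FALSE'
	     ELSE 'UNKNOWN' END AS FalseAndUnknown    -- FALSE</pre>
<p>Neither combination comes out as TRUE, which is why a WHERE clause built on them returns nothing.</p>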
<p>So what about EXISTS?</p>
<p><span id="more-553"></span></p>
<p>Exists cannot return NULL. It’s checking solely for the presence or absence of a row in the subquery and, hence, it can only return true or false. Since it cannot return NULL, there’s no possibility of a single NULL resulting in the entire expression evaluating to UNKNOWN.</p>
<p>Hence, when the column in the subquery that’s used for comparison with the outer table can have nulls in it, consider carefully which of Not Exists or Not in you want to use.</p>
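<p>A small sketch of the difference in results, using the 4-row table t from the illustration above as a table variable plus a hypothetical outer table. The value 5 has no match in t, yet only NOT EXISTS returns it.</p>
<pre class="brush: sql;">-- Hypothetical tables, for illustration only
DECLARE @SomeTable TABLE (SomeValue INT NOT NULL)
DECLARE @t TABLE (ID INT PRIMARY KEY, AVal INT NULL)

INSERT INTO @SomeTable (SomeValue) VALUES (1), (2), (5)
INSERT INTO @t (ID, AVal) VALUES (1, 1), (2, 2), (3, 3), (4, NULL)

-- NOT IN: the single NULL in AVal means no row can evaluate to TRUE.
-- Returns no rows.
SELECT o.SomeValue FROM @SomeTable o
WHERE o.SomeValue NOT IN (SELECT t.AVal FROM @t t)

-- NOT EXISTS: the NULL row simply never matches.
-- Returns 5.
SELECT o.SomeValue FROM @SomeTable o
WHERE NOT EXISTS (SELECT 1 FROM @t t WHERE t.AVal = o.SomeValue)</pre>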
<p>Ok, but say there are no nulls in the column. How do they compare speed-wise? I’m going to do two tests, one where the columns involved in the comparison are defined as NULL and one where they are defined as NOT NULL. There will be no null values in the columns in either case. In both cases, the join columns will be indexed. After all, we all index our join columns, right?</p>
<p>So, first test, non-nullable columns. First some setup</p>
<pre class="brush: sql;">CREATE TABLE BigTable (
id INT IDENTITY PRIMARY KEY,
SomeColumn char(4) NOT NULL,
Filler CHAR(100)
)

CREATE TABLE SmallerTable (
id INT IDENTITY PRIMARY KEY,
LookupColumn char(4) NOT NULL,
SomeArbDate Datetime default getdate()
)

INSERT INTO BigTable (SomeColumn)
SELECT top 250000
char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) +
char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
from master.sys.columns a cross join master.sys.columns b

INSERT INTO SmallerTable (LookupColumn)
SELECT DISTINCT SomeColumn
FROM BigTable TABLESAMPLE (25 PERCENT)
-- (3898 row(s) affected)

CREATE INDEX idx_BigTable_SomeColumn
ON BigTable (SomeColumn)
CREATE INDEX idx_SmallerTable_LookupColumn
ON SmallerTable (LookupColumn)</pre>
<p>Then the queries</p>
<pre class="brush: sql;">-- Query 1
SELECT ID, SomeColumn FROM BigTable
WHERE SomeColumn NOT IN (SELECT LookupColumn FROM SmallerTable)

-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE NOT EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)</pre>
<p>The first thing to note is that the execution plans are identical.</p>
<p><a href="http://sqlinthewild.co.za/wp-content/uploads/2010/02/ExecPlansNOTNULL.png"><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="ExecPlansNOTNULL" src="http://sqlinthewild.co.za/wp-content/uploads/2010/02/ExecPlansNOTNULL_thumb.png" border="0" alt="ExecPlansNOTNULL" width="244" height="130" /></a></p>
<p>The execution characteristics are also identical.</p>
<blockquote><p><strong>Query 1<br />
</strong>Table &#8216;BigTable&#8217;. Scan count 1, logical reads 342, physical reads 0.<br />
Table &#8216;SmallerTable&#8217;. Scan count 1, logical reads 8, physical reads 0.</p>
<p>SQL Server Execution Times:<br />
CPU time = 156 ms,  elapsed time = 221 ms.</p>
<p><strong>Query 2<br />
</strong>Table &#8216;BigTable&#8217;. Scan count 1, logical reads 342, physical reads 0.<br />
Table &#8216;SmallerTable&#8217;. Scan count 1, logical reads 8, physical reads 0.</p>
<p>SQL Server Execution Times:<br />
CPU time = 156 ms,  elapsed time = 247 ms.</p></blockquote>
<p>So, at least for the case where the columns are defined as NOT NULL, these two perform the same.</p>
<p>What about the case where the columns are defined as nullable? I&#8217;m going to simply alter the two columns involved without changing anything else, then test out the two queries again.</p>
<pre class="brush: sql;">ALTER TABLE BigTable
 ALTER COLUMN SomeColumn char(4) NULL

ALTER TABLE SmallerTable
 ALTER COLUMN LookupColumn char(4) NULL</pre>
<p>And the same two queries</p>
<pre class="brush: sql;">-- Query 1
SELECT ID, SomeColumn FROM BigTable
WHERE SomeColumn NOT IN (SELECT LookupColumn FROM SmallerTable)

-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE NOT EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)</pre>
<p>And as for their performance…</p>
<p><a href="http://sqlinthewild.co.za/wp-content/uploads/2010/02/ExecPlansNull.png"><img style="border-right-width: 0px; display: inline; border-top-width: 0px; border-bottom-width: 0px; border-left-width: 0px" title="ExecPlansNull" src="http://sqlinthewild.co.za/wp-content/uploads/2010/02/ExecPlansNull_thumb.png" border="0" alt="ExecPlansNull" width="244" height="123" /></a></p>
<blockquote><p><strong>Query 1</strong><br />
Table &#8216;SmallerTable&#8217;. Scan count 3, logical reads 500011, physical reads 0.<br />
Table &#8216;BigTable&#8217;. Scan count 1, logical reads 437, physical reads 0.</p>
<p>SQL Server Execution Times:<br />
CPU time = 827 ms,  elapsed time = 825 ms.</p>
<p><strong>Query 2<br />
</strong>Table &#8216;BigTable&#8217;. Scan count 1, logical reads 437, physical reads 0.<br />
Table &#8216;SmallerTable&#8217;. Scan count 1, logical reads 9, physical reads 0.</p>
<p>SQL Server Execution Times:<br />
CPU time = 156 ms,  elapsed time = 228 ms.</p></blockquote>
<p>Radically different execution plans, radically different performance characteristics. The NOT IN used over 5 times as much CPU time, took more than 3 times as long to run and did over a thousand times as many reads.</p>
<p>Why is that complex execution plan required when there may be nulls in the column? I can&#8217;t answer that one, probably only one of the query optimiser developers can. However, the results are obvious: when the columns allow nulls but contain none, NOT IN performs significantly worse than NOT EXISTS.</p>
<p>So, take-aways from this?</p>
<p>Most importantly, NOT EXISTS and NOT IN do not have the same behaviour when there are NULLs involved. Choose carefully which you want.</p>
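<p>As a minimal sketch of that difference (the table variables @OuterRows and @LookupRows are made up for illustration and are not part of the tests above), a single NULL returned by the subquery is enough to make NOT IN return nothing, while NOT EXISTS still returns the unmatched rows. This assumes the default ANSI_NULLS setting.</p>
<pre class="brush: sql;">-- Hypothetical tables, for illustration only
DECLARE @OuterRows TABLE (Val int NULL)
DECLARE @LookupRows TABLE (Val int NULL)

INSERT INTO @OuterRows (Val) VALUES (1)
INSERT INTO @OuterRows (Val) VALUES (2)
INSERT INTO @OuterRows (Val) VALUES (3)

INSERT INTO @LookupRows (Val) VALUES (1)
INSERT INTO @LookupRows (Val) VALUES (NULL)

-- Returns no rows: Val &lt;&gt; NULL evaluates to UNKNOWN,
-- so no outer row can ever satisfy the NOT IN
SELECT Val FROM @OuterRows
WHERE Val NOT IN (SELECT Val FROM @LookupRows)

-- Returns 2 and 3: the NULL in @LookupRows never matches anything,
-- so it does not affect the existence check
SELECT o.Val FROM @OuterRows o
WHERE NOT EXISTS (SELECT 1 FROM @LookupRows l WHERE l.Val = o.Val)</pre>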
<p>Columns that will never contain NULL values should be defined as NOT NULL so that SQL knows there will never be NULL values in them and so that it doesn’t have to produce complex plans to handle potential nulls.</p>
<p>On non-nullable columns, the behaviour and performance of NOT IN and NOT EXISTS are the same, so use whichever one works better for the specific situation.</p>
]]></content:encoded>
			<wfw:commentRss>http://sqlinthewild.co.za/index.php/2010/02/18/not-exists-vs-not-in/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>IN vs INNER JOIN</title>
		<link>http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/</link>
		<comments>http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/#comments</comments>
		<pubDate>Tue, 12 Jan 2010 14:00:58 +0000</pubDate>
		<dc:creator>Gail</dc:creator>
				<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[Syndication]]></category>
		<category><![CDATA[T-SQL]]></category>

		<guid isPermaLink="false">http://sqlinthewild.co.za/?p=394</guid>
		<description><![CDATA[Often in forum threads discussing query performance I&#8217;ll see people recommending replacing an INNER JOIN with an IN (or recommending replacing an IN with an INNER JOIN) for performance reasons. Hence it seems to be a good idea to investigate and see what the performance differences (if any) really are. One very important thing to [...]]]></description>
			<content:encoded><![CDATA[<p>Often in forum threads discussing query performance I&#8217;ll see people recommending replacing an INNER JOIN with an IN (or recommending replacing an IN with an INNER JOIN) for performance reasons. Hence it seems to be a good idea to investigate and see what the performance differences (if any) really are.</p>
<p>One very important thing to note right off is that they are not equivalent in all cases.</p>
<p>An inner join between two tables does a complete join: it checks for matches and returns rows. This means that if there are multiple matching rows in the second table, multiple rows will be returned. Also, when two tables are joined, columns can be returned from either. As a quick example (the definition of BigTable is towards the end of the post):</p>
<pre class="brush: sql;">DECLARE @SomeTable TABLE (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)
Insert into @SomeTable (IntCol) Values (5)

-- Join on equality; duplicate values in @SomeTable duplicate the matching BigTable rows
SELECT *
 FROM BigTable b INNER JOIN @SomeTable s ON b.SomeColumn = s.IntCol</pre>
<p>This returns 7 rows and returns columns from both tables. Because the values in @SomeTable are duplicated, the matching rows from BigTable are returned twice.</p>
<p>With an IN, what is done is a semi-join: a join that checks for matches but does not return rows from the second table. This means that if there are multiple matching rows in the resultset used for the IN, it doesn&#8217;t matter. Each matching row from the first table will be returned only once. Also, because the rows are not returned, columns from the table referenced in the IN cannot be returned. As a quick example:</p>
<pre class="brush: sql;">DECLARE @SomeTable TABLE (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)
Insert into @SomeTable (IntCol) Values (5)

-- Semi-join: duplicates in @SomeTable do not duplicate BigTable rows
SELECT *
 FROM BigTable
 WHERE SomeColumn IN (Select IntCol FROM @SomeTable)</pre>
<p>This returns 5 rows and only columns from BigTable.</p>
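<p>The two behaviours can be compared side by side with a small self-contained sketch. The table variables @Outer and @Dupes are invented for this illustration (they stand in for BigTable and @SomeTable, with an int column on both sides), and the last query shows one way, assuming only de-duplication is needed, to make a join return the same rows as the IN.</p>
<pre class="brush: sql;">-- Hypothetical stand-ins for BigTable and @SomeTable
DECLARE @Outer TABLE (ID int, SomeColumn int)
DECLARE @Dupes TABLE (IntCol int)

INSERT INTO @Outer (ID, SomeColumn) VALUES (1, 1)
INSERT INTO @Outer (ID, SomeColumn) VALUES (2, 2)
INSERT INTO @Outer (ID, SomeColumn) VALUES (3, 3)
INSERT INTO @Outer (ID, SomeColumn) VALUES (4, 4)
INSERT INTO @Outer (ID, SomeColumn) VALUES (5, 5)

INSERT INTO @Dupes (IntCol) VALUES (1)
INSERT INTO @Dupes (IntCol) VALUES (2)
INSERT INTO @Dupes (IntCol) VALUES (2)
INSERT INTO @Dupes (IntCol) VALUES (3)
INSERT INTO @Dupes (IntCol) VALUES (4)
INSERT INTO @Dupes (IntCol) VALUES (5)
INSERT INTO @Dupes (IntCol) VALUES (5)

-- Inner join: 7 rows, because the values 2 and 5 each match twice
SELECT COUNT(*) AS JoinRows
FROM @Outer o INNER JOIN @Dupes d ON o.SomeColumn = d.IntCol

-- IN (semi-join): 5 rows, one per matching row in @Outer
SELECT COUNT(*) AS InRows
FROM @Outer
WHERE SomeColumn IN (SELECT IntCol FROM @Dupes)

-- Join against a de-duplicated derived table: 5 rows again
SELECT COUNT(*) AS DistinctJoinRows
FROM @Outer o
INNER JOIN (SELECT DISTINCT IntCol FROM @Dupes) d ON o.SomeColumn = d.IntCol</pre>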
<p>So, that said, how does the performance of the two differ for the cases where the results are identical (no duplicates in the second table, no columns needed from the second table)? For that, I&#8217;m going to need larger tables to play with.<span id="more-394"></span></p>
<pre class="brush: sql;">
DROP TABLE dbo.BigTable
DROP TABLE dbo.SmallerTable

CREATE TABLE BigTable (
id INT IDENTITY PRIMARY KEY,
SomeColumn CHAR(4),
Filler CHAR(100)
)

CREATE TABLE SmallerTable (
id INT IDENTITY PRIMARY KEY,
LookupColumn CHAR(4),
SomeArbDate DATETIME DEFAULT GETDATE()
)

INSERT INTO BigTable (SomeColumn)
SELECT top 250000 char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) + char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
FROM master.sys.columns a CROSS JOIN master.sys.columns b

INSERT INTO SmallerTable (LookupColumn)
SELECT DISTINCT SomeColumn
FROM BigTable TABLESAMPLE (25 PERCENT)
-- (3819 row(s) affected)
</pre>
<p>That&#8217;s the setup done, now for the two test cases. Let’s first try without indexes and see how the INNER JOIN and IN compare. I&#8217;m selecting from just the first table to ensure that the two queries are logically identical. The DISTINCT used to populate the smaller table ensures that there are no duplicate rows in the smaller table.</p>
<pre class="brush: sql;">SELECT BigTable.ID, SomeColumn
FROM BigTable
WHERE SomeColumn IN (SELECT LookupColumn FROM dbo.SmallerTable)

SELECT BigTable.ID, SomeColumn
FROM BigTable
INNER JOIN SmallerTable ON dbo.BigTable.SomeColumn = dbo.SmallerTable.LookupColumn</pre>
<p>Something of interest straight away, the execution plans are almost identical. Not completely identical, but the only difference is that the hash join for the IN shows a Hash Match (Right Semi Join) and the hash join for the INNER JOIN shows a Hash Match (Inner Join)</p>
<p><a href="http://sqlinthewild.co.za/wp-content/uploads/2010/01/InVsSelect-1.png"><img class="alignnone size-thumbnail wp-image-513" style="border: 1px solid black;" title="In Vs Select 1" src="http://sqlinthewild.co.za/wp-content/uploads/2010/01/InVsSelect-1-150x150.png" alt="In Vs Select 1" width="150" height="150" /></a></p>
<p>The IOs are the same and the durations are extremely similar. Here are the IO results and durations for five tests.</p>
<p>IN</p>
<blockquote><p>Table &#8216;Worktable&#8217;. Scan count 0, logical reads 0, physical reads 0.<br />
Table &#8216;BigTable&#8217;. Scan count 1, logical reads 3639, physical reads 0.<br />
Table &#8216;SmallerTable&#8217;. Scan count 1, logical reads 14, physical reads 0.</p>
<p>SQL Server Execution Times:<br />
CPU time = 156 ms,  elapsed time = 2502 ms.<br />
CPU time = 157 ms,  elapsed time = 2323 ms.<br />
CPU time = 156 ms,  elapsed time = 2555 ms.<br />
CPU time = 188 ms,  elapsed time = 2381 ms.<br />
CPU time = 203 ms,  elapsed time = 2312 ms.</p></blockquote>
<p>INNER JOIN</p>
<blockquote><p>Table &#8216;Worktable&#8217;. Scan count 0, logical reads 0, physical reads 0.<br />
Table &#8216;BigTable&#8217;. Scan count 1, logical reads 3639, physical reads 0.<br />
Table &#8216;SmallerTable&#8217;. Scan count 1, logical reads 14, physical reads 0.</p>
<p>SQL Server Execution Times:<br />
CPU time = 125 ms,  elapsed time = 2922 ms.<br />
CPU time = 140 ms,  elapsed time = 2372 ms.<br />
CPU time = 188 ms,  elapsed time = 2530 ms.<br />
CPU time = 203 ms,  elapsed time = 2323 ms.<br />
CPU time = 187 ms,  elapsed time = 2512 ms.</p></blockquote>
<p>Now let&#8217;s try with some indexes on the join columns.</p>
<pre class="brush: sql;">CREATE INDEX idx_BigTable_SomeColumn ON BigTable (SomeColumn)
CREATE INDEX idx_SmallerTable_LookupColumn ON SmallerTable (LookupColumn)</pre>
<p>Now when I run the two queries, the execution plans are different, and the costs of the two are no longer 50%. Both do a single index scan on each table, but the IN has a Merge Join (Inner Join) and the INNER JOIN has a Hash Match (Inner Join)</p>
<p><a href="http://sqlinthewild.co.za/wp-content/uploads/2010/01/InVsSelect-2.png"><img class="alignnone size-thumbnail wp-image-514" style="border: 1px solid black;" title="InVsSelect 2" src="http://sqlinthewild.co.za/wp-content/uploads/2010/01/InVsSelect-2-150x150.png" alt="InVsSelect 2" width="150" height="150" /></a></p>
<p>The IOs are still identical, other than the WorkTable that only appears for the Hash Join.</p>
<p>IN</p>
<blockquote><p>Table &#8216;BigTable&#8217;. Scan count 1, logical reads 3639, physical reads 0.<br />
Table &#8216;SmallerTable&#8217;. Scan count 1, logical reads 14, physical reads 0.</p></blockquote>
<p>INNER JOIN</p>
<blockquote><p>Table &#8216;Worktable&#8217;. Scan count 0, logical reads 0, physical reads 0.<br />
Table &#8216;BigTable&#8217;. Scan count 1, logical reads 3639, physical reads 0.<br />
Table &#8216;SmallerTable&#8217;. Scan count 1, logical reads 14, physical reads 0.</p></blockquote>
<p>So what about the durations? Honestly it&#8217;s hard to say anything completely conclusive; the durations of both queries are quite small and they are very close. To see if there is any measurable difference, I&#8217;m going to run each one 100 times, use Profiler to log the duration and CPU and then average the results over the 100 executions. While running this, I&#8217;m also going to close/disable everything else I can on the computer, to try and get reasonably accurate times.</p>
<p>IN</p>
<p>Average CPU: 130.<br />
Avg duration: 2.78 seconds</p>
<p>INNER JOIN</p>
<p>Average CPU: 161.<br />
Avg duration: 2.93 seconds</p>
<p>Now is that enough to be significant? I&#8217;m not sure. However, looking at those results along with the IO and execution plans, I do have a recommendation for In vs Inner Join</p>
<p>If all you need is to check for matching rows in the other table but don&#8217;t need any columns from that table, use IN. If you do need columns from the second table, use Inner Join.</p>
<p>I still intend to go over NOT IN and NOT EXISTS and, after this one, I also want to take a look at the LEFT JOIN with IS NULL check vs NOT IN for when you want rows from Table1 that don&#8217;t have a match in Table 2.</p>
]]></content:encoded>
			<wfw:commentRss>http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
		<item>
		<title>EXISTS vs IN</title>
		<link>http://sqlinthewild.co.za/index.php/2009/08/17/exists-vs-in/</link>
		<comments>http://sqlinthewild.co.za/index.php/2009/08/17/exists-vs-in/#comments</comments>
		<pubDate>Mon, 17 Aug 2009 14:01:37 +0000</pubDate>
		<dc:creator>Gail</dc:creator>
				<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[Syndication]]></category>
		<category><![CDATA[T-SQL]]></category>

		<guid isPermaLink="false">http://sqlinthewild.co.za/?p=287</guid>
		<description><![CDATA[This one&#8217;s come up a few times recently, so I&#8217;ll take a look at it. The EXISTS and IN clauses at first glance look fairly similar. They both use a subquery to evaluate rows, but they do it in a slightly different way IN does a direct match between the column specified before the IN [...]]]></description>
			<content:encoded><![CDATA[<p>This one&#8217;s come up a few times recently, so I&#8217;ll take a look at it.</p>
<p>The EXISTS and IN clauses at first glance look fairly similar. They both use a subquery to evaluate rows, but they do it in a slightly different way</p>
<p>IN does a direct match between the column specified before the IN keyword and the values returned by the subquery. When using IN there can only be a single column specified in the select clause of the subquery</p>
<p>Let&#8217;s have a look at a quick example</p>
<pre class="brush: sql;">DECLARE @SomeTable TABLE (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
FROM BigTable
WHERE SomeColumn IN (Select IntCol FROM @SomeTable)</pre>
<p>So this query returns all the rows from BigTable where SomeColumn has any of the values returned by the subquery, that is 1,2,3,4 or 5</p>
<p>But what if there were duplicate values returned by the subquery?</p>
<p><span id="more-287"></span>Well, it actually doesn&#8217;t matter. All that SQL is looking for is what values the subquery returns to process the filter. It&#8217;s not joining the two resultsets together so it makes no difference to the results if there are duplicate values returned by the subquery.</p>
<p>To put it more technically, SQL&#8217;s doing a semi-join, a join that can only eliminate or qualify rows from the first table, but cannot duplicate them.</p>
<p>So that&#8217;s IN. What about EXISTS</p>
<p>Exists doesn&#8217;t check for a match, it doesn&#8217;t care in the slightest what values are being returned from the expression, it just checks for whether a row exists or not. Because of that, if there&#8217;s no predicate in the WHERE clause of the subquery that compares rows in the subquery with rows in the outer query, EXISTS will either return true for all the rows in the outer query or it will return false for all the rows in the outer query.</p>
<pre class="brush: sql;">DECLARE @SomeTable TABLE (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
FROM BigTable
WHERE EXISTS (Select IntCol FROM @SomeTable)</pre>
<p>This will return every single row in BigTable, because Select IntCol FROM @SomeTable returns 5 rows and hence the EXISTS predicate is always true.</p>
<p>Hence, to use EXISTS to do the same kind of thing as IN, there must be a correlation predicate within the subquery.</p>
<pre class="brush: sql;">DECLARE @SomeTable TABLE (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
FROM BigTable bt
WHERE EXISTS (Select IntCol FROM @SomeTable st WHERE bt.SomeColumn = st.IntCol)</pre>
<p>Now this will behave like the IN because it&#8217;s checking for matching rows and only returning true when there is a match. This will return all the rows from BigTable where SomeColumn has values 1,2,3,4 or 5 because those are the values in @SomeTable.</p>
<p>Exists is better for when comparisons are needed on two or more columns. For example, this cannot be done easily with an IN:</p>
<pre class="brush: sql;">DECLARE @SomeTable TABLE (IntCol int, charCol char(1))
Insert into @SomeTable (IntCol, charCol) Values (1, 'a')
Insert into @SomeTable (IntCol, charCol) Values (2, 'a')
Insert into @SomeTable (IntCol, charCol) Values (3, 'a')
Insert into @SomeTable (IntCol, charCol) Values (4, 'b')
Insert into @SomeTable (IntCol, charCol) Values (5, 'b')

SELECT *
FROM BigTable bt
WHERE EXISTS (Select IntCol FROM @SomeTable st
WHERE bt.SomeColumn = st.IntCol AND bt.SomeOtherColumn = st.charCol)</pre>
<p>So that covers how they work, but how do they perform in comparison with each other? To answer that question, first I need some fairly large tables.</p>
<pre class="brush: sql;">Create Table BigTable (
id int identity primary key,
SomeColumn char(4),
Filler char(100)
)

Create Table SmallerTable (
id int identity primary key,
LookupColumn char(4),
SomeArbDate Datetime default getdate()
)

INSERT INTO BigTable (SomeColumn)
SELECT top 250000
char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) +
char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
from master.sys.columns a cross join master.sys.columns b

INSERT INTO SmallerTable (LookupColumn)
SELECT DISTINCT SomeColumn
FROM BigTable TABLESAMPLE (25 PERCENT)
-- (3955 row(s) affected)
</pre>
<p>Let&#8217;s first try without indexes and see how EXISTS and IN compare.</p>
<pre class="brush: sql;">-- Query 1
SELECT ID, SomeColumn FROM BigTable
WHERE SomeColumn IN (SELECT LookupColumn FROM SmallerTable)

-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE EXISTS (SELECT LookupColumn FROM SmallerTable WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)</pre>
<p>The first thing to note is that the execution plans are identical. Two clustered index scans and a hash join (right semi-join).</p>
<p><a href="http://sqlinthewild.co.za/wp-content/uploads/2009/08/exists1.png"><img class="alignnone size-thumbnail wp-image-297" title="Exists vs IN" src="http://sqlinthewild.co.za/wp-content/uploads/2009/08/exists1-150x150.png" alt="" width="150" height="150" /></a></p>
<p>The IOs are also identical.</p>
<blockquote><p>Table &#8216;Worktable&#8217;. Scan count 0, logical reads 0, physical reads 0.<br />
Table &#8216;BigTable&#8217;. Scan count 1, logical reads 3639, physical reads 0.<br />
Table &#8216;SmallerTable&#8217;. Scan count 1, logical reads 15, physical reads 0.</p></blockquote>
<p>So these two queries are executed by SQL in exactly the same way. No performance differences here.</p>
<p>Now, let me add indexes to both tables, on that join column and see what changes.</p>
<pre class="brush: sql;">CREATE INDEX idx_BigTable_SomeColumn
ON BigTable (SomeColumn)
CREATE INDEX idx_SmallerTable_LookupColumn
ON SmallerTable (LookupColumn)</pre>
<p>With those created, I&#8217;m going to run the above two queries again. Again the execution plans of the two are identical, though the hash join and clustered index scans are gone, replaced by index scans, a stream aggregate and a merge join (inner join)</p>
<p><a href="http://sqlinthewild.co.za/wp-content/uploads/2009/08/exists2.png"><img class="alignnone size-thumbnail wp-image-298" title="Exists vs IN" src="http://sqlinthewild.co.za/wp-content/uploads/2009/08/exists2-150x150.png" alt="" width="150" height="150" /></a></p>
<p>The IOs are again identical and the execution times very close.</p>
<blockquote><p>Table &#8216;BigTable&#8217;. Scan count 1, logical reads 343, physical reads 0.<br />
Table &#8216;SmallerTable&#8217;. Scan count 1, logical reads 8, physical reads 0.</p></blockquote>
<p>So IN and EXISTS appear to perform identically both when there are no indexes on the matching columns and when there are, and this is true regardless of whether or not there are nulls in either the subquery or in the outer table.</p>
<p>But what about NOT IN and NOT EXISTS? That&#8217;s going to have to wait for another post, because it&#8217;s not as simple as it may initially appear.</p>
]]></content:encoded>
			<wfw:commentRss>http://sqlinthewild.co.za/index.php/2009/08/17/exists-vs-in/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Functions, IO statistics and the Execution plan</title>
		<link>http://sqlinthewild.co.za/index.php/2009/04/29/functions-io-statistics-and-the-execution-plan/</link>
		<comments>http://sqlinthewild.co.za/index.php/2009/04/29/functions-io-statistics-and-the-execution-plan/#comments</comments>
		<pubDate>Wed, 29 Apr 2009 21:21:49 +0000</pubDate>
		<dc:creator>Gail</dc:creator>
				<category><![CDATA[Performance]]></category>
		<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[Syndication]]></category>
		<category><![CDATA[T-SQL]]></category>

		<guid isPermaLink="false">http://sqlinthewild.co.za/?p=231</guid>
		<description><![CDATA[It&#8217;s no secret that I&#8217;m not overly fond of most user-defined functions. This isn&#8217;t just a pet hate, I have some good reasons for disliking them. All too often they&#8217;re performance bottlenecks, but that can be said about many things in SQL. The bigger problem is that they&#8217;re hidden performance bottlenecks that often go overlooked [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s no secret that I&#8217;m not overly fond of most user-defined functions. This isn&#8217;t just a pet hate, I have some good reasons for disliking them. All too often they&#8217;re performance bottlenecks, but that can be said about many things in SQL. The bigger problem is that they&#8217;re hidden performance bottlenecks that often go overlooked and ignored for too long.</p>
<p>I&#8217;m going to start with this fairly simple scalar function, created in the AdventureWorks database</p>
<pre class="brush: sql;">Create function LineItemTotal(@ProductID int)
returns money
as
begin
declare @Total money

select @Total = sum(LineTotal) from sales.SalesOrderDetail where productid = @ProductID

return @Total
end</pre>
<p>So, given that function, the following two queries should be equivalent.</p>
<pre class="brush: sql;">-- Query 1: scalar function
SELECT productid, productnumber, dbo.LineItemTotal(productid) as SumTotal
FROM Production.Product p

-- Query 2: correlated subquery
SELECT productid, productnumber,
(select sum(LineTotal) from sales.SalesOrderDetail where productid = p.productid) AS SumTotal
FROM Production.Product p</pre>
<p>No problems so far. They both return 504 rows (in my copy of AW, which has been slightly padded out with more data). Now, let&#8217;s look at the execution characteristics by running them again with Statistics IO and Statistics Time on.</p>
<p>Query 1, the one with the scalar function:</p>
<blockquote><p>Table &#8216;Product&#8217;. Scan count 1, logical reads 4, physical reads 0.</p>
<p>SQL Server Execution Times:<br />
CPU time = 47297 ms,  elapsed time = 47541 ms.</p></blockquote>
<p>Query 2, the one with the correlated subquery:</p>
<blockquote><p>Table &#8216;Worktable&#8217;. Scan count 0, logical reads 0, physical reads 0.<br />
Table &#8216;SalesOrderDetail&#8217;. Scan count 3, logical reads 22536, physical reads 0.<br />
Table &#8216;Product&#8217;. Scan count 3, logical reads 40, physical reads 0.</p>
<p>SQL Server Execution Times:<br />
CPU time = 1047 ms,  elapsed time = 1249 ms.</p></blockquote>
<p><span id="more-231"></span>There are two things to note here. Firstly the execution time. While the query with a correlated subquery used just over a second of CPU time, the query with the function used close to a minute of CPU time. But take a look at the IO statistics, not so much for what&#8217;s there, but for what&#8217;s not there. In the IO stats for the query with the user-defined function, there&#8217;s no mention of the SalesOrderDetail table at all.</p>
<p>The output of Statistics IO only shows the IO characteristics of the outer query, not of the function, so from here we have absolutely no idea what kind of IO impact that function is causing.</p>
<p>So that&#8217;s one problem, now what about the execution plan? From the CPU and elapsed times, it&#8217;s pretty clear that the query with the function is far more expensive than the query without.</p>
<p><a href="http://sqlinthewild.co.za/wp-content/uploads/2009/04/functionexecplan.png"><img class="alignnone size-medium wp-image-237" style="border: 1px solid black;" title="functionexecplan" src="http://sqlinthewild.co.za/wp-content/uploads/2009/04/functionexecplan-300x104.png" alt="" width="300" height="104" /></a></p>
<p>The costings on those plans are so far off as to be completely useless. The query that takes 1 sec of CPU time is apparently 100% of the cost of the batch and the one that takes 47 sec of CPU time is apparently 0% of the cost of the batch. Somehow I don&#8217;t think that&#8217;s right. There&#8217;s also no indication whatsoever as to what the function is doing. The user-defined function is represented by the Compute Scalar operator in the first plan.</p>
<p><img class="alignnone size-full wp-image-238" title="functionexecplan2" src="http://sqlinthewild.co.za/wp-content/uploads/2009/04/functionexecplan2.png" alt="" width="314" height="364" /></p>
<p>Again, there&#8217;s no indication at all as to the true cost of that function. Estimated I/O cost of 0? I somehow doubt it.</p>
<p>This is the main reason I have a problem with user-defined functions, the way they interact with both the query stats and the execution plan makes it very difficult to see what their impact really is when one&#8217;s doing performance testing. Someone who&#8217;s not very familiar with the intricacies and nuances of SQL, or who&#8217;s just using the exec plan without examining anything else may mistakenly conclude that user-defined functions perform well.</p>
<p>So, how do you see what a query that uses a function is actually doing? The only way is to haul out Profiler and run a trace.</p>
<p>I&#8217;m going to start with just a trace on T-SQL:BatchCompleted, and I&#8217;m going to run those two queries separately to see if I can get a more accurate picture of the IO impact.</p>
<p>The profiler trace shows a very different picture in terms of IOs than Statistics IO did.<br />
<img class="alignnone size-full wp-image-239" title="functionprofiler1" src="http://sqlinthewild.co.za/wp-content/uploads/2009/04/functionprofiler1.png" alt="" width="501" height="43" /><br />
20000 reads for the query with the subquery, 4 and a half million for the query with the function. Ouch.</p>
<p>But wait, there&#8217;s more.</p>
<p>I&#8217;m going to run that profiler trace again, but this time I&#8217;m going to include the SP:Completed event.<br />
<a href="http://sqlinthewild.co.za/wp-content/uploads/2009/04/functionprofiler21.png"><img class="alignnone size-full wp-image-241" title="functionprofiler21" src="http://sqlinthewild.co.za/wp-content/uploads/2009/04/functionprofiler21.png" alt="" width="500" height="411" /></a></p>
<p>Running the query with the function resulted in 505 events picked up by Profiler. One was the batch completed event, fired as the entire query completed its execution. The other 504 were all SP:Completed events. As I noted earlier, the query that I&#8217;m running here returns 504 rows.</p>
<p>The function is being executed once for each row in the query. That&#8217;s why the duration of the query is so high (each execution of the function takes between 70 and 140 ms) and it&#8217;s why the reads are so exceedingly high. Each time the function executes it&#8217;s (in my case) doing a table scan of the SalesOrderDetail table (I have no index on ProductID). If an index is added on that column, the performance of the function becomes much better, but that may not always be the case.</p>
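<p>As a rough sketch of the index mentioned above (the index name is made up for illustration, and a standard copy of AdventureWorks may already ship with an index on this column), something like the following would let each execution of the function seek on ProductID rather than scan the whole table. The function is still executed once per row of the outer query, so this only reduces the cost of each call; it does not remove the per-row execution.</p>
<pre class="brush: sql;">-- Hypothetical index name; check for an existing index on ProductID first
CREATE INDEX idx_SalesOrderDetail_ProductID
ON Sales.SalesOrderDetail (ProductID)</pre>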
<p>Now this was a fairly simple function. Imagine if the function was a few hundred lines of T-SQL with multiple queries in it. Not a pleasant thought.</p>
]]></content:encoded>
			<wfw:commentRss>http://sqlinthewild.co.za/index.php/2009/04/29/functions-io-statistics-and-the-execution-plan/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>On Counts</title>
		<link>http://sqlinthewild.co.za/index.php/2009/04/14/on-counts/</link>
		<comments>http://sqlinthewild.co.za/index.php/2009/04/14/on-counts/#comments</comments>
		<pubDate>Tue, 14 Apr 2009 14:58:57 +0000</pubDate>
		<dc:creator>Gail</dc:creator>
				<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[T-SQL]]></category>

		<guid isPermaLink="false">http://sqlinthewild.co.za/?p=183</guid>
		<description><![CDATA[Or &#8220;What&#8217;s the fastest way to count the rows?&#8221; It&#8217;s a fairly common to need to know the number of rows in a table, the number of rows that match a certain condition or whether or not there are any rows that match a condition. There&#8217;s also a number of ways of doing so, some [...]]]></description>
			<content:encoded><![CDATA[<p>Or &#8220;<em>What&#8217;s the fastest way to count the rows?&#8221;</em></p>
<p>It&#8217;s fairly common to need to know the number of rows in a table, the number of rows that match a certain condition or whether or not there are any rows that match a condition. There are also a number of ways of doing so, some better than others. The problem is that counting is not a cheap operation, especially on big tables. It&#8217;s not as bad as a sort, but it still can be expensive.</p>
<p>So, given that, let&#8217;s take a look at some of the ways.</p>
<p><strong>Querying the metadata</strong></p>
<p>If all that&#8217;s needed is the number of rows in the table, and it&#8217;s not 100% important that the value be completely accurate all the time, the system metadata can be queried. In SQL 2000 and below, that info was in sysindexes. In 2005 and higher it&#8217;s been moved into sys.partitions.</p>
<pre class="brush: sql;">SELECT OBJECT_NAME(object_id) AS TableName, SUM(rows) AS TotalRows
FROM sys.partitions
WHERE index_id in (0,1)
AND object_id = OBJECT_ID('TableName')
GROUP BY object_id</pre>
<p>The advantage of this approach is that it is fast. Since it&#8217;s not actually counting anything and, in fact, isn&#8217;t even accessing the table that&#8217;s being counted, it&#8217;s the fastest way to get the count of rows in the table.</p>
<p>The disadvantage is it can only get the number of rows in the table and cannot consider any criteria at all. It also may not be 100% accurate, depending how and when the table&#8217;s rowcount metadata is updated by the SQL engine.<span id="more-183"></span></p>
<p><strong>Checking for existence of rows</strong></p>
<p>If all that&#8217;s needed is to know whether or not a row exists for certain criteria, use EXISTS rather than counting the entire number of rows and checking that it&#8217;s greater than 0. It&#8217;s (slightly) quicker and it uses fewer IOs.</p>
<pre class="brush: sql;">CREATE TABLE TestingCounts (
ID INT IDENTITY PRIMARY KEY,
AChar CHAR(1),
SomePaddingColumn CHAR(400)
)

INSERT INTO TestingCounts (SomePaddingColumn, AChar)
SELECT TOP (100000) a.name, LEFT(b.name,1)
FROM sys.columns a CROSS JOIN sys.columns b
GO

SET STATISTICS IO ON
SET STATISTICS TIME ON
GO

-- Count every matching row, then check the total
DECLARE @RowCount INT
SELECT @RowCount = COUNT(*) FROM TestingCounts WHERE ID&gt;50000
IF (@RowCount &gt; 0)
PRINT 'Count - Rows Exist'

-- Only check whether at least one matching row exists
IF EXISTS (SELECT 1 FROM TestingCounts WHERE ID&gt;50000)
PRINT 'Exists - Rows Exist'</pre>
<p>Count:<br />
Table &#8216;TestingCounts&#8217;. Scan count 1, logical reads 2645, physical reads 0.<br />
SQL Server Execution Times:<br />
CPU time = 15 ms,  elapsed time = 13 ms.</p>
<p>Exists:<br />
Table &#8216;TestingCounts&#8217;. Scan count 1, logical reads 3<br />
SQL Server Execution Times:<br />
CPU time = 0 ms,  elapsed time = 0 ms.</p>
<p><strong>Count(*) vs Count(ColumnName)</strong></p>
<p>The difference between these two is not always understood, and I&#8217;ve often seen hints and tips that say that COUNT(ColumnName) is better than COUNT(*). That advice is wrong and it&#8217;s possible that it comes from the (correct) advice that SELECT &lt;Column List&gt; is better than SELECT *.</p>
<p>In truth, COUNT(*) means to count the number of rows in the resultset. COUNT(ColumnName) means to count the number of rows in the resultset where that column is not null. Hence, COUNT(ColumnName) will return a different result to COUNT(*) if the column that&#8217;s used is nullable and contains any nulls.</p>
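<p>As a minimal sketch of that difference (the table variable and column names here are made up purely for illustration), three rows with one NULL give two different counts:</p>
<pre class="brush: sql;">-- Hypothetical table variable, for illustration only
DECLARE @CountDemo TABLE (ID int NOT NULL, NullableCol char(1) NULL)

INSERT INTO @CountDemo (ID, NullableCol)
SELECT 1, 'a' UNION ALL
SELECT 2, NULL UNION ALL
SELECT 3, 'c'

-- AllRows counts every row; NonNullRows skips the row where NullableCol is NULL
SELECT COUNT(*) AS AllRows, COUNT(NullableCol) AS NonNullRows
FROM @CountDemo</pre>
<p>AllRows comes back as 3 and NonNullRows as 2.</p>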
<p>So what does that mean for the performance of the two?</p>
<p>Because COUNT(*) just means count the rows then, assuming that there are no other criteria in the query, to satisfy it SQL will find the smallest index that exists on the table and scan the leaf pages to count the rows.</p>
<p>COUNT(ColumnName) requires that the column specified be checked to see if there are any null values. If the column that&#8217;s specified is defined as NOT NULL, then SQL treats it just like a COUNT(*). If the column is nullable then, regardless of whether or not it contains any NULLs, it has to be checked. That means that to evaluate a COUNT(ColumnName), SQL must either scan an index that has that column in it or it must do a full table scan.</p>
<p>This means that, at best, a COUNT(Column) can be as fast as a COUNT(*), but it cannot be faster.</p>
<p>-- using the same table created above</p>
<p>CREATE INDEX idx_AChar ON TestingCounts (Achar)<br />
GO</p>
<p>SELECT COUNT(*) FROM TestingCounts WHERE ID%3 = 0<br />
SELECT COUNT(SomePaddingColumn) FROM TestingCounts WHERE ID%3 = 0</p>
<p>Count(*):<br />
Table 'TestingCounts'. Scan count 1, logical reads 138.<br />
SQL Server Execution Times:<br />
CPU time = 16 ms,  elapsed time = 15 ms.</p>
<p>COUNT(SomePaddingColumn):<br />
Table 'TestingCounts'. Scan count 1, logical reads 5285.<br />
SQL Server Execution Times:<br />
CPU time = 47 ms,  elapsed time = 47 ms.</p>
<p>Of course, if there are other conditions in the query or a group by, the situation becomes rather more complicated as to what indexes will be used. The main point though remains true. Because COUNT(ColumnName) has to check that column for NULL values and COUNT(*) does not, COUNT(ColumnName) cannot be the faster option.</p>
]]></content:encoded>
			<wfw:commentRss>http://sqlinthewild.co.za/index.php/2009/04/14/on-counts/feed/</wfw:commentRss>
		<slash:comments>22</slash:comments>
		</item>
		<item>
		<title>Dynamic SQL and SQL injection</title>
		<link>http://sqlinthewild.co.za/index.php/2009/04/03/dynamic-sql-and-sql-injection/</link>
		<comments>http://sqlinthewild.co.za/index.php/2009/04/03/dynamic-sql-and-sql-injection/#comments</comments>
		<pubDate>Fri, 03 Apr 2009 18:49:25 +0000</pubDate>
		<dc:creator>Gail</dc:creator>
				<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[Syndication]]></category>
		<category><![CDATA[T-SQL]]></category>

		<guid isPermaLink="false">http://sqlinthewild.co.za/?p=230</guid>
		<description><![CDATA[When I wrote about the catch-all queries, someone asked why the dynamic SQL that I offered wasn&#8217;t vulnerable to SQL injection. I thought I&#8217;d go into the whys and the wherefores of that in a little bit more detail. I&#8217;m just going to look at SQL injection from the aspect of dynamic SQL. The front-end [...]]]></description>
			<content:encoded><![CDATA[<p>When I wrote about the catch-all queries, someone asked why the dynamic SQL that I offered wasn&#8217;t vulnerable to SQL injection. I thought I&#8217;d go into the whys and the wherefores of that in a little bit more detail.</p>
<p>I&#8217;m just going to look at SQL injection from the aspect of dynamic SQL. The front-end code aspect has been dealt with hundreds of times, most recently here &#8211; <a href="http://www.simple-talk.com/community/blogs/philfactor/archive/2009/03/30/72651.aspx">http://www.simple-talk.com/community/blogs/philfactor/archive/2009/03/30/72651.aspx</a></p>
<p>The most important thing to realise with SQL Injection (and with all other forms of command injection) is that it requires that a user-inputted string be incorporated as part of a command that&#8217;s going to be executed. Not as part of a parameter value, but as part of the command itself.</p>
<p>Let me show you what I mean.</p>
<p>DECLARE @sSQL varchar(500)<br />
SET @sSQL = 'SELECT * FROM sys.objects'</p>
<p>EXECUTE (@sSQL)</p>
<p>In this exceedingly simple example, there&#8217;s no possibility for SQL injection. There&#8217;s no user-inputted string that can become part of the command. Let&#8217;s look at two slightly more complex examples.</p>
<p><strong>Example 1:</strong></p>
<p>DECLARE @inputParam VARCHAR(100) -- Assume this comes from user input<br />
DECLARE @sSQL varchar(500)</p>
<p>SET @sSQL = 'SELECT * FROM '</p>
<p>IF @inputParam = 'Table1'<br />
SET @sSQL = @sSQL + 'Table1'<br />
IF @inputParam = 'Table2'<br />
SET @sSQL = @sSQL + 'Table2'<br />
IF @inputParam = 'Table3'<br />
SET @sSQL = @sSQL + 'Table3'<br />
IF @inputParam = 'Table4'<br />
SET @sSQL = @sSQL + 'Table4'</p>
<p>EXECUTE (@sSQL)</p>
<p><strong>Example 2:</strong></p>
<p>DECLARE @inputParam VARCHAR(100) -- Assume this comes from user input<br />
DECLARE @sSQL varchar(500)</p>
<p>SET @sSQL = 'SELECT * FROM ' + @inputParam</p>
<p>EXECUTE (@sSQL)</p>
<p><span id="more-230"></span>Now, what about these two examples? Let&#8217;s assume that someone&#8217;s trying a SQL injection attack and has passed, for @inputParam, a value of &#8220;Table1; Drop Table Table1 --&#8221;</p>
<p>In example 1, the value that&#8217;s passed in does not match any of the IF conditions. Hence, the resulting SQL that will get executed is 'SELECT * FROM '. That&#8217;s going to throw a syntax error, but nothing more. The malicious statement did not get injected into the command that was run. Hence, no SQL injection here.</p>
<p>What about example 2? For the same value of @inputParam, the command that will be executed is 'SELECT * FROM Table1; Drop Table Table1 --'. When that&#8217;s run, assuming sufficient permissions, Table1 is going to be dropped. Not good.</p>
<p>In this case, because the input parameter was made a direct part of the string that was getting executed, there was a possibility of SQL injection; this example is vulnerable.</p>
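<p>If the list of allowed tables is too long to hard-code as in example 1, one possible alternative is to check the value against the catalog and quote it before it goes anywhere near the command. This is only a sketch, not something from the examples above: it assumes the table lives in the caller&#8217;s default schema, and it uses the built-in sys.tables view and QUOTENAME function.</p>
<pre class="brush: sql;">DECLARE @inputParam sysname -- Assume this comes from user input
DECLARE @sSQL nvarchar(500)

-- Only build the command if the value is exactly the name of an existing table
IF EXISTS (SELECT 1 FROM sys.tables WHERE name = @inputParam)
BEGIN
    -- QUOTENAME wraps the name in square brackets and escapes any closing bracket inside it
    SET @sSQL = N'SELECT * FROM ' + QUOTENAME(@inputParam)
    EXEC sp_executesql @sSQL
END</pre>
<p>With the malicious value from above, no table of that name exists, so nothing is executed at all.</p>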
<p>Now let&#8217;s look at a couple of examples similar to the one I gave in my previous post, ones with dynamic where clauses.</p>
<p><strong>Example 1:</strong></p>
<p>DECLARE @inputParam1 VARCHAR(100) -- Assume this comes from user input<br />
DECLARE @inputParam2 VARCHAR(100) -- Assume this comes from user input<br />
DECLARE @sSQL nvarchar(500)</p>
<p>SET @sSQL = 'SELECT * FROM SomeTable WHERE Active = 1 '<br />
IF @inputParam1 IS NOT NULL<br />
SET @sSQL = @sSQL + ' AND Column1 = @innerParameter1'<br />
IF @inputParam2 IS NOT NULL<br />
SET @sSQL = @sSQL + ' AND Column2 = @innerParameter2'</p>
<p>exec sp_executesql @sSQL, N'@innerParameter1 varchar(100), @innerParameter2 varchar(100)', @innerParameter1 = @inputParam1, @innerParameter2 = @inputParam2</p>
<p><strong>Example 2:</strong></p>
<p>DECLARE @inputParam1 VARCHAR(100) -- Assume this comes from user input<br />
DECLARE @inputParam2 VARCHAR(100) -- Assume this comes from user input<br />
DECLARE @sSQL varchar(500)</p>
<p>SET @sSQL = 'SELECT * FROM SomeTable WHERE Active = 1 '<br />
IF @inputParam1 IS NOT NULL<br />
SET @sSQL = @sSQL + ' AND Column1 = ''' + @inputParam1 + ''''<br />
IF @inputParam2 IS NOT NULL<br />
SET @sSQL = @sSQL + ' AND Column2 = ''' + @inputParam2 + ''''</p>
<p>EXECUTE (@sSQL)</p>
<p>In the first example, the input parameters never become a direct part of the string that is being executed. They are used to control what portions are added to the string and they are passed, as parameters, to sp_executesql, but they themselves are not incorporated into the string.</p>
<p>In the second example, the parameters are used to control what portions are added to the string but they are also directly concatenated into the string. So whatever&#8217;s inside the parameters will become part of the string that is going to be executed.</p>
<p>So, what happens in this case if a malicious user passes, for inputParam1, this: &#8220;abc'; drop table SomeTable;--&#8221; and leaves inputParam2 blank?</p>
<p>In the first example, since inputParam1 has a value and inputParam2 does not, the resulting SQL string is</p>
<p>SELECT * FROM SomeTable<br />
WHERE Active = 1 AND Column1 = @innerParameter1</p>
<p>That is then executed by sp_executesql, with the value containing the attempted SQL injection passed as a parameter. The query executes looking for rows where Column1 has the actual value &#8220;abc'; drop table SomeTable;--&#8221; (which is quite unlikely to match anything). Since the input parameters did not become part of the string executed, there is no possibility for SQL injection here.</p>
<p>What about the second example?</p>
<p>Well, in that example, if inputParam1 has the same value given above and inputParam2 is blank, the resulting string that will be executed is</p>
<p>SELECT * FROM SomeTable<br />
WHERE Active = 1 AND Column1 = 'abc'; drop table SomeTable;--'</p>
<p>Not good.</p>
<p>So, in summary, if a user-specified value is included as an actual part of a SQL statement to be executed, that statement is vulnerable to SQL injection. If the parameters are used rather to control what the string looks like but are not made a direct part of it, then there is no opening for SQL injection. I hope this has cleared up at least a little bit of the confusion around the topic.</p>
]]></content:encoded>
			<wfw:commentRss>http://sqlinthewild.co.za/index.php/2009/04/03/dynamic-sql-and-sql-injection/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Catch-all queries</title>
		<link>http://sqlinthewild.co.za/index.php/2009/03/19/catch-all-queries/</link>
		<comments>http://sqlinthewild.co.za/index.php/2009/03/19/catch-all-queries/#comments</comments>
		<pubDate>Thu, 19 Mar 2009 21:43:51 +0000</pubDate>
		<dc:creator>Gail</dc:creator>
				<category><![CDATA[Performance]]></category>
		<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[Syndication]]></category>
		<category><![CDATA[T-SQL]]></category>

		<guid isPermaLink="false">http://sqlinthewild.co.za/?p=174</guid>
		<description><![CDATA[The query form that I refer to as &#8216;Catch-all&#8217; typically results from search screens in the application where the user may enter any one (or more) of a number of optional parameters. One of the more common ways for such a query to be written in SQL is with multiple predicates in the where clause [...]]]></description>
			<content:encoded><![CDATA[<p>The query form that I refer to as &#8216;Catch-all&#8217; typically results from search screens in the application where the user may enter any one (or more) of a number of optional parameters. One of the more common ways for such a query to be written in SQL is with multiple predicates in the where clause of the form (WHERE SomeColumn = @SomeVariable OR @SomeVariable IS NULL)</p>
<p>Now this does work. The problem is that it works fairly inefficiently and, on large tables, can result in really poor query performance. I&#8217;m going to take a look at why that is the case and what alternatives there are.</p>
<p>Erland Sommarskog has written on this as well, and in a lot more detail than I&#8217;m going to. His article on <a href="http://www.sommarskog.se/dyn-search-2005.html">dynamic search conditions</a> is well worth reading, as are the rest of his articles.</p>
<p>A typical example of a &#8216;catch-all&#8217; query would be this one, based off a table in the AdventureWorks database.</p>
<pre class="brush: sql;">CREATE PROCEDURE SearchHistory
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity,
TransactionDate, ActualCost from Production.TransactionHistory
WHERE (ProductID = @Product Or @Product IS NULL)
AND (ReferenceOrderID = @OrderID OR @OrderID Is NULL)
AND (TransactionType = @TransactionType OR @TransactionType Is NULL)
AND (Quantity = @Qty Or @Qty is null)
GO</pre>
<p>Now, let&#8217;s say that I run that query and pass values for the ProductID and the Transaction type. Let&#8217;s further say that there&#8217;s a nonclustered index (called idx_TranHistory_TranTypeProductID) on those two columns.</p>
<pre class="brush: sql;">EXEC SearchHistory @Product = 978, @TransactionType = 'W'</pre>
<p>Now this returns 52 rows out of 980000 that are in the table, so we&#8217;d expect that SQL would use an index seek operation on that index, followed by a bookmark lookup.<span id="more-174"></span></p>
<p><img class="alignnone size-full wp-image-223" title="catchall1" src="http://sqlinthewild.co.za/wp-content/uploads/2009/03/catchall1.png" alt="" width="444" height="182" /></p>
<p>Nope. It&#8217;s using that index all right, but it&#8217;s doing a scan, not a seek. Ok, not great, but not bad. Let me try a different set of parameters</p>
<pre class="brush: sql;">EXEC SearchHistory @Qty = 100</pre>
<p>The plan&#8217;s exactly the same. No surprise, it was cached the first time and then reused. There&#8217;s a problem here though, the index that&#8217;s used is completely inappropriate and there&#8217;s a bookmark lookup that ran almost a million times. No wonder this execution took 3 seconds and 2,949,715 IOs to return 29 rows.</p>
<p>Ok, so let me try a different form of the catch-all query</p>
<pre class="brush: sql;">CREATE PROCEDURE SearchHistory_Improved
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity, TransactionDate, ActualCost from Production.TransactionHistory
WHERE (ProductID = CASE WHEN @Product IS NULL THEN ProductID ELSE @Product END)
AND (ReferenceOrderID = CASE WHEN @OrderID IS NULL THEN ReferenceOrderID ELSE @OrderID END)
AND (TransactionType = CASE WHEN @TransactionType IS NULL THEN TransactionType ELSE @TransactionType END)
AND (Quantity = CASE WHEN @Qty IS NULL THEN Quantity ELSE @Qty END)
GO</pre>
<p>Let&#8217;s see what that does for the first test:</p>
<pre class="brush: sql;">EXEC SearchHistory_Improved @Product = 978, @TransactionType = 'W'</pre>
<p><img class="alignnone size-full wp-image-224" title="catchall2" src="http://sqlinthewild.co.za/wp-content/uploads/2009/03/catchall2.png" alt="" width="373" height="129" /></p>
<p>Well that&#8217;s no better. Full blown table scan.</p>
<p>The problem with these types of queries is that there is no stable plan. The optimal plan differs completely depending on what parameters are passed. The optimiser can tell that and it plays safe. It creates plans that will always work. That&#8217;s (one of the reasons) why in the first example it was an index scan, not an index seek.</p>
<p>The downside of the safe plan is that it&#8217;s highly unlikely to be a good plan and, even if it is, it won&#8217;t be good for all possible combinations of parameters.</p>
<p>So, how to handle this type of query? Well, there are typically two ways.</p>
<p><strong>Recompile</strong></p>
<p>This is only an option on SQL 2008. On 2008, if the query is specified with the OPTION (RECOMPILE) hint, then the optimiser knows it doesn&#8217;t have to worry about safe plans because the plan will never be reused. In fact, if I add that hint to the query in the first example, I get the expected index seek.</p>
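<p>As a rough sketch of what that looks like (the procedure name SearchHistory_Recompile is made up for illustration; the body is the first procedure with the hint added at the end of the statement):</p>
<pre class="brush: sql;">-- Hypothetical name; same query as SearchHistory, plus the hint
CREATE PROCEDURE SearchHistory_Recompile
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity,
TransactionDate, ActualCost from Production.TransactionHistory
WHERE (ProductID = @Product Or @Product IS NULL)
AND (ReferenceOrderID = @OrderID OR @OrderID Is NULL)
AND (TransactionType = @TransactionType OR @TransactionType Is NULL)
AND (Quantity = @Qty Or @Qty is null)
OPTION (RECOMPILE) -- compile a fresh plan for the parameter values of each execution
GO</pre>
<p>The trade-off is that the statement is compiled on every execution, which costs some CPU each time.</p>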
<p><strong>Dynamic SQL</strong></p>
<p>The other option is to build up the query string dynamically, based on the parameters passed, and then to use sp_executesql to run it. There are the usual downsides to dynamic SQL but it may be that the performance improvement is worth it.</p>
<pre class="brush: sql;">CREATE PROCEDURE SearchHistory_Dynamic (@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
DECLARE @sSQL NVARCHAR(2000), @Where NVARCHAR(1000) = ''
SET @sSQL = 'SELECT ProductID, ReferenceOrderID, TransactionType, Quantity, TransactionDate, ActualCost
from Production.TransactionHistory '

IF @Product is not null
SET @Where = @Where + 'AND ProductID = @_Product '
IF @OrderID is not null
SET @Where = @Where + 'AND ReferenceOrderID = @_OrderID '
IF @TransactionType IS NOT NULL
SET @Where = @Where + 'AND TransactionType = @_TransactionType '
IF @Qty IS NOT NULL
SET @Where = @Where + 'AND Quantity = @_Qty '

IF LEN(@Where) &gt; 0
SET @sSQL = @sSQL + 'WHERE ' + RIGHT(@Where, LEN(@Where)-3)

EXEC sp_executesql @sSQL,
N'@_Product int, @_OrderID int, @_TransactionType char(1), @_Qty int',
@_Product = @Product, @_OrderID = @OrderID, @_TransactionType = @TransactionType, @_Qty = @Qty

GO</pre>
<p>Note that there&#8217;s no SQL injection vulnerability in this. The parameters are never concatenated into the string and the execution is parametrised.</p>
<p>Now each different set of parameters gets a different cached plan, optimal for that particular set of parameters.</p>
<pre class="brush: sql;">EXEC SearchHistory_Dynamic @Product = 978, @TransactionType = 'W'

EXEC SearchHistory_Dynamic @Qty = 100</pre>
<p>The first gets an index seek, the second a clustered index scan (because there&#8217;s no index on Quantity). Much better than the behaviour with the earlier non-dynamic versions.</p>
]]></content:encoded>
			<wfw:commentRss>http://sqlinthewild.co.za/index.php/2009/03/19/catch-all-queries/feed/</wfw:commentRss>
		<slash:comments>29</slash:comments>
		</item>
		<item>
		<title>On the OUTPUT of a data modification</title>
		<link>http://sqlinthewild.co.za/index.php/2008/12/31/on-the-output-of-a-data-modification/</link>
		<comments>http://sqlinthewild.co.za/index.php/2008/12/31/on-the-output-of-a-data-modification/#comments</comments>
		<pubDate>Wed, 31 Dec 2008 12:35:38 +0000</pubDate>
		<dc:creator>Gail</dc:creator>
				<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[T-SQL]]></category>

		<guid isPermaLink="false">http://sqlinthewild.co.za/?p=124</guid>
		<description><![CDATA[or &#8220;Who needs a trigger anyway?&#8221; The output clause was, I think, one of those wonderful features of SQL 2005 that very few people used, myself included. Now in 2008, it&#8217;s even better, but still doesn&#8217;t appear to be widely used. The output clause can be used to get, as a resultset, data from the [...]]]></description>
			<content:encoded><![CDATA[<p>or &#8220;<em>Who needs a trigger anyway?</em>&#8221;</p>
<p>The output clause was, I think, one of those wonderful features of SQL 2005 that very few people used, myself included. Now in 2008, it&#8217;s even better, but still doesn&#8217;t appear to be widely used.</p>
<p>The output clause can be used to get, as a resultset, data from the inserted and deleted tables that are usually only visible in a trigger. As a very simple example:</p>
<p>Create Table #Testing (<br />
id int identity,<br />
somedate datetime default getdate()<br />
)</p>
<p>insert into #Testing<br />
output inserted.*<br />
default values </p>
<p>Neat. We can get back the inserted values as a result set. We can also insert them into a table variable for later processing. Using the same temp table</p>
<p>declare @OutputTable TABLE (id int, somedate datetime)</p>
<p>insert into #Testing<br />
output inserted.* into @OutputTable<br />
default values</p>
<p>select * from @OutputTable </p>
<p>Very neat. Now how about a practical example? Say we have the following three tables in a database.</p>
<p>Create Table ParentTable (<br />
ID int identity primary key,<br />
ParentDescription varchar(50),<br />
CreationDate DATETIME DEFAULT GETDATE()<br />
)<br />
GO<br />
CREATE TABLE ChildTable (<br />
ID Int identity Primary Key,<br />
ParentID int not null constraint fk_parent foreign key references ParentTable(ID),<br />
Somedescription varchar(20),<br />
SomeValue Money<br />
)<br />
GO</p>
<p>Create Table AuditTable (<br />
AuditID int identity primary key,<br />
ChildID int,<br />
SomeValue Money,<br />
InsertDate DATETIME DEFAULT GETDATE(),<br />
OriginatingLogin VARCHAR(50) DEFAULT ORIGINAL_LOGIN()<br />
)</p>
<p>We get a set of data (perhaps in a temp table, perhaps in an xml document) that needs to be inserted into those tables. The source data will have multiple parent rows, each with multiple child rows. Those need to be inserted into the appropriate tables and the foreign keys have to be assigned correctly. In addition, the ID of the child rows, along with the value and the current date must be written into an audit table, along with the login name of the current user.</p>
<p><span id="more-124"></span>It&#8217;s not a difficult requirement, but because the IDs are assigned when the insert happens, it requires the data be selected back, or the use of a trigger. @@identity (or the other identity functions) can&#8217;t be used because there will be multiple rows.</p>
<p>So, how will the output clause help us here? First, for SQL 2005.</p>
<p>Source data:</p>
<p>CREATE TABLE #SourceData (<br />
ParentDescription Varchar(50),<br />
ChildDescription varchar(50),<br />
TheValue Money<br />
)</p>
<p>insert into #SourceData (ParentDescription, ChildDescription, TheValue)<br />
values ('aaa', 'a1', 1.02)<br />
insert into #SourceData (ParentDescription, ChildDescription, TheValue)<br />
values ('aaa', 'a2', 58.2)<br />
insert into #SourceData (ParentDescription, ChildDescription, TheValue)<br />
values ('aaa', 'b1', 18.42)<br />
insert into #SourceData (ParentDescription, ChildDescription, TheValue)<br />
values ('bbb', 'c1', 0.59)<br />
insert into #SourceData (ParentDescription, ChildDescription, TheValue)<br />
values ('ccc', 'z4', 78.25)<br />
insert into #SourceData (ParentDescription, ChildDescription, TheValue)<br />
values ('ccc', 'z5', 85.2)</p>
<p>Insert code:</p>
<p>DECLARE @Parent Table (id int, Descr Varchar(50))<br />
DECLARE @Child TABLE (id int, parentID int, SomeValue money)</p>
<p>INSERT INTO ParentTable (ParentDescription)<br />
Output Inserted.ID, Inserted.ParentDescription Into @Parent<br />
SELECT DISTINCT ParentDescription from #SourceData</p>
<p>INSERT INTO ChildTable (ParentID, Somedescription, someValue)<br />
Output inserted.ID, Inserted.ParentID, Inserted.someValue into @Child<br />
SELECT ID, ChildDescription, TheValue<br />
FROM @Parent p inner join #SourceData s on p.Descr = s.ParentDescription</p>
<p>INSERT INTO AuditTable (ChildID, SomeValue)<br />
SELECT id, SomeValue<br />
FROM @Child</p>
<p>That&#8217;s fairly nice. No triggers, no need to insert the parents one by one to get at the identity values, no need to reselect from the tables after the insert (which can be expensive if they&#8217;re large).</p>
<p>On SQL 2005, that&#8217;s the best that&#8217;s possible, as the output can only be inserted into a table variable, or returned as a result set.</p>
<p>In 2008, the output from a data modification can be used as the source for another insert statement. There are a lot of restrictions to it, so it&#8217;s not the most useful of features at the moment. Joins are not allowed so the data modification has to be the sole source for the second insert statement. Aggregations are also not allowed. The destination table must not have foreign keys. Still it will allow us to reduce the table variables by one.</p>
<p>With the same source data as the prior example, for SQL 2008:</p>
<p>DECLARE @Parent Table (id int, Descr Varchar(50))</p>
<p>INSERT INTO ParentTable (ParentDescription)<br />
Output Inserted.ID, Inserted.ParentDescription Into @Parent<br />
SELECT DISTINCT ParentDescription from #SourceData</p>
<p>Insert into AuditTable (ChildID, SomeValue)<br />
SELECT id, SomeValue<br />
FROM<br />
(INSERT INTO ChildTable (ParentID, Somedescription, someValue)<br />
Output inserted.ID, Inserted.someValue<br />
SELECT ID, ChildDescription, TheValue<br />
FROM @Parent p inner join #SourceData s on p.Descr = s.ParentDescription) AS i (id, somevalue)</p>
<p>With all the restrictions on the nested inserts, they&#8217;re not as useful as they seem, but it is an interesting technique, has at least a few uses and hopefully will be less restricted in future versions.</p>
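<p>The examples in this post only read from the inserted table. As a minimal sketch of the deleted side, assuming the #Testing temp table from the start of the post still exists and has a row with id 1, an update can return the old and new values side by side, and a delete can return the rows it removed:</p>
<pre class="brush: sql;">-- deleted holds the row as it was before the update, inserted holds it afterwards
UPDATE #Testing
SET somedate = DATEADD(dd, 1, somedate)
OUTPUT deleted.id, deleted.somedate AS OldDate, inserted.somedate AS NewDate
WHERE id = 1

-- For a delete, only the deleted table is available
DELETE FROM #Testing
OUTPUT deleted.*
WHERE id = 1</pre>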
]]></content:encoded>
			<wfw:commentRss>http://sqlinthewild.co.za/index.php/2008/12/31/on-the-output-of-a-data-modification/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>
