Does Repair_Allow_Data_Loss cause data loss?

Short answer: Yes. That’s why the option has that name.

Long answer:

When CheckDB finds corruption it will, as part of the output, include the minimum level of repair that will fix the errors that it found. In SQL 2005 and higher, it will look something like the following.

CHECKDB found 42 allocation errors and 137 consistency errors in database ‘SomeDB’.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (SomeDB).

There are two options for the minimum repair level: REPAIR_REBUILD and REPAIR_ALLOW_DATA_LOSS.

If the minimum level is REPAIR_REBUILD then it means that the corruption was confined to just the nonclustered indexes. This is good. Nonclustered indexes are redundant data and hence can be dropped and recreated (which is pretty much what CheckDB with the REPAIR_REBUILD option does).

If the minimum level is REPAIR_ALLOW_DATA_LOSS then running CheckDB with the REPAIR_ALLOW_DATA_LOSS option will cause data loss. That’s why the option is named like that. How much it will lose depends on the extent of the corruption.
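
For illustration, here’s a minimal sketch of actually running the repair, using the hypothetical database from the output above (repair requires the database to be in single-user mode, and a backup beforehand is strongly advised):

-- Repair requires the database to be in single-user mode
ALTER DATABASE SomeDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE

DBCC CHECKDB (SomeDB, REPAIR_ALLOW_DATA_LOSS)

ALTER DATABASE SomeDB SET MULTI_USER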

There’s one other case. What about when CheckDB doesn’t specify a minimum level?

CHECKDB found 3 allocation errors and 7 consistency errors in database ‘SomeDB’.

That means that there’s damage to one or more of the allocation pages or there’s corruption in the system tables. That kind of corruption isn’t repairable, so no minimum repair level can be specified.

So, in conclusion, will running CheckDB with the REPAIR_ALLOW_DATA_LOSS option cause data loss? If that’s the minimum level that CheckDB specified, then yes it will.

For more info, see Paul Randal’s blog and an article that I wrote at SQL Server Central.

Edit: There is one case where the repair level repair_allow_data_loss is required, but no data will be lost. This is the case of the incorrect PFS page (see my article at SQLServerCentral for more info). If that is the only error, repair_allow_data_loss will be the minimum level, but no data will actually be lost.

Functions, IO statistics and the Execution plan

It’s no secret that I’m not overly fond of most user-defined functions. This isn’t just a pet hate; I have some good reasons for disliking them. All too often they’re performance bottlenecks, but that can be said about many things in SQL. The bigger problem is that they’re hidden performance bottlenecks that often go overlooked and ignored for too long.

I’m going to start with this fairly simple scalar function, created in the AdventureWorks database:

CREATE FUNCTION LineItemTotal(@ProductID int)
RETURNS money
AS
BEGIN
DECLARE @Total money

SELECT @Total = SUM(LineTotal) FROM Sales.SalesOrderDetail WHERE ProductID = @ProductID

RETURN @Total
END

So, given that function, the following two queries should be equivalent.

SELECT productid, productnumber, dbo.LineItemTotal(productid) as SumTotal
FROM Production.Product p

SELECT productid, productnumber,
(select sum(LineTotal) from sales.SalesOrderDetail where productid = p.productid) AS SumTotal
FROM Production.Product p

No problems so far. They both return 504 rows (in my copy of AW, which has been slightly padded out with more data). Now, let’s look at the execution characteristics by running them again with Statistics IO and Statistics Time on.
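
For reference, those two session options are switched on like this before running each query:

SET STATISTICS IO ON
SET STATISTICS TIME ON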

Query 1, the one with the scalar function:

Table ‘Product’. Scan count 1, logical reads 4, physical reads 0.

SQL Server Execution Times:
CPU time = 47297 ms,  elapsed time = 47541 ms.

Query 2, the one with the correlated subquery:

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘SalesOrderDetail’. Scan count 3, logical reads 22536, physical reads 0.
Table ‘Product’. Scan count 3, logical reads 40, physical reads 0.

SQL Server Execution Times:
CPU time = 1047 ms, elapsed time = 1249 ms.
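
Note how the statistics output for Query 1 doesn’t mention SalesOrderDetail at all; the reads done inside the function are hidden. One place that hidden work does surface is the query stats DMV. A sketch (the TOP and the ordering are just one reasonable choice):

SELECT TOP (5) qs.execution_count, qs.total_logical_reads, qs.total_worker_time, st.text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_worker_time DESC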


Dynamic SQL and SQL injection

When I wrote about the catch-all queries, someone asked why the dynamic SQL that I offered wasn’t vulnerable to SQL injection. I thought I’d go into the whys and the wherefores of that in a little bit more detail.

I’m just going to look at SQL injection from the aspect of dynamic SQL. The front-end code aspect has been dealt with hundreds of times, most recently here – http://www.simple-talk.com/community/blogs/philfactor/archive/2009/03/30/72651.aspx

The most important thing to realise with SQL injection (and with all other forms of command injection) is that it requires a user-inputted string to be incorporated as part of a command that’s going to be executed. Not as part of a parameter value, but as part of the command itself.

Let me show you what I mean.

DECLARE @sSQL varchar(500)
SET @sSQL = 'SELECT * FROM sys.objects'

EXECUTE (@sSQL)

In this exceedingly simple example, there’s no possibility for SQL injection. There’s no user-inputted string that can become part of the command. Let’s look at two slightly more complex examples:

Example 1:

DECLARE @inputParam VARCHAR(100) -- Assume this comes from user input
DECLARE @sSQL varchar(500)

SET @sSQL = 'SELECT * FROM '

IF @inputParam = 'Table1'
SET @sSQL = @sSQL + 'Table1'
IF @inputParam = 'Table2'
SET @sSQL = @sSQL + 'Table2'
IF @inputParam = 'Table3'
SET @sSQL = @sSQL + 'Table3'
IF @inputParam = 'Table4'
SET @sSQL = @sSQL + 'Table4'

EXECUTE (@sSQL)

Example 2:

DECLARE @inputParam VARCHAR(100) -- Assume this comes from user input
DECLARE @sSQL varchar(500)

SET @sSQL = 'SELECT * FROM ' + @inputParam

EXECUTE (@sSQL)
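
To see why Example 2 is the dangerous one, consider what happens with a crafted input. This is a harmless sketch (the injected query is hypothetical, and PRINT is used rather than EXECUTE):

DECLARE @inputParam VARCHAR(100)
-- A malicious user submits this instead of a table name:
SET @inputParam = 'sys.objects; SELECT name FROM sys.sql_logins; --'

-- The concatenation now contains the attacker's extra statement:
PRINT 'SELECT * FROM ' + @inputParam
-- Prints: SELECT * FROM sys.objects; SELECT name FROM sys.sql_logins; --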


Catch-all queries

13 March 2018: There is an updated post on this subject – https://www.sqlinthewild.co.za/index.php/2018/03/13/revisiting-catch-all-queries/

The query form that I refer to as ‘Catch-all’ typically results from search screens in the application where the user may enter any one (or more) of a number of optional parameters. One of the more common ways for such a query to be written in SQL is with multiple predicates in the where clause of the form (WHERE SomeColumn = @SomeVariable OR @SomeVariable IS NULL)

Now, this does work. The problem is that it works fairly inefficiently and, on large tables, can result in really poor query performance. I’m going to take a look at why that is the case and what alternatives there are.

Erland Sommarskog has written on this as well, and in a lot more detail than I’m going to. His article on dynamic search conditions is well worth reading, as are the rest of his articles.

A typical example of a ‘catch-all’ query would be this one, based off a table in the AdventureWorks database.

CREATE PROCEDURE SearchHistory
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity,
TransactionDate, ActualCost from Production.TransactionHistory
WHERE (ProductID = @Product Or @Product IS NULL)
AND (ReferenceOrderID = @OrderID OR @OrderID Is NULL)
AND (TransactionType = @TransactionType OR @TransactionType Is NULL)
AND (Quantity = @Qty Or @Qty is null)
GO

Now, let’s say that I run that query and pass values for the ProductID and the Transaction type. Let’s further say that there’s a nonclustered index (called idx_TranHistory_TranTypeProductID) on those two columns.
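
For reference, such an index might be created as follows (the column order here is an assumption based on the index name):

CREATE NONCLUSTERED INDEX idx_TranHistory_TranTypeProductID
ON Production.TransactionHistory (TransactionType, ProductID)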

EXEC SearchHistory @Product = 978, @TransactionType = 'W'

Now this returns 52 rows out of 980000 that are in the table, so we’d expect that SQL would use an index seek operation on that index, followed by a bookmark lookup.

Nope. It’s using that index all right, but it’s doing a scan, not a seek. Ok, not great, but not bad. Let me try a different set of parameters

EXEC SearchHistory @Qty = 100

The plan’s exactly the same. No surprise, it was cached the first time and then reused. There’s a problem here though, the index that’s used is completely inappropriate and there’s a bookmark lookup that ran almost a million times. No wonder this execution took 3 seconds and 2,949,715 IOs to return 29 rows.

Ok, so let me try a different form of the catch-all query

CREATE PROCEDURE SearchHistory_Improved
(@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
SELECT ProductID, ReferenceOrderID, TransactionType, Quantity, TransactionDate, ActualCost from Production.TransactionHistory
WHERE (ProductID = CASE WHEN @Product IS NULL THEN ProductID ELSE @Product END)
AND (ReferenceOrderID = CASE WHEN @OrderID IS NULL THEN ReferenceOrderID ELSE @OrderID END)
AND (TransactionType = CASE WHEN @TransactionType IS NULL THEN TransactionType ELSE @TransactionType END)
AND (Quantity = CASE WHEN @Qty IS NULL THEN Quantity ELSE @Qty END)
GO

Let’s see what that does for the first test:

EXEC SearchHistory_Improved @Product = 978, @TransactionType = 'W'

Well that’s no better. Full blown table scan.

The problem with these types of queries is that there is no stable plan. The optimal plan differs completely depending on what parameters are passed. The optimiser can tell that and it plays safe. It creates plans that will always work. That’s (one of the reasons) why in the first example it was an index scan, not an index seek.

The downside of the safe plan is that it’s highly unlikely to be a good plan and, even if it is, it won’t be good for all possible combinations of parameters.

So, how to handle this type of query? Well, there are typically two ways.

Recompile

This is only an option on SQL 2008. On 2008, if the query is specified with the OPTION (RECOMPILE) hint, then the optimiser knows it doesn’t have to worry about safe plans because the plan will never be reused. In fact, if I add that hint to the query in the first example, I get the expected index seek.
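
As a sketch, the body of the first procedure with the hint added looks like this:

SELECT ProductID, ReferenceOrderID, TransactionType, Quantity, TransactionDate, ActualCost
FROM Production.TransactionHistory
WHERE (ProductID = @Product OR @Product IS NULL)
AND (ReferenceOrderID = @OrderID OR @OrderID IS NULL)
AND (TransactionType = @TransactionType OR @TransactionType IS NULL)
AND (Quantity = @Qty OR @Qty IS NULL)
OPTION (RECOMPILE)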

Dynamic SQL

The other option is to build up the query string dynamically, based on the parameters passed, and then to use sp_executesql to run it. There are the usual downsides to dynamic SQL, but it may be that the performance improvement is worth it.

CREATE PROCEDURE SearchHistory_Dynamic (@Product int = NULL, @OrderID int = NULL, @TransactionType char(1) = NULL, @Qty int = NULL)
AS
DECLARE @sSQL NVARCHAR(2000), @Where NVARCHAR(1000) = ''
SET @sSQL = 'SELECT ProductID, ReferenceOrderID, TransactionType, Quantity, TransactionDate, ActualCost
from Production.TransactionHistory '

IF @Product is not null
SET @Where = @Where + 'AND ProductID = @_Product '
IF @OrderID is not null
SET @Where = @Where + 'AND ReferenceOrderID = @_OrderID '
IF @TransactionType IS NOT NULL
SET @Where = @Where + 'AND TransactionType = @_TransactionType '
IF @Qty IS NOT NULL
SET @Where = @Where + 'AND Quantity = @_Qty '

IF LEN(@Where) > 0
SET @sSQL = @sSQL + 'WHERE ' + RIGHT(@Where, LEN(@Where)-3)

EXEC sp_executesql @sSQL,
N'@_Product int, @_OrderID int, @_TransactionType char(1), @_Qty int',
@_Product = @Product, @_OrderID = @OrderID, @_TransactionType = @TransactionType, @_Qty = @Qty

GO

Note that there’s no SQL injection vulnerability in this. The parameters are never concatenated into the string and the execution is parametrised.

Now each different set of parameters gets a different cached plan, optimal for that particular set of parameters.

EXEC SearchHistory_Dynamic @Product = 978, @TransactionType = 'W'

EXEC SearchHistory_Dynamic @Qty = 100

The first gets an index seek, the second a clustered index scan (because there’s no index on Quantity). Much better than the behaviour with the earlier non-dynamic versions.

When is a seek actually a scan?

Most people who know SQL execution plans will say, without reservation, that an index seek on a particular index is better than an index scan on the same index. In the vast majority of cases, that’s true, but there are times when what appears in the execution plan as an index seek is actually an index scan.

Let me show an example

CREATE TABLE TestingSeeks (
id int identity (1,1) not null,
SomeStr char(6) default '' -- a filler
)
GO

insert into TestingSeeks (SomeStr)
select top (500000) ''
from sys.columns c1 cross join sys.columns c2

We have a table here with an identity column on it, starting at 1 and incrementing by 1. Hence, there will be no negative values in the table. I’m going to then put a nonclustered index on that column (the table has no clustered index; it’s a heap).

CREATE NONCLUSTERED INDEX idx_Seek ON TestingSeeks (id)

Fair enough. If I query all the rows in the table and retrieve just the ID column, I’ll get a scan on that index, as is pretty much expected, and Statistics IO tells me that 935 pages were read.
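
For reference, a sketch of the two queries involved. The second, with its always-true predicate, is my assumption of where this is heading, given that no id is ever zero or negative:

SELECT id FROM TestingSeeks

SELECT id FROM TestingSeeks WHERE id > 0 -- shows as a seek, but every row qualifies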


Estimated and Actual execution plan revisited

After an interesting discussion on SQLServerCentral last week, I realised that the terms ‘estimated execution plan’ and ‘actual execution plan’ are perhaps a little bit misleading.

The only things estimated about the estimated execution plan are the row counts, costs and row sizes. The plan itself isn’t an estimate. It’s not as if the optimiser, when asked for an estimated plan, does a less thorough job than when asked to compile a plan for a query’s execution.

The two forms of execution plan are better described as ‘execution plan with run-time information’ and ‘execution plan without run-time information’.

When, in Management Studio, someone clicks the ‘display estimated execution plan’ button, the query is submitted to SQL Server, parsed and bound, algebrised and optimised just as if it were going to be executed. But the query is not executed and, as such, the plan when returned contains no run-time information.

If there is a matching cached query plan, that cached plan is what’s returned and no optimisation is done. This can be seen by using profiler with the Cache hit, cache miss and cache insert events being traced.

When, in Management Studio, the query is run with the execution plan option enabled, the query is submitted to SQL Server, parsed and bound, algebrised, optimised and executed. The returned plan does contain the run-time information for that specific execution, hence the plan contains things like ‘actual row count’, ‘actual IO cost’, etc.

If there’s a matching query plan in cache then that cached plan will be used for the query’s execution and will be the one returned, though with the run-time information added.

When a plan is cached, only the compile-time information is cached. The detailed run-time information on the actual number of rows and actual executions is discarded after updating the aggregated query stats. Hence, when you retrieve a query from the plan cache, it will not contain the run-time information. Consider a plan that’s been used 20 times. Which execution’s run-time information would it contain? Remember that there’s only one plan in cache per procedure.

Hence, a plan fetched from cache will be identical to the plan returned by requesting the estimated execution plan for a specific query (providing there’s nothing happening to invalidate the cached plan).
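
To check this yourself, cached plans can be pulled straight from the plan cache DMVs (a sketch; these views are available from SQL 2005 onwards):

SELECT st.text, qp.query_plan
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp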

Profiler can capture (depending on event) the plan without the run-time information or the plan with the run-time information. There’s a nice table in chapter 2 of one of Itzik’s books that shows the various profiler events, when they fire and what they return.

All indexes are unique

Well, that’s a rather contentious title. There are probably several people shaking their heads at this point. Let me explain.

I was listening to a podcast with Kimberly Tripp this morning, and she mentioned this briefly. I thought it would be a good discussion to end a short series on indexes and selectivity.

The Clustered Index

A clustered index has to be unique, because the clustering key acts as the row’s location in the table. If the index is not defined as unique, SQL will make it unique by adding a uniquifier, a 4-byte integer that’s hidden behind the scenes and is added when necessary to make the clustered index unique.

It’s not documented anywhere clearly, but it is mentioned in a couple of places. From MSDN:

If the clustered index is not a unique index, SQL Server makes any duplicate keys unique by adding an internally generated value called a uniqueifier. This four-byte value is not visible to users. It is only added when required to make the clustered key unique for use in nonclustered indexes. SQL Server retrieves the data row by searching the clustered index using the clustered index key stored in the leaf row of the nonclustered index.

So all clustered indexes are unique.
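
A minimal sketch of where a uniquifier gets added (the table and values are hypothetical):

CREATE TABLE UniquifierDemo (SomeKey INT, Filler CHAR(10))
CREATE CLUSTERED INDEX idx_UniquifierDemo ON UniquifierDemo (SomeKey) -- note: not unique

INSERT INTO UniquifierDemo (SomeKey, Filler) VALUES (1, 'a')
INSERT INTO UniquifierDemo (SomeKey, Filler) VALUES (1, 'b') -- duplicate key; this row gets a hidden 4-byte uniquifier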


Index columns, selectivity and inequality predicates

So, following on from my post last week, I’m going to take a look at how selectivity and index column order affect inequality predicates.

One thing to note straight off is that the selectivity of a column is much less important for inequality predicates than it was for equality. For equality predicates, the selectivity alone can give a reasonable idea of the number of rows a particular predicate will return. That’s not the case with inequalities. Also, with inequality predicates, the order of columns in the index becomes very important.

One of the most important considerations with inequality predicates is the number of rows that the predicate will return. An identity column may be highly selective, but if the filter is for all rows > 0 and the identity values start at one, then an index on that column is not going to be very useful.

The other consideration when there are inequality predicates is that only that column and columns to the left of it in the index key can be used for index seeks. Any columns to the right of the column with the inequality are no longer eligible for seeking.

To explain with an example, consider our hypothetical table from the previous post (with one small change):

CREATE TABLE ConsideringIndexOrder (
ID INT,
SomeString VARCHAR (100),
SomeDate DATETIME DEFAULT GETDATE()
);  

As before, there’s a single nonclustered index on all three columns, in the order ID, SomeDate, SomeString.

If there’s an inequality predicate, then the index is only fully seekable for the following queries:
…  WHERE ID = @ID AND SomeDate = @dt AND SomeString > @str
…  WHERE ID = @ID AND SomeDate > @dt
…  WHERE ID > @ID
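
Conversely, a sketch of a combination where the inequality cuts the seek short:

…  WHERE ID = @ID AND SomeDate > @dt AND SomeString = @str
-- SomeString is to the right of the inequality on SomeDate, so it can only be
-- checked as a residual predicate after the seek on ID and SomeDate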


A Bookmark lookup, by any other name…

I think I may have confused some people by talking about bookmark lookups. I’ll attempt to clarify things.

The operator that I’m talking about is the one that fetches extra columns from the clustered index when the nonclustered index that’s used to retrieve the rows doesn’t have all of the columns required.
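
As a sketch, here’s a query shape that typically produces this operator, assuming a hypothetical nonclustered index on TransactionType alone:

-- ActualCost isn't in the index, so each matching row needs a lookup
-- into the clustered index (or heap) to fetch it
SELECT TransactionType, ActualCost
FROM Production.TransactionHistory
WHERE TransactionType = 'W'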

In SQL 2000, that operator appeared in the execution plan as a bookmark lookup and it appeared as follows:

In SQL 2005, the name was changed, and the bookmark lookup no longer appeared in the execution plan. In its place appeared a clustered index seek, joined back to the original index seek by a nested loop join. It appeared as follows (the highlighted operator is the ‘bookmark lookup’):

This change made it harder to see what was going on as clustered index seeks are usually considered ‘good’. The trick to see if it is actually a ‘bookmark lookup’ is to look at the objects involved. When the nonclustered index seek and the clustered index seek are both on the same table, then it’s a ‘bookmark lookup’

I don’t recall what this appeared as when the base table was a heap, not a clustered table.

In SQL 2005 SP2, the name of the operator was changed again, now appearing as a key lookup (when the base table has a clustered index) or a RID lookup (when the base table is a heap). It now looks like this:

The thing to note is that it’s not the version of the server that’s important. The format of the XML for the exec plan has not changed since SQL 2005 RTM (I can and have created a .sqlplan file from SQL 2008 and opened that file in SQL 2005’s management studio).

It’s the version of Management Studio that affects how the execution plans are displayed. If the server is SQL 2005 SP3, but the client tools are still RTM, the bookmark lookup will appear as a clustered index seek. Another reason to patch the client as well as the server.

I hope that clears up some of the confusion around the naming. So, in future, what should I refer to this as? A bookmark lookup? A key lookup?