
Are int joins faster than string joins?

This one comes up a lot on the forums, often as advice given…

“You should use integers for the key rather than strings. The joins will be faster.”

It sounds logical. Surely integers are easier to compare than large complex strings. Question is, is it true?

This is going to take lots and lots of rows to get any significant results, so I’m going to do two sets of tests. The first compares query execution speed between string and integer data types on the join column, while keeping the size of the join column the same between the two. This tells us whether there’s a difference purely because of the data type.

The second test will have both the data type and the size of the join columns differing, while the total size of the table row will be kept the same. This is to answer the ‘string joins are slower because they are larger’ argument.

Test 1: Same key size, no indexes

The two tables have the same size join column – a bigint in one and a char(8) in the other.
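The setup script isn’t shown in this excerpt (the full repro code is linked at the end of the post), but a minimal sketch of what the two pairs of tables might look like, matching the column names in the queries below, would be something like this. Everything other than the key columns is an assumption:

-- Hypothetical setup: the keys are the same size in both pairs
-- (a bigint and a char(8) are both 8 bytes). No indexes for Test 1.
CREATE TABLE dbo.LookupTableInt (
  ID BIGINT NOT NULL,
  SomeArbStatus CHAR(10)
);

CREATE TABLE dbo.TestingJoinsInt (
  ID INT IDENTITY,
  IntForeignKey BIGINT NOT NULL
);

CREATE TABLE dbo.LookupTableString (
  ID CHAR(8) NOT NULL,
  SomeArbStatus CHAR(10)
);

CREATE TABLE dbo.TestingJoinsString (
  ID INT IDENTITY,
  StrForeignKey CHAR(8) NOT NULL
);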

SELECT t1.ID, t2.ID, t1.IntForeignKey, t2.SomeArbStatus
FROM dbo.TestingJoinsInt t1
INNER JOIN dbo.LookupTableInt t2 ON t1.IntForeignKey = t2.ID
GO

SELECT t1.ID, t2.ID, t1.StrForeignKey, t2.SomeArbStatus
FROM dbo.TestingJoinsString t1
INNER JOIN dbo.LookupTableString t2 ON t1.StrForeignKey = t2.ID
GO

First up, a check of what Statistics IO and Statistics Time show for the two, and whether there’s any difference in execution plan. (The full repro code is available for download; link at the end of the post.)

Int joins

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘TestingJoinsInt’. Scan count 1, logical reads 66036, physical reads 0.
Table ‘LookupTableInt’. Scan count 1, logical reads 735, physical reads 0.

SQL Server Execution Times:
CPU time = 2433 ms,  elapsed time = 32574 ms.

[Execution plan: IntJoins1]

String joins

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘TestingJoinsString’. Scan count 1, logical reads 66036, physical reads 0.
Table ‘LookupTableString’. Scan count 1, logical reads 735, physical reads 0.

SQL Server Execution Times:
CPU time = 3744 ms,  elapsed time = 33947 ms.

[Execution plan: StringJoins1]

(more…)

Quest Webcast: Dos and Don’ts of Database Corruption

I was guest presenter on last week’s Quest Pain of the Week presentation. The presentation is available for download from Quest. This is a roundup of the questions from the presentation. (I have edited some of the questions for spelling, grammar and readability)

Q: Why do you have to wait for a maintenance window to take a backup? You can back up while the database is being used.

You don’t have to wait, and with a 24×7 system you obviously can’t. But full backups do have an impact even though the database can be in use at the time: they’re IO intensive. If you’re going to run backups while the system is in use, make sure that the additional IO load doesn’t adversely affect user performance, or look into things like SAN-level snapshot backups.

Q: If you partition your data, can you only backup the latest partition?

Yes; however, be mindful of the requirements for restoring from file/filegroup backups. In full or bulk-logged recovery, if the older filegroups are not read-only, you’ll need log backups covering the entire span from the oldest of the file/filegroup backups you use to the newest.

If the partitions with older data are read-only, then this is very easy to do.

For more information see http://msdn.microsoft.com/en-us/library/ms177425.aspx
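To illustrate, a filegroup backup is a small variation on a normal backup statement (the database, filegroup and path names here are made up):

-- Back up only the filegroup holding the newest partition
BACKUP DATABASE SalesDB
FILEGROUP = 'Sales2010'
TO DISK = N'D:\Backups\SalesDB_Sales2010.bak';

-- Or back up only the read-write filegroups, skipping read-only archive filegroups
BACKUP DATABASE SalesDB READ_WRITE_FILEGROUPS
TO DISK = N'D:\Backups\SalesDB_ReadWrite.bak';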

Q: What if you inherit a database that fails the integrity check every day and the backups have the same issue so you don’t know when the integrity check originally failed?

That just means that restoring from backup (full, file/filegroup or page) is not an option for fixing the corruption. It probably also means that there will be no avoiding data loss.

Q: The 825 errors – these are reported in the logs too aren’t they? I can review to see if these are occurring, even if the alert is not set?

Yes, they are (and I don’t know why I thought during the presentation that they aren’t; I’ve written a blog post on them myself): https://www.sqlinthewild.co.za/index.php/2008/12/06/when-is-a-critical-io-error-not-a-critical-io-error/

Msg 825, Level 10, State 2, Line 1.
A read of the file ‘D:\Data\SomeDB.mdf’ at offset 0x00000020e24000 succeeded after failing 1 time(s) with error: incorrect checksum (expected: 0x7532c420; actual: 0x320e4240). Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be corrected. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Q: Can you use Resource Governor for a large DB backup to keep the DB’s performance at an acceptable level?

Resource Governor only limits CPU and memory, not IO.

Q: If you have data loss, shouldn’t you be doing a restore?

Maybe. It could be (due to time to restore or other constraints) that the fastest acceptable solution is to repair (CheckTable or CheckFileGroup) and then sync the missing data in later.

Q: I just had this happen on an archive database that I can play with. The problem occurred with LOBs. Since REPAIR_ALLOW_DATA_LOSS removes the entire page that the corruption is on, how does this work with LOBs that are not stored on the same page as the record?

The entire LOB chain (all LOB pages for the row) will be deallocated and the data/index record which is the parent will be deleted. (Ref: Chapter 11 of SQL Server 2008 Internals by Kalen Delaney et al. Chapter 11 written by Paul Randal)

Q: Is it possible to create a corrupted database in a test lab to get an 825 error?

Error 825 is a hard one to generate. To get it, an IO must fail (or the retrieved page must be damaged), and then on the immediate retry the IO must succeed and the page must be correct. That’s hard to fake. You’d need an IO filter driver coded to intermittently mess up IOs.

For 824, either use a hex editor to edit pages of a test database, or download and restore Paul Randal’s pre-corrupted databases.

For 823, create a database with a secondary file on a flash drive. Create a large table in the filegroup on the flash drive, checkpoint, DBCC dropcleanbuffers, then pull the flash drive out and query the table. If that doesn’t throw the error (meaning the pages are still in memory), update a row and checkpoint.

It goes without saying that neither of these should ever be done on a server!
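For the 823 scenario just described, a rough sketch of the setup (paths and names are made up; it bears repeating that this is for a disposable test instance only):

-- Assumes E:\ is the removable flash drive
CREATE DATABASE CorruptionTest
ON PRIMARY (NAME = CorruptionTest_Data, FILENAME = 'D:\Data\CorruptionTest.mdf'),
FILEGROUP FlashDrive (NAME = CorruptionTest_Flash, FILENAME = 'E:\CorruptionTest_Flash.ndf')
LOG ON (NAME = CorruptionTest_Log, FILENAME = 'D:\Data\CorruptionTest_Log.ldf');
GO

USE CorruptionTest;
CREATE TABLE dbo.BigTable (
  ID INT IDENTITY,
  Filler CHAR(1000) DEFAULT 'abc'
) ON FlashDrive;
GO

INSERT INTO dbo.BigTable DEFAULT VALUES;
GO 10000 -- batch-repeat in SSMS; any method of loading a few thousand rows works

CHECKPOINT;
DBCC DROPCLEANBUFFERS;
-- Pull the flash drive out now, then:
SELECT COUNT(*) FROM dbo.BigTable;
-- If no error (pages still in memory), update a row and checkpoint.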

Q: In the case of a suspect (not recoverable) database, can we update the system table sysdatabases and change the status to bring it online? (This solution used to work in SQL 2000.)

That is a SQL 2000 solution. In SQL 2005 and above, sysdatabases isn’t a table and the system tables cannot be updated as was possible in SQL 2000.

Fortunately there’s a fully documented and supported alternative – ALTER DATABASE <Database name> SET EMERGENCY. Once in emergency mode you can run CheckDB with repair_allow_data_loss and then bring the database online (assuming the repair succeeds), or extract data if the repair fails.
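In outline, that sequence looks like this (SomeDB is a placeholder; the repair requires single-user mode):

ALTER DATABASE SomeDB SET EMERGENCY;
ALTER DATABASE SomeDB SET SINGLE_USER;
DBCC CHECKDB('SomeDB', REPAIR_ALLOW_DATA_LOSS);
ALTER DATABASE SomeDB SET MULTI_USER;
ALTER DATABASE SomeDB SET ONLINE; -- assuming the repair succeeded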

Q: Is Check DB the only tool available?

To check database integrity, yes.

Q: If you’ve got a foreign key, you can restore a table, but will integrity not be broken?

Relational and possibly transactional integrity may be broken. I’d guess that’s one reason table-level restores aren’t available natively in the product.

Q: Checksum on database backup – this is not an option in a maintenance plan, true? To enable it, do I have to script all the database backups?

Correct, it’s not an option on maintenance plans. Personally I’m not a fan of maintenance plans for anything other than trivial environments.
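For scripted backups it’s a single option on the BACKUP statement (names and path made up):

BACKUP DATABASE SomeDB
TO DISK = N'D:\Backups\SomeDB.bak'
WITH CHECKSUM;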

Q: We use some SQL Express instances; do all these items work in Express? I had one where the hard drive died and I was able to get it back from suspect, at which point I did a full backup and restored it elsewhere on a full SQL Server to fix it.

The only limitation for CheckDB on Express is no parallelism.

File/filegroup restores can be done on Express, though they are offline restores.

Page restores can also be done on Express, again as offline restores.


Thanks to Paul Randal and Robert Davis for assistance with a couple of these answers.

Is a clustered index best for range queries?

I see a lot of advice that talks about the clustered index being the best index for range queries, that is, queries with inequality filters that retrieve ranges of rows, as opposed to singleton queries that retrieve single rows (including, unfortunately, a TechNet article).

I suspect the reasoning behind this advice is the idea that the clustered index stores the data in order of the clustering key (ack), and hence that it’s ‘logical’ for such a structure to be best for range scans, as SQL can simply start at the beginning of the range and read sequentially to the end.

Question is, is that really the case?

Let’s do some experiments and find out.

CREATE TABLE TestingRangeQueries (
ID INT IDENTITY,
SomeValue NUMERIC(7,2),
Filler CHAR(500) DEFAULT ''
)

-- 1 million rows
INSERT INTO TestingRangeQueries (SomeValue)
SELECT TOP (1000000) RAND(CAST(a.object_id AS BIGINT) + b.column_id*2511)
FROM msdb.sys.columns a CROSS JOIN msdb.sys.columns b

-- One clustered and two nonclustered indexes on the column that will be used for the range filter

CREATE CLUSTERED INDEX idx_RangeQueries_Cluster
ON TestingRangeQueries (ID)

CREATE NONCLUSTERED INDEX idx_RangeQueries_NC1
ON TestingRangeQueries (ID)

CREATE NONCLUSTERED INDEX idx_RangeQueries_NC2
ON TestingRangeQueries (ID)
INCLUDE (SomeValue)
GO

The query that I’ll be testing with does a sum of the SomeValue column for a large range of ID values. Of the three indexes I’m testing, one is clustered, one is a nonclustered index that does not cover the query, and the third is a covering nonclustered index.

SELECT SUM(SomeValue)
FROM TestingRangeQueries
WHERE ID BETWEEN 20000 and 200000 -- 180 001 rows, 18% of the table

I’m going to run the same range scan query three times, each with an index hint so that SQL will use the three different indexes, regardless of which one it thinks is best.
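For reference, the three hinted variations would look something like this (a sketch; the hint forces the named index regardless of cost):

SELECT SUM(SomeValue)
FROM TestingRangeQueries WITH (INDEX (idx_RangeQueries_Cluster))
WHERE ID BETWEEN 20000 AND 200000

SELECT SUM(SomeValue)
FROM TestingRangeQueries WITH (INDEX (idx_RangeQueries_NC1))
WHERE ID BETWEEN 20000 AND 200000

SELECT SUM(SomeValue)
FROM TestingRangeQueries WITH (INDEX (idx_RangeQueries_NC2))
WHERE ID BETWEEN 20000 AND 200000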

First up, the clustered index.

As expected, we get a clustered index seek (the predicate is SARGable) and a stream aggregate.

[Execution plan: ClusteredIndex]

Table ‘TestingRangeQueries’. Scan count 1, logical reads 12023, physical reads 0.

SQL Server Execution Times:
CPU time = 94 ms,  elapsed time = 110 ms.

(more…)

Distincting an IN subquery

This is going to be a quick one…

I keep seeing forum code (and production code) that includes a DISTINCT in IN or EXISTS subqueries. The rationale is given either as a performance enhancement or as necessary for correct results.

Is it necessary or useful? Only one way to find out.

Let’s check for correct results first, because that can be done with nice small tables.

CREATE TABLE DistinctOuter (
ID INT
);

CREATE TABLE DistinctInner (
ID INT
);

INSERT INTO DistinctOuter
VALUES (1), (2), (3), (4), (5), (6), (7), (8)

INSERT INTO DistinctInner
VALUES (1), (2), (2), (2), (2), (4), (6), (7)
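The comparison, then, is presumably between these two forms of the query (a sketch):

SELECT ID FROM DistinctOuter
WHERE ID IN (SELECT ID FROM DistinctInner)

SELECT ID FROM DistinctOuter
WHERE ID IN (SELECT DISTINCT ID FROM DistinctInner)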

[Screenshot: DistinctIN]

(more…)

Capturing the Execution Plan

One last post on execution plans and Profiler (at least for now)

When trying to check a query’s execution plan, the display execution plan option of Management Studio is usually adequate. However, there are occasions where it’s either not feasible to run the query from Management Studio, or the particular behaviour can’t be reproduced there (perhaps because of different SET options). For cases like this it may be necessary to capture the execution plans via SQL Trace. Fortunately there are events for execution plans. Eight of them, to be precise, all under the Performance folder.

[Screenshot: PlanEvents]

Great, so there’s no shortage of options available. But what are the differences between them?

Showplan All

According to Books Online:

The Showplan All event class occurs when Microsoft SQL Server executes an SQL statement. Include this event class to identify the Showplan operators on Microsoft SQL Server 2000 or Microsoft SQL Server 7.0. This event class will also work on SQL Server 2005 and later; however, the information included is a subset of the information available in the Showplan XML Statistics Profile or Showplan XML event class.

So, in other words, this one is not generally the event that we should be looking at on the later versions of SQL. It’s not deprecated, however (at least not in SQL 2008), so it is still usable on the later versions if you really want.

Showplan

(more…)

Do IF statements cause recompiles?

I heard this one over at SSC a while back. “Avoid IF statements in stored procedures as they result in recompiles”

OK, it sounds reasonable: if the optimiser optimises for the execution path taken on the first execution, it’ll have to go back and optimise the other paths when they are executed.

But is that really how it works? Now that I’ve spent some time looking at how the cache-monitoring trace events behave, it’s possible to find out using those.

Let’s start with a simple example

CREATE TABLE Tbl1 (
  ID INT
);

CREATE TABLE Tbl2 (
  ID VARCHAR(10),
  SomeDate DATETIME
);
GO

CREATE PROCEDURE TestingRecompiles (@SomeParam INT)
AS
IF (@SomeParam = 1)
  SELECT ID FROM Tbl1;
ELSE
  SELECT SomeDate FROM Tbl2;
GO

Simple enough. The first execution will be with a parameter value of 1. I’m going to use Profiler to see what’s happening. The events traced are SP:CacheInsert, SQL:StmtRecompile and Showplan XML For Query Compile, so I can see exactly what plan was generated. I’m using the ‘For Query Compile’ event so that I can catch the plan at optimisation time, not at execution time.
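The test executions would be along these lines (a sketch):

EXEC TestingRecompiles @SomeParam = 1; -- first execution, optimised for this path
EXEC TestingRecompiles @SomeParam = 2; -- takes the other branch; recompile or not?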

[Screenshot: TraceEvents]

(more…)

Recompiles

I’ve looked at cache hit and miss events and at the cache insert and remove events. The only cache-monitoring event (of current interest) left is the recompile event.

There are two recompile events available in Profiler in SQL 2008:

  • SP:Recompile under Stored Procedures
  • SQL:StmtRecompile under T-SQL

Which to use when?

Books Online has the following to say on the two events.

The SQL:StmtRecompile event class indicates statement-level recompilations caused by all types of batches: stored procedures, triggers, ad hoc batches, and queries. Starting in SQL Server 2005, the SQL:StmtRecompile event class should be used instead of the SP:Recompile event class.

and

The SP:Recompile event class indicates that a stored procedure, trigger, or user-defined function has been recompiled. In SQL Server 2005 and later, recompilations reported by this event class occur at the statement level, whereas those in SQL Server 2000 occurred at the batch level.

In SQL Server 2005 and later, the preferred way to trace statement-level recompilations is to use the SQL:StmtRecompile event class. Starting in SQL Server 2005, the SP:Recompile event class is deprecated.

So it appears that they show the same thing, and SP:Recompile is deprecated. That simplifies the entire situation: the only event I’m going to look at, in that case, is SQL:StmtRecompile.

So what does the event look like?

[Screenshot: RecompileEvents]

(more…)

First Impressions

SELECT @@Version

Microsoft SQL Server “Denali” (CTP1) – 11.0.1103.9 (X64)

I’ve been playing with the CTP for a few days (with absolutely no documentation) and there are a few fun and interesting things that I found. These may not be earth-shattering changes, but they are interesting, and I don’t doubt there’s much more fun to come.

Configuration Manager

It’s now much easier to add and remove startup parameters for the service. No more SQL failing to start because you forgot the ; between the end of the location of the master log file and the trace flag that you added.

[Screenshot: Startup parameters]

What’s SQL HADR all about? Wish I knew. It requires a clustered installation and I don’t have a cluster handy to play with (yet). So no investigations of the Denali High availability/Disaster recovery features yet.

New SQL Server features

Sequences

People have been asking for sequences in SQL for ages (1) (2) and workarounds for creating replacements have been around as long (3). Now they’re here.

[Screenshot: Sequences]

So what can we do with these things?
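As a taste, the basic syntax looks something like this (a sketch based on poking at the CTP; details may change before release):

CREATE SEQUENCE dbo.TestSequence
  AS INT
  START WITH 1
  INCREMENT BY 1;
GO

SELECT NEXT VALUE FOR dbo.TestSequence; -- returns 1
SELECT NEXT VALUE FOR dbo.TestSequence; -- returns 2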

(more…)

A Trio of Table Variables

So, it’s the second Tuesday of the month, and that means it’s time for T-SQL Tuesday again. This month it’s hosted by Sankar Reddy and the topic is “Misconceptions in SQL Server”.

I thought I’d tackle a trio of table variable myths and partial truths.

Table Variables are memory-only

This one is pervasive and irritating. It typically goes like this:

You should use table variables rather than temp tables because table variables are memory only.

This myth can be broken down into two parts:

  1. That table variables are not part of TempDB
  2. That table variables are not written to disk

The first is easy to prove and has been done repeatedly. I’m not doing it again. I’m going to tackle the second portion only.

See, one could argue that, even though the table variable is created in the TempDB system tables and allocated pages within the TempDB data file, it is still kept entirely and only in memory. Let’s see if that’s true…

DECLARE @LargeTable TABLE (
id INT IDENTITY PRIMARY KEY,
LargeStringColumn1 CHAR(100),
LargeStringColumn2 CHAR(100)
)

INSERT INTO @LargeTable (LargeStringColumn1, LargeStringColumn2)
SELECT TOP (100000) 'Table Variable Test','T-SQL Tuesday!'
FROM master.sys.columns a CROSS JOIN master.sys.columns b

WAITFOR DELAY '00:01:00' -- so that the table var doesn't go out of scope and get deallocated too quickly.

This is not a massively large table: 100000 rows at 204 bytes per row (excluding header). A query of sys.dm_db_index_physical_stats (which does work on temp tables and table variables) reveals a total page count of 2632. That’s a grand total of 20.6 MB. 20 megabytes. The SQL instance I’m running this on is allowed to use up to 2 GB of memory. No way on earth is this table variable going to cause any form of memory pressure (and I promise there is nothing else running).
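For reference, one way to get that page count from a second session while the WAITFOR above is running (a hack that assumes the table variable’s object is the most recently created table in TempDB):

USE tempdb;
GO
DECLARE @TableVarObjectID INT;
SELECT TOP (1) @TableVarObjectID = object_id
FROM tempdb.sys.tables
ORDER BY create_date DESC;

-- DETAILED mode returns one row per index level; sum page_count for the total
SELECT index_id, page_count
FROM sys.dm_db_index_physical_stats(DB_ID('tempdb'), @TableVarObjectID, NULL, NULL, 'DETAILED');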

So, run that code and, while that waitfor is running, do something that should never be done to a SQL server that you care anything about. (more…)

One wide index or multiple narrow indexes?

Or “If one index is good, surely many indexes (indexes? indices? indi?) will be better”.

This is a question that comes up very often on the forums. Something along the lines of:

I have a query with multiple where clause conditions on a table. Should I create one index for each condition, or one index with all the columns in it?

The question basically boils down to this: which is more optimal, and more likely for the optimiser to pick, a single seek operation against a wide index that seeks on all three conditions in one go, or three seek operations against three narrower indexes followed by a join to get back the final set of rows?

One thing to keep in mind is that one of the jobs of an index is to reduce the number of rows in consideration for a query as early as possible in the query’s execution.

So let’s take a made-up example. Let’s say we have a table with a number of columns in it. A query is run against that table with three conditions in the where clause

WHERE ColA = @A AND ColB = @B AND ColC = @C

Let’s further say that 1000 rows qualify for the condition ColA = @A, 15000 rows qualify for ColB = @B and 30000 rows qualify for ColC = @C. The total number of rows that qualify for all three conditions is 25.

Which sounds like it would be more efficient?

  • Seek on an index with all three columns and retrieve just 25 rows
  • Seek on an index on ColA, retrieve 1000 rows, seek on an index on ColB, retrieve 15000 rows, seek on an index on ColC, retrieve 30000 rows then join the three result-sets together to get the desired 25 rows (called an Index Intersection)

Time for some tests to find out.
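In index terms, the two alternatives being tested presumably look like this (table and index names are made up for illustration):

-- Option 1: one wide index covering all three predicate columns
CREATE NONCLUSTERED INDEX idx_Wide ON dbo.SomeTable (ColA, ColB, ColC)

-- Option 2: three narrow single-column indexes
CREATE NONCLUSTERED INDEX idx_ColA ON dbo.SomeTable (ColA)
CREATE NONCLUSTERED INDEX idx_ColB ON dbo.SomeTable (ColB)
CREATE NONCLUSTERED INDEX idx_ColC ON dbo.SomeTable (ColC)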

(more…)