Syndication

And now for a completely inappropriate use of SQL Server

A while back I wrote up a short introductory overview of Genetic Algorithms. Just for the shear, absolute fun of it, I thought I’d implement a basic genetic algorithm within SQL Server and use it to solve a form of the knapsack problem.

Now first a few comments on this. As the title states, this is not really an appropriate use of SQL Server. Genetic algorithms are generally implemented in languages like Java, C++, C#, etc; languages that are good at complex mathematics, string manipulation and have complex data types. I’m not necessarily using efficient, well-performing methods here, UDFa abound. This is not an article on best practices and well-performing code. I’m also doing no error handling, which I would if this were a real system (in a more suitable language)

Still, doing just for the sake of seeing if it’s possible is all sorts of fun. So, without further ado, the knapsack problem, an approximate solution with genetic algorithms in SQL Server. (For the record, this is a multi-constrained, bounded knapsack problem)

The scenario

There’s a bag that has a maximum volume that it can hold and a maximum mass that it can hold (and we assume that we can pack perfectly with no wasted space). There are eight items, all with different masses, different volumes and different values. The goal here is to maximise the total value of all the items contained within the bag.

CREATE TABLE ObjectStatistics (
  ObjectNumber TINYINT NOT NULL,
  Mass NUMERIC(4,2) NOT NULL,
  Volume NUMERIC(4,2) NOT NULL,
  Value NUMERIC(4,2) NOT NULL,
  NumberAvailable TINYINT NOT NULL,
  CONSTRAINT pk_ObjectStats PRIMARY KEY CLUSTERED (ObjectNumber)
);

CREATE TABLE BagStatistics (
  MaxMass NUMERIC(5,2),
  MaxVolume NUMERIC(5,2)
);

INSERT INTO dbo.ObjectStatistics (ObjectNumber, Mass, Volume, Value, NumberAvailable)
VALUES
  (1,0.5,0.45,2,50),
  (2,1.5,0.25,20,10),
  (3,0.1,1.25,8,150),
  (4,0.25,0.1,1,250),
  (5,0.67,0.3,6,100),
  (6,0.34,0.75,18,5),
  (7,1.25,0.25,5,40),
  (8,1.1,0.8,10,25);

INSERT INTO dbo.BagStatistics (MaxMass, MaxVolume)
VALUES  (100, 75);

Those two tables set up the constraints for the scenario, the maximum mass and volume for the bag and the mass, volume, value and maximum number available for each of the items.

(more…)

Are int joins faster than string joins?

This one comes up a lot on the forums, often as advice given…

“You should use integers for the key rather than strings. The joins will be faster.”

It sounds logical. Surely integers are easier to compare than large complex strings. Question is, is it true?

This is going to take lots and lots of rows to get any significant results. I’m going to do two sets of tests. The first comparing query execution speed between string and integer data types on the join column while keeping the size of the join column the same between the two. This is to tell if there’s a difference just because of the data type.

The second test will have both the data type and the size of the join columns differing, while the total size of the table row will be kept the same. This is to answer the ‘string joins are slower because they are larger’ argument.

Test 1: Same key size, no indexes

The two tables have the same size join column – a bigint in one and a char(8) in the other.

SELECT t1.ID, t2.ID, t1.IntForeignKey, t2.SomeArbStatus
FROM dbo.TestingJoinsInt t1
INNER JOIN dbo.LookupTableInt t2 ON t1.IntForeignKey = t2.ID
GO

SELECT t1.ID, t2.ID, t1.StrForeignKey, t2.SomeArbStatus
FROM dbo.TestingJoinsString t1
INNER JOIN dbo.LookupTableString t2 ON t1.StrForeignKey = t2.ID
GO

First up, a check of what Statistics IO and Statistics Time show for the two and whether there’s any difference in execution plan. (Full repo code is available for download, link at the end of the post)

Int joins

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘TestingJoinsInt’. Scan count 1, logical reads 66036, physical reads 0.
Table ‘LookupTableInt’. Scan count 1, logical reads 735, physical reads 0.

SQL Server Execution Times:
CPU time = 2433 ms,  elapsed time = 32574 ms.

IntJoins1

String joins

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘TestingJoinsString’. Scan count 1, logical reads 66036, physical reads 0.
Table ‘LookupTableString’. Scan count 1, logical reads 735, physical reads 0.

SQL Server Execution Times:
CPU time = 3744 ms,  elapsed time = 33947 ms.

StringJoins1

(more…)

Quest Webcast: Dos and Don’ts of Database corruption

I was guest presenter on last week’s Quest Pain of the Week presentation. The presentation is available for download from Quest. This is a roundup of the questions from the presentation. (I have edited some of the questions for spelling, grammar and readability)

Q: Why do you have to wait for a maintenance window to take a backup? you can backup during the time the database is being used.

You don’t have to wait, and with a 24×7 system you obviously can’t. But full backups do have an impact even though database can be in use at the time. They’re IO intensive. If you’re going to do backups while the system is in use, make sure that the additional IO load doesn’t adversely effect user performance, or look into things like SAN-level snapshot backups.

Q: If you partition your data, can you only backup the latest partition?

Yes, however be mindful of the requirements for restoring from file/filegroup backups. In full or bulk-logged recovery if the older filegroups are not read-only, you’ll need log backups covering the entire span from the oldest of the file/filegroup backups you use to the newest.

If the partitions with older data are read-only, then this is very easy to do.

For more information see http://msdn.microsoft.com/en-us/library/ms177425.aspx

Q: What if you inherit a database that fails the integrity check every day and the backups have the same issue so you don’t know when the integrity check originally failed?

That just means that restore from backup (full, file/filegroup or page) is not an option for fixing. Probably means that there will be no avoiding data loss.

Q: The 825 errors – these are reported in the logs too aren’t they? I can review to see if these are occurring, even if the alert is not set?

Yes they are (and I don’t know why I thought during the presentation that they aren’t, I’ve written a blog post on them myself). https://www.sqlinthewild.co.za/index.php/2008/12/06/when-is-a-critical-io-error-not-a-critical-io-error/

Msg 825, Level 10, State 2, Line 1.
A read of the file ‘D:\Data\SomeDB.mdf’ at offset 0x00000020e24000 succeeded after failing 1 time(s) with error: incorrect checksum (expected: 0x7532c420; actual: 0x320e4240). Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be corrected. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Q: Can you use Resource Governor for a large DB backup to keep the DB’s performance at an acceptable level?

Resource governor only limits CPU and memory, not IO.

Q: If you have data loss, shouldn’t you be doing a restore?

Maybe. It could be (due to time to restore or other constraints) that the fastest and acceptable solution is to repair (CheckTable or CheckFileGroup) and then sync in missing data later.

Q: I just had this happen on an archive database that I can play with. The problem occurred with lob’s. Since allow data loss removes the entire page that the corruption is on how does this work with lob’s that are not stored on the same page as the record?

The entire LOB chain (all LOB pages for the row) will be deallocated and the data/index record which is the parent will be deleted. (Ref: Chapter 11 of SQL Server 2008 Internals by Kalen Delaney et al. Chapter 11 written by Paul Randal)

Q: Is it possible to create a corrupted database in a test lab to get an 825 error?

Error 825 is a hard one to generate. To get that you need for an IO to fail or the retrieved page to be damaged and then on a retry (which occurs immediately) for the IO to succeed and the page to be correct. That’s hard to fake. You’ll need an IO filter driver that is coded to intermittently mess up IOs.

For 824, either use a hex editor to edit pages of a test database, or download and restore Paul Randal’s pre-corrupted databases.

For 823, create a database with a secondary file on a flash drive. Create a large table in the filegroup on the flash drive, checkpoint, DBCC dropcleanbuffers, then pull the flash drive out and query the table. If that doesn’t throw the error (meaning the pages are still in memory), update a row and checkpoint.

It goes without saying that neither of these should ever be done on a server!

Q: In case of db suspect (not recoverable) can we update the system table sysdatabases and change the status to make it online (this solution used to work in SQL 2000)

That is a SQL 2000 solution. In SQL 2005 and above, sysdatabases isn’t a table and the system tables cannot be updated as was possible in SQL 2000.

Fortunately there’s a fully documented and supported alternative – ALTER DATABASE <Database name> SET EMERGENCY. Once in emergency mode you can run CheckDB with repair_allow_data_loss and then bring the database online (assuming the repair succeeds), or extract data if the repair fails.

Q: Is Check DB the only tool available?

To check database integrity, yes.

Q: If you got a foreign key, you can restore a table but will integrity not be broken?

Relational and possibly transactional integrity may be broken. I’d guess that’s one reason table-level restores aren’t available natively in the product.

Q: Checksum on database backup – this is not an option in maintenance plan – true? To enable this, I have to script all database backups?

Correct, it’s not an option on maintenance plans. Personally I’m not a fan of maintenance plans for anything other than trivial environments.

Q: We use some SQL Express, do all these items work in Express? I had one that the hard drive died and I was able to get it back from suspect at which point I did a full backup and restored it elsewhere on a full SQL Server to fix it

The only limitation for CheckDB on Express is no parallelism.

File/filegroup restores can be done on Express, though they are offline restores.

Page restores can be done on Express, though they are offline restores


Thanks to Paul Randal and Robert Davis for assistance with a couple of these answers.

Is a clustered index best for range queries?

I see a lot of advice that talks about the clustered index been the best index for use for range queries, that is queries with inequalities filters, queries that retrieve ranges of rows, as opposed to singleton queries, queries that retrieve single rows (including, unfortunately, a Technet article).

I suspect the reasoning behind this advice is the idea that the clustered index stores the data in order of the clustering key (ack) and hence it’s ‘logical’ that such a structure would be best for range scans as SQL can simply start at the beginning of the range and read sequentially to the end.

Question is, is that really the case?

Let’s do some experiments and find out.

CREATE TABLE TestingRangeQueries (
ID INT IDENTITY,
SomeValue NUMERIC(7,2),
Filler CHAR(500) DEFAULT ''
)

-- 1 million rows
INSERT INTO TestingRangeQueries (SomeValue)
SELECT TOP (1000000) RAND(CAST(a.object_id AS BIGINT) + b.column_id*2511)
FROM msdb.sys.columns a CROSS JOIN msdb.sys.columns b

-- One cluster and two nonclustered indexes on the column that will be used for the range filter

CREATE CLUSTERED INDEX idx_RangeQueries_Cluster
ON TestingRangeQueries (ID)

CREATE NONCLUSTERED INDEX idx_RangeQueries_NC1
ON TestingRangeQueries (ID)

CREATE NONCLUSTERED INDEX idx_RangeQueries_NC2
ON TestingRangeQueries (ID)
INCLUDE (SomeValue)
GO

The query that I’ll be testing with will do a sum of the SomeValue column for a large range of ID values. That means that of the three indexes that I’m testing, one is clustered, one is a nonclustered that does not cover the query and the third is a covering nonclustered index.

SELECT SUM(SomeValue)
FROM TestingRangeQueries
WHERE ID BETWEEN 20000 and 200000 -- 180 001 rows, 18% of the table

I’m going to run the same range scan query three times, each with an index hint so that SQL will use the three different indexes, regardless of which one it thinks is best.

First up, the clustered index.

As expected, we get a clustered index seek (the predicate is SARGable) and a stream aggregate.

ClusteredIndex

Table ‘TestingRangeQueries’. Scan count 1, logical reads 12023, physical reads 0.

SQL Server Execution Times:
CPU time = 94 ms,  elapsed time = 110 ms.

(more…)

Upcoming presentations

It’s looking like it’s going to be a busy year.

It’s only January and I already have three presentations lined up for the first quarter of the year.

  • On the 10th February I’m presenting on database corruption for Quest’s Pain of the Week webcast. I’m not Paul Randal, but I think I might be able to muddle through the topic without making too much of a fool of myself.
  • 27th February I’m speaking at SQLSaturday 65 in Vancouver! First trip to Canada, first SQLSaturday attendance. I’m doing a presentation that’s an old favourite of mine – Introduction to Indexes. It’s very much aimed at beginners, so don’t expect any dark magic or mystical secrets
  • 15th March I’m taking part in the 24 Hours of PASS. The session is titled “Bad plan! Sit” Thanks to Steve Jones (blog|twitter) for the initial idea.It’s going to have something to do with bad execution plans. I haven’t decided exactly what yet. I guess that means it will be a surprise.

In addition to these, I’m also presenting at both the February and March meetings of the South African SQL Server usergroup.

That’s definitely enough work for the next two months…

Distincting an IN subquery

This is going to be a quick one…

I keep seeing forum code (and production code) that includes the DISTINCT in IN or EXISTS subqueries. The rationale is given either as a performance enhancement or as necessary for correct results.

Is it necessary or useful? Only one way to find out.

Let’s check for correct results first, because that can be done with nice small tables.

CREATE TABLE DistinctOuter (
ID INT
);

CREATE TABLE DistinctInner (
ID INT
);

INSERT INTO DistinctOuter
VALUES (1), (2), (3), (4), (5), (6), (7), (8)

INSERT INTO DistinctInner
VALUES (1), (2), (2), (2), (2), (4), (6), (7)

DistinctIN

(more…)

Capturing the Execution Plan

One last post on execution plans and Profiler (at least for now)

When trying to check a query’s execution plan, the display execution plan option of Management Studio is usually adequate, however there are occasions where it’s either not feasible to run the query from Management Studio or the particular behaviour can’t be reproduced in Management Studio (perhaps because of different SET options). For cases like this it may be necessary to capture the execution plans via SQL Trace. Fortunately there are events for execution plans. Eight of them, to be precise, all under the Performance folder.

PlanEvents

Great, so there’s no shortage of options available. But what are the differences between them?

Showplan All

According to Books Online:

The Showplan All event class occurs when Microsoft SQL Server executes an SQL statement. Include this event class to identify the Showplan operators on Microsoft SQL Server 2000 or Microsoft SQL Server 7.0. This event class will also work on SQL Server 2005 and later; however, the information included is a subset of the information available in the Showplan XML Statistics Profile or Showplan XML event class.

So in other words this one is not generally the event that we should be looking at on the latest versions of SQL. It’s not deprecated however (at least not in SQL 2008), so it is still usable on the later versions if you absolutely want.

Showplan

(more…)

Do IF statements cause recompiles?

I heard this one over at SSC a while back. “Avoid IF statements in stored procedures as they result in recompiles”

Ok, it sounds reasonable, if the optimiser optimises for the execution path taken on first execution it’ll have to go back and optimise the other paths when they are executed.

But is that really how it works? Now that I’ve spend some time looking at how the cache monitoring trace events behave, it’s possible to find out using those.

Let’s start with a simple example

CREATE TABLE Tbl1 (
  ID INT
);

CREATE TABLE Tbl2 (
  ID VARCHAR(10),
  SomeDate DATETIME
);
GO

CREATE PROCEDURE TestingRecompiles (@SomeParam INT)
AS
IF (@SomeParam = 1)
  SELECT ID FROM Tbl1;
ELSE
  SELECT SomeDate FROM Tbl2;
GO

Simple enough. First execution will be with the parameter value of 1. I’m going to use Profiler to see what’s happening. Events traced are SP:CacheInsert, T-SQL:StmtRecompile and the XML Plan for query compile, so I can see exactly what plan was generated. I’m using the ‘For Query Compile’ event so that I can catch the plan at optimisation time, not at execution time.

TraceEvents

(more…)

Recompiles

I’ve looked at cache hit and miss events and at the cache insert and remove events. The only cache-monitoring event (of current interest) left is the recompile event.

There are two recompile events available in Profiler in SQL 2008

  • SP:Recompile under Stored Procedures
  • SQL:StmtRecompile under T-SQL

Which to use when?

Books Online has the following to say on the two events.

The SQL:StmtRecompile event class indicates statement-level recompilations caused by all types of batches: stored procedures, triggers, ad hoc batches, and queries. Starting in SQL Server 2005, the SQL:StmtRecompile event class should be used instead of the SP:Recompile event class.

and

The SP:Recompile event class indicates that a stored procedure, trigger, or user-defined function has been recompiled. In SQL Server 2005 and later, recompilations reported by this event class occur at the statement level, whereas those in SQL Server 2000 occurred at the batch level.

In SQL Server 2005 and later, the preferred way to trace statement-level recompilations is to use the SQL:StmtRecompile event class. Starting in SQL Server 2005, the SP:Recompile event class is deprecated.

So it appears that they show the same thing and SP:Recompile is deprecated. That simplifies the entire situation, the only one that I’m going to look at in that case is the SQL:StmtRecompile event.

So what does the event look like?

RecompileEvents

(more…)

First Impressions

SELECT @@Version

Microsoft SQL Server “Denali” (CTP1) – 11.0.1103.9 (X64)

I’ve been playing with the CTP for a few days (with absolutely no documentation) and there are a few fun and interesting things that I found. These may not be earth-shattering changes, but they are interesting, and I don’t doubt there’s much more fun to come.

Configuration manager

It’s now much easier to add and remove startup parameters for the service. No more of SQL not starting because you forgot the ; between the end of the location of the master log file and the traceflag that you added

Startup parameters

What’s SQL HADR all about? Wish I knew. It requires a clustered installation and I don’t have a cluster handy to play with (yet). So no investigations of the Denali High availability/Disaster recovery features yet.

New SQL Server features

Sequences

People have been asking for sequences in SQL for ages (1) (2) and workarounds for creating replacements have been around as long (3). Now they’re here.

Sequences

So what can we do with these things?

(more…)