Syndication

Pass summit abstracts

I missed the Pass summit last November for a number of reasons, but I’m hoping I can attend this year. Given that, I submitted a number of abstracts for consideration.

The limits on number were as follows: Up to 4 abstracts for regular, spotlight or half-day sessions (ie main conference sessions) plus, providing some requirements are met, up to 2 abstracts for pre-con sessions.

I submitted five abstracts, four were ones I’ve been thinking about for some time, the last was a last minute surprise.

Bad plan! Sit!

Bad execution plans are the bane of database performance everywhere they crop up. But what is a bad execution plan? How do you identify one in your system and, once identified how do you go about fixing it?

In this top-rated session from the 24 Hours of PASS we’ll look at some things that make a plan ‘bad’, how you might detect such plans and various methods of fixing the problem, both immediately and long-term.

I wasn’t planning on submitting this one to be honest. It’s one I did for the 24 hours of Pass back in February, and I didn’t think it appropriate to redo at the Summit the same year. Some people at Pass disagreed with me on that. The session was one of the top 3 from 24 hours of Pass and as such got a guaranteed slot at Pass summit, and a chance for a Spotlight slot.

Is that a parameter I smell?

All too often a forum post on erratic query performance is met with a reply ‘Oh, it’s parameter sniffing. You can fix it with .’ The problem with that answer, even if it has identified the cause, is that it’s only part true. Parameter sniffing is not simply a problem that needs fixing, it’s an essential part of well-performing queries. In most cases.

Come to this session to learn what Parameter Sniffing really is and why it’s a good thing, most of the time. Learn how to identify the scenarios where it’s not good, why a feature that is supposed to improve query performance sometimes degrades it, and what your options are for resolving the problems when they do occur.

Performance improvements in 60 min or less. Guaranteed

The system is slow, users cry
It’s impacting our bottom line
In meetings the curses fly
The situation is far from fine

While up on the IT floor
The DBA tears his hair
He’ll soon be shown the door
If he cannot the performance repair,

It often seems, from looking at forum posts and client requests, that the steady-state of databases is ‘too slow’ and all too often the people who are tasked with resolving performance problems are overwhelmed by the shear scope of the problem, and aren’t really sure where to start.

In this demo-heavy session, we’ll look at a fictional company’s database and website and work through finding the worst offenders in terms of poor performance, identifying the causes of the problems and at least starting to get the queries to perform better and the users to stop phoning and complaining.

Yes, part of the abstract is written in verse. It’s something I was threatening to do since last year, and I decided that it shouldn’t harm my chances, much.

Dos and don’ts of database corruption

Database corruption is one of the worst things you can encounter as a DBA. It can result in downtime, data loss, and unhappy users. What’s scary about corruption is that it can strike out of the blue and with no warning. If maintenance is not being done regularly on the database it’s easy for corruption to go unnoticed until it’s too late to repair without losing data.
In this session we’ll look at

  • Easy maintenance operations you should be running right now to ensure the fastest possible identification and resolution of corruption
  • Best practices for handling a database that you suspect may be corrupted
  • Common actions that can worsen the problem.
  • Appropriate steps to take and methods of recovery

I did this session initially for Quest, for their Pain of the Week webcast back in February. I enjoyed doing it a lot, and it got some good feedback, so I decided to submit it for Summit as a contrast to my usual performance-related presentations.

All about Execution Plans

Last, but far from least…

Grant Fritchey (blog | twitter) and I submitted a join pre-con session on execution plans. This is an area Grant is a well-known expert on, having written a book on it, and we’re hoping that we’ll be given the chance to devote a full day to this topic.

The key to understanding how SQL Server is processing your queries is the execution plan.

This full day session focuses on the execution plan. We will start right at the beginning and talk about the compile process. We’ll also go over how, and more importantly, why, plans are stored in cache and how they are removed.

We’ll spend time exploring the key differences between actual and estimated plans, and why those descriptions are more than a little misleading. We’ll also show you assorted methods to obtain a query’s execution plan and what the differences and tradeoffs of each are.

A full day class on execution plans would not be complete without spending time learning to reading them. You’ll learn where to find useful information in execution plans, what the common operators are and how to decipher the sometimes cryptic messages the plans are sending to you. We’ll also debunk some myths surrounding query operators and execution plans.

All of this is meant to further your understanding of how queries work in order to improve the queries you’re responsible for. With this in mind, we’ll show how you can use execution plans to tune queries. All of the information presented will be taken from real world examples. We’ll build on the information through the day so that at the end, after following us through multiple examples at your own computer, you’ll have a stronger understanding of how to read, interpret and actually use execution plans in your day-to-day job.

Sql Saturday #83

imageFrikkie Bosch opened the day with a short keynote giving an overview of Denali, talking mostly from a marketing point of view about the major features coming in the next version.

The virtual sessions were very popular, we don’t get many good international speakers down in our part of the world and the opportunity to listen to some new speakers was well received.

Benjamin kicked off the virtual sessions with his ’10 query optimiser topics for better performance’. He covered parameter sniffing, showing how the optimiser estimates row counts based on statistics and the sniffed value of parameters, cardinality estimation errors (poor stats or missing statistics) and scalar expressions (hard for the optimiser to guess rows affected). Also discussed were the missing index DMVs and DTA (the database tuning advisor)

My session on poorly performing SQL went very well. It’s a session I’ve done several times and enjoy thoroughly every time. The audience was great, lots of good questions, good comments and good fun. I ran a bit short on time to discuss the fixes for the multiple execution path problem, so I’ll just direct anyone reading to a blog post on it: http://www.sqlinthewild.co.za/index.php/2009/09/15/multiple-execution-paths/

Big thanks to all the organisers, all the speakers, all the attendees. Special thanks to the international speakers who either stayed up late or woke early to give their sessions.

SQLSat83-1

SQLSat83-2

SQLSat83-3 SQLSat83-4

SQLSat83-5 SQLSat83-6

SQLSat83-7

SQLSat83-8

SQLSat83-9

SQLSat83-10

Indexing for ORs

All of the indexing strategy posts I’ve written in the past have been concerned with predicates combined with ANDs. That’s only one half of the possibilities though. There’s the case of predicates combines with ORs, and the guidelines for indexing that work well with ANDs don’t work with ORs

When dealing with predicates combined with AND, the predicates are cumulative, each one operates to further reduce the resultset.

For this reason, multi-column indexes support multiple predicates combined with AND operators.

If we look at a quick example, consider the following.

CREATE TABLE Customers (
  CustomerID INT IDENTITY PRIMARY KEY,
  Surname VARCHAR(30) NOT NULL,
  FirstName VARCHAR(30),
  Title VARCHAR(5),
  CustomerType CHAR(1) NOT NULL,
  IsActive BIT DEFAULT 1 NOT NULL,
  RegistrationDate DATETIME NOT NULL DEFAULT GETDATE()
);</pre>
CREATE INDEX idx_Customers_SurnameFirstName ON Customers (Surname, FirstName);

Again I’m going to be lazy and get SQLDataGenerator to generate a few rows.

With that two column index on those columns and a query that looks for Surname = ‘Kelley’ AND Name = ‘Rick’, SQL can do a double column seek to go directly to the start of the range then just read down the index to the end of the range, basically until it finds the first row that it’s not interested in.

So how does that that differ when the operator is an OR?

The main difference is that with an OR, the predicates are independent. The second doesn’t serve to reduce the recordset, but rather to expand it. It’s similar to evaluating two separate predicates and combining the result. Let’s have a look at that 2 column index again when the two predicates are combined with an OR.

SELECT CustomerID
  FROM Customers
  WHERE Surname = 'Kelley' OR FirstName = 'Rick';

If we try to use that index to evaluate Surname = ‘Kelley’ OR Name = ‘Rick’, there’s a problem. While the first of those predicates can be evaluated with a seek (it’s a sargable predicate on the left-most column of an index), the second predicate cannot. It’s sargable, but it is on the second column of the index (and for the moment let’s assume there are no other indexes on the table). Seeks are only possible if the predicate filters on a left-based subset of the index key.

Hence to evaluate that predicate SQL will have to do an index scan. Since it has to do a scan to evaluate the one predicate, it won’t bother also doing a seek to evaluate the first predicate as it can also evaluate that during the scan.

Hence, in this case, the query will execute with a single index scan.

IndexScanWithOr

So how do we get this query to rather seek?

(more…)

Meme Monday: I Got 99 SQL Problems And the Disk Ain’t One

From Tom LaRock’s latest idea

For a change I’m not going to mention performance (well, not more than once anyway) and I’m not going to preach backups or recoverability. Nor am I going to talk about technical issues like many others are doing.

Instead I’m going to briefly mention several (well, 9) problems that I’ve seen a lot in last year or so. These certainly aren’t specific to databases, but I’ve been seeing them with regards to databases, so…

Poor/missing development environment

A development environment that’s months out of date with production, has different databases from production (eg has databases that in the production environment are on different servers) will cause problems. Either code will never work in dev (eg missing linked servers) and hence can’t be properly tested or works fine in dev and fails in production.

Either way, the appropriate response from the senior architect when told that the development environment is months out of date is not "Well, yes. I could have told you that. So what?"

Lax security

When everyone, from developers to help desk to business analysis has full sysadmin access, things are going to go wrong, sooner or later. Probably sooner. From new databases that no one knows anything about, to missing databases, altered settings that break backups or other maintenance, code changes that ‘no one’ made, missing data, etc.

This kind of setup is a bloody pain to fix; without strong support from management it’s near-impossible to fix. Far better to start correctly if possible

Lack of change control

There must be some control over what changes are made, by who, to what and when. Developers should not be just deploying their changes whenever they feel like it with no regard to what other people are doing, whether the system is in use or not, or who is inconvenienced.

Likewise there needs to be some record of what was done, when and why. If no one knows what is being done then when something breaks (and it’s when, not if) there will be no way to tell where to start looking.

 

(more…)

To TOP or not to TOP an EXISTS

Earlier this year I had a look at a query pattern that I often see on forums and in production code, that of the Distinct within an IN subquery. Today I’m going to look at a similar patters, that being the use of TOP 1 within an EXISTS subquery.

Three tests. First a straightforward exists with no correlation (no where clause linking it to an outer query). Second, an exists with a complex query (one with a non-sargable where clause and a group by and having). Third an exists subquery correlated to the outer query.

Table structures are nice and simple, in fact, for ease I’m going to use the same tables as I did back on the exists, in and inner join tests. Code to create and populate the tables it attached to the end of the post.

First up, a simple exists query, in an IF, just to be different.

IF EXISTS (SELECT 1 FROM PrimaryTable_Medium)
PRINT 'Exists'

IF EXISTS (SELECT TOP (1) 1 FROM PrimaryTable_Medium)
PRINT 'Exists too'

For a benchmark, a SELECT 1 FROM PrimaryTable_Medium has the following IO characteristics

Table ‘PrimaryTable_Medium’. Scan count 1, logical reads 89, physical reads 0.

SQL Server Execution Times:
CPU time = 15 ms,  elapsed time = 510 ms.

Ignore the elapsed time, that’s likely mostly from displaying the records. I’m going to focus mostly on the CPU and IO.

Execution plans of the two exists variations are absolutely identical.

TopExists1

The index operators are scans because there is no way they could be anything else, there’s no predicate so a seek is not possible. That said, it’s not a full index scan. The properties of the Index Scan show 1 row only (actual and estimated). So SQL did not read the entire index, just enough to evaluate the EXISTS, and that’s what it did in both cases. IO stats confirm that.

(more…)

Statistics, row estimations and the ascending date column

SQL’s auto-updating statistics go a fair way to making SQL Server a self-tuning database engine and in many cases they do a reasonably good job

However there’s one place where the statistics’ auto-update fails often and badly. That’s on the large table with an ascending column where the common queries are looking for the latest rows.

Let’s have a look at a common scenario.

We have a large table (imaginatively called ‘Transactions’) with a date time column (even more imaginatively called ‘TransactionDate’). This table gets about 80,000 new records a day and currently has around 8,000,000 records in it. So we can say roughly that another 1% is added to the table size daily. No records are updated and there’s a monthly purge of old data so that the total size remains about the same. A reasonably common real-life scenario.

CREATE TABLE Accounts (
AccountID INT IDENTITY PRIMARY KEY,
AccountNumber CHAR(8),
AccountType CHAR(2),
AccountHolder VARCHAR(50),
Filler CHAR(50) -- simulating other columns
)

CREATE TABLE Transactions (
TransactionID INT IDENTITY PRIMARY KEY NONCLUSTERED,
AccountID INT NOT NULL FOREIGN KEY REFERENCES Accounts (AccountID),
TransactionDate DATETIME NOT NULL DEFAULT GETDATE(),
TransactionType CHAR(2),
Amount NUMERIC(18,6),
Filler CHAR(150) -- Simulating other columns
)
GO
CREATE CLUSTERED INDEX idx_Transactions_TransactionDate
ON Transactions (TransactionDate)

CREATE NONCLUSTERED INDEX idx_Transactions_AccountID
ON Transactions (AccountID)

CREATE NONCLUSTERED INDEX idx_Accounts_AccountType
ON Accounts (AccountType)

-- Using RedGate's SQLDataGenerator to generate some data for this.

Accounts Transactions

Day 1 of the month, the indexes have just been rebuilt (after the data purge) and the statistics associated with those have been updated. The latest value in the TransactionDate column is ‘2011/01/31’ and the last value in the statistics histogram is ‘2011/01/31’. Life is good.

Day 2 of the month, there have been 80,000 new records added for the previous day. Only 1% of the table has been updated, so the automatic statistics update would not have triggered. The latest value in the TransactionDate column is ‘2011/02/01’ and the last value in the statistics histogram is ‘2011/01/31’. Doesn’t look like a problem.

Fast forwards another couple of days. Day 5 of the month. By this point 300,000 rows have been added since the beginning of the month. This amounts to around 5% of the table. Hence the statistics auto-update (triggered at 20%) still would not have run. The latest value in the TransactionDate column is ‘2011/02/04’ and the last value in the statistics histogram is ‘2011/01/31’. Starting to look less than ideal.

So, what kind of effect does this have on the queries against that table? (more…)

Q & A from 24 Hours of PASS session

Earlier this week I took part in the 24 Hours of PASS live webcast event. There were far more questions at the end of the presentation than what I could answer online, so the answers to the rest are given here. (I have edited some of the questions for spelling, grammar and readability). My slide deck and demos I was using are linked at the end. I’m not including a backup of the DB, as it’s 60MB. The schema is there, as is a SQL Data Generator project.

Q: Is the problem here that addition of checking if the variable is null? (This was related to catch-all queries. )

No, not specifically. The problem is the query pattern of <column> = <parameter> OR <parameter> = <constant> . It’s not specific to the case where that constant is NULL, though that is the more common form that I’ve seen.

Q: What are the good alternatives for catch-all queries?

If you are on SQL 2008 SP2 or later you can add OPTION(RECOMPILE) to the catch all query and get a plan that is optimal for each particular set of parameters.

On SQL 2005 and earlier, even with a recompile you would not get an optimal plan. This is because the optimizer was required to create a plan that was safe for reuse, even if it was never going to be reused.

On builds on SQL 2008 prior to SP1, the use of OPTION (RECOMPILE) worked, but there was a bug related to that which could produce inaccurate results, so use with caution.

On builds of SQL 2008 between SP1 and SP2 (or SP1 CU5), the behaviour of a catch-all with recompile was the same as for SQL 2005

If you’re using SQL 2005 (or horrors SQL 2000) or a build of SQL 2008 that did not have the recompile options working, you can use dynamic SQL (there are advantages to dynamic SQL even on the latest builds of SQL 2008). Use the parameters passed to build up a string that just filtered on the columns for which parameters were passed, then use sp_executesql to run the parameterised dynamic SQL statement.

Details are available on my blog – http://www.sqlinthewild.co.za/index.php/2009/03/19/catch-all-queries/

Q: If you don’t have access to run the execution plan, but have to have DBAs run it for you, is there a good way to have it captured?

Absolutely. In SQL 2005 and above, the execution plan is an XML document. You can save the plan as a .sqlplan file. That file can then be copied, emailed, etc. You can open it in management studio and SSMS will display the graphical plan.

Or, you could download Plan Explorer and open the plan in that.

(more…)

Full backups, the log chain and the COPY_ONLY option.

There have been a large number of posts made regarding the interaction between full database backups and the log chain (like this one). However there still seems to be a lot of confusion around regarding how they interact. I doubt I can clear up the confusion with yet another blog post, but I’ll try.

One that seems to come up now and again is the COPY_ONLY Option on backups. Typically the comment is something like ‘Use the COPY_ONLY option when taking ad-hoc full backups so that you don’t impact the log backups.’ Now we know from the blog posts linked above that full backups don’t ever break the log chain (and I’m not going to run yet more tests to prove it) so what is the copy only option there for?

Books Online states the following regarding the COPY_ONLY option for backups – “Specifies that the backup is a copy-only backup, which does not affect the normal sequence of backups. A copy-only backup is created independently of your regularly scheduled, conventional backups. A copy-only backup does not affect your overall backup and restore procedures for the database.”

Well, that doesn’t clear things up much. It does however go on to say this: “When used with BACKUP DATABASE, the COPY_ONLY option creates a full backup that cannot serve as a differential base. The differential bitmap is not updated, and differential backups behave as if the copy-only backup does not exist. Subsequent differential backups use the most recent conventional full backup as their base.”

So it’s not the log chain that copy only is there to not affect, it’s the differential base. Let’s test and see how it works.

(more…)

And now for a completely inappropriate use of SQL Server

A while back I wrote up a short introductory overview of Genetic Algorithms. Just for the shear, absolute fun of it, I thought I’d implement a basic genetic algorithm within SQL Server and use it to solve a form of the knapsack problem.

Now first a few comments on this. As the title states, this is not really an appropriate use of SQL Server. Genetic algorithms are generally implemented in languages like Java, C++, C#, etc; languages that are good at complex mathematics, string manipulation and have complex data types. I’m not necessarily using efficient, well-performing methods here, UDFa abound. This is not an article on best practices and well-performing code. I’m also doing no error handling, which I would if this were a real system (in a more suitable language)

Still, doing just for the sake of seeing if it’s possible is all sorts of fun. So, without further ado, the knapsack problem, an approximate solution with genetic algorithms in SQL Server. (For the record, this is a multi-constrained, bounded knapsack problem)

The scenario

There’s a bag that has a maximum volume that it can hold and a maximum mass that it can hold (and we assume that we can pack perfectly with no wasted space). There are eight items, all with different masses, different volumes and different values. The goal here is to maximise the total value of all the items contained within the bag.

CREATE TABLE ObjectStatistics (
  ObjectNumber TINYINT NOT NULL,
  Mass NUMERIC(4,2) NOT NULL,
  Volume NUMERIC(4,2) NOT NULL,
  Value NUMERIC(4,2) NOT NULL,
  NumberAvailable TINYINT NOT NULL,
  CONSTRAINT pk_ObjectStats PRIMARY KEY CLUSTERED (ObjectNumber)
);

CREATE TABLE BagStatistics (
  MaxMass NUMERIC(5,2),
  MaxVolume NUMERIC(5,2)
);

INSERT INTO dbo.ObjectStatistics (ObjectNumber, Mass, Volume, Value, NumberAvailable)
VALUES
  (1,0.5,0.45,2,50),
  (2,1.5,0.25,20,10),
  (3,0.1,1.25,8,150),
  (4,0.25,0.1,1,250),
  (5,0.67,0.3,6,100),
  (6,0.34,0.75,18,5),
  (7,1.25,0.25,5,40),
  (8,1.1,0.8,10,25);

INSERT INTO dbo.BagStatistics (MaxMass, MaxVolume)
VALUES  (100, 75);

Those two tables set up the constraints for the scenario, the maximum mass and volume for the bag and the mass, volume, value and maximum number available for each of the items.

(more…)

Are int joins faster than string joins?

This one comes up a lot on the forums, often as advice given…

“You should use integers for the key rather than strings. The joins will be faster.”

It sounds logical. Surely integers are easier to compare than large complex strings. Question is, is it true?

This is going to take lots and lots of rows to get any significant results. I’m going to do two sets of tests. The first comparing query execution speed between string and integer data types on the join column while keeping the size of the join column the same between the two. This is to tell if there’s a difference just because of the data type.

The second test will have both the data type and the size of the join columns differing, while the total size of the table row will be kept the same. This is to answer the ‘string joins are slower because they are larger’ argument.

Test 1: Same key size, no indexes

The two tables have the same size join column – a bigint in one and a char(8) in the other.

SELECT t1.ID, t2.ID, t1.IntForeignKey, t2.SomeArbStatus
FROM dbo.TestingJoinsInt t1
INNER JOIN dbo.LookupTableInt t2 ON t1.IntForeignKey = t2.ID
GO

SELECT t1.ID, t2.ID, t1.StrForeignKey, t2.SomeArbStatus
FROM dbo.TestingJoinsString t1
INNER JOIN dbo.LookupTableString t2 ON t1.StrForeignKey = t2.ID
GO

First up, a check of what Statistics IO and Statistics Time show for the two and whether there’s any difference in execution plan. (Full repo code is available for download, link at the end of the post)

Int joins

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘TestingJoinsInt’. Scan count 1, logical reads 66036, physical reads 0.
Table ‘LookupTableInt’. Scan count 1, logical reads 735, physical reads 0.

SQL Server Execution Times:
CPU time = 2433 ms,  elapsed time = 32574 ms.

IntJoins1

String joins

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘TestingJoinsString’. Scan count 1, logical reads 66036, physical reads 0.
Table ‘LookupTableString’. Scan count 1, logical reads 735, physical reads 0.

SQL Server Execution Times:
CPU time = 3744 ms,  elapsed time = 33947 ms.

StringJoins1

(more…)