Wednesday, August 21, 2013

Query Plans Gone Bad: The Case of the Nested Loop and the Trillion Operations

At work, we have at least three huge ETL (extract, transform, load) jobs. They usually run within a normal two-to-three-hour window, but every once in a while there's a crazy spike, and those spikes ruin my day: the ETL jobs are laughably complex, downstream reports depend on them, and since they run daily, delays can't be allowed to pile up.

What's causing those spikes? The short answer is bad execution plans and bad statistics.

You see, a query (a very simple INSERT query with a single joined table) was holding up one ETL job. The query normally runs in two minutes, but this time it had been running for 8 hours overnight. I started getting "over average" alerts around 6 am. To the left you can see that the 4-core CPU had been pegged until I killed the query. There was almost no disk I/O, and nothing unusual about memory or network traffic -- just a solidly pegged CPU for 8 hours. Boy, that CPU must have been hot.

So I checked out the query plan in the procedure cache. Below you can see the result:

Do you see where it shows a Nested Loops (Inner Join) to the right? That's the one join in this query. Do you see the numbers beside the two tables feeding it? Those are the row counts each table fed into the loop: 1,285,470 and 1,285,471. Well, do you know how a nested loop works? In the worst case (no useful index on the inner table), the number of iterations is the number of rows in table A times the number of rows in table B. Quick math problem: what's 1,285,470 times 1,285,471? The answer is 1,652,434,406,370, or about 1.652 trillion.
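You can check the arithmetic yourself with a one-liner (cast to BIGINT first, since the product overflows a 32-bit INT):

```sql
-- Worst-case nested loop iterations: rows in A times rows in B.
SELECT CAST(1285470 AS BIGINT) * 1285471 AS iterations;
-- Returns 1652434406370, i.e. ~1.652 trillion.
```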

OUCH! SQL Server actually made the incredibly bad decision to choose a plan that iterated 1.652 trillion times! I know this production server's 4-core processor is fast, but no wonder this query was taking upwards of 8 hours.

So why did SQL Server choose this plan? Possibly bad statistics. Here's the relevant portion of the estimated execution plan:

If you look at the right-hand table's properties, Estimated Number of Rows = 1. One row times roughly 1.3 million rows is cheap, especially for a nested loop, so the choice would have been reasonable if there really were one row in the right-hand table -- but there wasn't. Since the estimated plan (which is based on statistics) decided there was only one row in the table on the right-hand side of the join, the query planner chose a nested loop instead of a hash join.
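If you want to see what the optimizer thinks a table looks like, you can inspect the statistics object behind the estimate directly. The table and statistics names below are hypothetical stand-ins for the real ones:

```sql
-- Show the header, density vector, and histogram for a
-- (hypothetical) statistics object on the inner table.
DBCC SHOW_STATISTICS ('dbo.RightHandTable', 'IX_RightHandTable_JoinKey');
```

The "Rows" and "Updated" columns in the header tell you how stale the statistics are relative to the table's actual row count.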

Normally, I could stop the ETL job, update stats, and then continue on my merry way with the rest of the ETL job. I had no such luck this morning, unfortunately. I ran exec sp_updatestats on both our data warehouse and our staging database, but for some reason I didn't get any better query plans -- perhaps a full statistics update of the affected tables would have worked better, but I didn't have time and I knew a quicker solution. Also, I didn't want to flush the procedure cache (a.k.a., the plan cache) because this is a production server.
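For the record, the fuller fix I skipped would have looked something like this (table name hypothetical). Unlike sp_updatestats, which may sample only a fraction of the rows, FULLSCAN reads the entire table:

```sql
-- Rebuild all statistics on the affected table from a full scan
-- rather than a sample. Slower, but far more accurate.
UPDATE STATISTICS dbo.RightHandTable WITH FULLSCAN;
```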

How frustrating! SQL Server does this to me all the time, by the way. I wish Microsoft would change the way statistics work: if statistics are not available or are out of date (which is clearly the case here), then do a quick table scan to estimate how many rows are in the table! If it's more than some threshold, don't choose the nested loop! It seems like statistics default to 1 row when they are bad or missing. Some of my tables have 770 million rows in them! Defaulting to 1 row every time a statistic goes bad is catastrophic.

My solution: I'm forcing the query to use a hash join with the join hint INNER HASH JOIN. Some database administrators might balk at the idea of using query hints, but I say that due to flaws in SQL Server's statistics system and query planner, query hints are an absolute necessity. Incidentally, this is the third time this week that I've been bitten by bad query plans, and it's only Wednesday.
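The fix, sketched with hypothetical table and column names (the real query is just an INSERT with one join), looks like this:

```sql
-- Force a hash join regardless of what the statistics say.
INSERT INTO dbo.TargetTable (JoinKey, Amount)
SELECT s.JoinKey, s.Amount
FROM dbo.SourceTable AS s
INNER HASH JOIN dbo.RightHandTable AS r
    ON s.JoinKey = r.JoinKey;
```

One caveat worth knowing: putting a join hint in the FROM clause also forces the join order as written. An alternative is appending OPTION (HASH JOIN) to the statement, which constrains the physical join type without dictating the order.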

Outcome: The query that didn't finish in 8 hours last night with a nested loop join finished in 2 minutes with a forced hash join. I know that I will never have fewer than 1 million rows in either table so I know that the hash join is the only physical join that makes sense, regardless of SQL Server's bad statistics.
