Friday, September 6, 2013

Random Surrogate Keys vs Sequential Surrogate Keys

I was just reading through (for the first time in history) a pretty good Microsoft training book (Querying Microsoft SQL Server 2012) and on page 43 I read something new to me. "Consider nonsequential keys, such as random ones generated with NEWID..."

Huh? Keys are always sequential, right? Shouldn't they be? We've all been taught that since day one.

But here's a benchmark:

Shocking -- I know!

Anyway, it has to do with the way SQL Server handles indexes (and of course, you'll almost always have an index on your primary key, which is almost always a surrogate key). If you're inserting new rows into a sequential index (which is most likely the case), you're going to add a new page presumably to the end of the index ... over ... and over ... and over ... and over. Why? Because the last page is going to fill up every few inserts or so. That's cool, except it's not because that page is always going to be latched ... so presumably there's no paralellism going on there? And if you're running an I/O subsystem with more than one disk (which most enterprise applications are in one form or another), then you're not going to benefit from those multiple disks.

On the other hand, if you're inserting rows with a random identifier as the key, the insertion point is also random and therefore 1) multiple processes/threads can write to (and read from) the table it at a time because there is no lock on the page they are trying to access 2) page locks/latches won't occur on the hot spot and interfere with each other and 3) and I/O subsystem is not constantly blocked waiting for the last page to be unlatched.

So I learned something today and it completely shocked my way of thinking about SQL Server. The question now becomes this: how do bulk inserts (say from SSIS) deal with this phenomenon? Will a bulk insert into a table with a random key perform better than one with a sequential key? Next time ...