
Tuesday, December 31, 2013

My New Year's Resolution: Learn About Massively Distributed Database Systems

Database scalability seems to be a problem for many database administrators and data managers. The problem is not "how do I do it" or "what is it" but rather "what do I do with it?" I've worked with the smallest databases on many different traditional RDBMS platforms and I've worked with some pretty huge databases. My last job had several terabyte databases sitting on a single database server -- no clustering. I currently work in an environment with several small databases sitting on an over-sized cluster. Frankly, I think they have it backwards: the big environment should have been clustered and the small environment doesn't need to be. But I digress. The point is, I don't think most database architects know what to do with scalable systems because, frankly, they don't have enough data in their small to medium shops. But that will change -- and soon!

In order to grow and to perhaps work in a bigger shop that has the need for massively distributed data, we have to think bigger and learn about the world outside of our small ponds. Not only that, but our small ponds will be getting bigger. The data landscape is changing quickly.

I'm thinking about this today because I read an article about InfiniSQL (actually, more of a press release) from its principal developer. Some of the concepts didn't make a whole lot of sense to me because I've never worked in an environment that distributes data to multiple geographical locations. Thinking back on the articles I've read recently, it seems like distribution of data is going to be the big problem to solve in the future and it seems like an important niche to dive into. After all, last year's contender for word-of-the-year was Big Data.

But Big Data is conceptual in nature and very few people understand what it really is. Big Data is not a terabyte or two terabytes in a database cluster -- not even ten or twenty or fifty or a hundred. Big Data is more like a hundred terabytes per day. The actual storage systems are dozens to hundreds of petabytes in size. Big Data was actualized by massive players -- Yahoo and Google and maybe even Pixar. I found this out by reading Big Data: A Revolution That Will Transform How We Live, Work, and Think.

What I would like to talk about here is not Big Data. Big Data is going to get bigger -- that's true. Next decade's databases will probably be 10-100 times bigger than the ones we have now, even for small to medium sized firms. Since storage is so cheap, we've all begun to hoard data. But in its most fundamental form, data -- and especially Big Data -- is less than useless!

The time for some is now -- and the time for the rest of us is soon -- when we will actually have to start leveraging our treasure mounds of data for something beyond simple reports. The new buzzwords are Data Visualization, Data Mining, and the like. We don't just need to store the data, we need to extract useful information from it.

Sure, Data Visualization and Data Mining have been around for a while as concepts but they haven't really entered the lexicon of the common database administrator. The reason is that DV and DM have been so difficult to accomplish without a degree in higher mathematics -- seriously! Have you ever wanted to compare the graphs of two time periods to detect their "sameness" or "differentness?" There's a reason we draw graphs and give them to analysts to interpret -- because today the human brain can still do more with two simple graphs than a team of computer scientists can with code. The algorithms and mathematics are out there, for the most part, but the place where data, computer algorithms, and higher mathematics intersect is the trivium (the crossroads) where only a few have been able to tread successfully.

Those people, with that trinity of accessible skills, are the ones who make $400-$500k annually as developers. They make so much money and live such charmed lives because conceptually it's such an exquisitely difficult nirvana to reach. It's hard enough learning how to program, and even harder to understand the intricacies of data, and harder still to master the highest orders of mathematics. Put them together and you have a requirement of 20 years of intense college education peppered with summers of internships and late and lonely nights of coding and experimentation. Unfortunately, twenty years in this industry is a lifetime and everything will have changed in 20 years.

That's why it's important to keep up with the current trends in the data sciences. Sure, there are a few geniuses here and there who have the right mix of skills to command $500k (in a non-executive role) but they are few and far between. For the rest of the world, there are the data scientists who will earn $150k, $125k, $100k, $75k per year. They are not necessarily the ones who develop the database software or the analytics software (although there will be some small-scale development for those who can push themselves). Rather, they will be the ones who learn to listen to the tracks to hear the oncoming trains. In other words, they will be the ones who learn to leverage existing software and push their small shops to the bleeding edge by testing prototype systems in strange and new combinations to see what works and what doesn't.

So here's my New Year's resolution. 2014 will be the year in which I push my skills into new territories. I don't just want to read about Big Data, Data Visualization, Data Mining, Statistical Analysis, Distributed Systems, etc. I want to learn their intricacies and nuances and I want to listen to what the data science community is saying about them. I want to learn about horizontal scaling, vertical scaling, large scale clustering, and anything having to do with storing and mining that data for useful information. Perhaps next year's resolution will be to come up with a problem to solve. Until then!

Friday, December 20, 2013

CONVERT_IMPLICIT in a JOIN or Subquery

Implicit conversion can be a big problem in SQL Server that can slow down your queries by orders of magnitude -- even with relatively low row counts.

First of all, what is implicit conversion? When two values are compared in an expression, the data types should match. If they don't match, SQL Server will implicitly (without being told) convert one data type into the other so that they match. After they match, then SQL Server can make the comparison.
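For example -- and the table and column names here are invented purely for illustration -- comparing an nvarchar literal against a varchar column forces SQL Server to convert the column value on every row it examines:

 -- The N'...' literal is nvarchar, but AccountNumber is varchar; nvarchar has higher
 -- data type precedence, so SQL Server implicitly converts the column, which can
 -- prevent a straightforward index seek on AccountNumber
 SELECT AccountID
 FROM dbo.Accounts
 WHERE AccountNumber = N'10023';

 -- Matching the literal's type to the column avoids the conversion entirely
 SELECT AccountID
 FROM dbo.Accounts
 WHERE AccountNumber = '10023';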

Implicit conversion can be identified and corrected in many cases and it's most easily done in the query itself with explicit CAST and CONVERT statements. But what happens when joining two tables together with keys of different data types?

When sets are overlapped (with a JOIN or with a subquery in an IN operator), each set has a key which correlates them together. If those keys are the same data type (but not simply compatible data types), then indexes which exist on the keys can be directly used to speed up the correlation. However, if the two data types are not the same, SQL Server must implicitly convert the data types and it can no longer use those indexes. So basically, all that work you did to make those indexes finely tuned and optimized gets thrown out the window because they will never be used.

Look at this snippet of an execution plan:


Above, an implicit conversion is happening inside the COMPUTE SCALAR nodes in this execution plan. You can see that almost half a million rows are being scanned from Table3. Notice that the compile time is almost 4 seconds. Client statistics of 10 runs revealed that this query takes an average of 2892ms to run.

COMPUTE SCALAR is actually converting nvarchar(15) in Table3 to char(10), and it's doing it 500,000 times! Is all that work really necessary?! Not at all.

Look what happens when I change the data type of the base table Table3 to char(10). Note that SQL Server will complain about this conversion because it could result in a loss of data, but in my case, I verified that it would not have.
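For reference, the change itself is a one-line ALTER; this is only a sketch, since the real column name is anonymized (SomeKeyColumn is a placeholder):

 -- Widening conversions are silent; narrowing nvarchar(15) to char(10) triggers a
 -- data-loss warning, so verify that the existing values fit before running this.
 -- Re-state NULL/NOT NULL to match the column's current definition.
 ALTER TABLE dbo.Table3
  ALTER COLUMN SomeKeyColumn char(10) NOT NULL;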


In the new execution plan, first notice that the anonymizer has changed Table3 to Table4. More importantly, there is no implicit conversion, so the indexes can be used -- and in fact are being used for seeks. Instead of scanning 500,000 rows, the plan now seeks 112 rows and 7 rows. A tiny change has radically altered the dynamics of this query. Notice the other stats, too, like the fact that it ran in about 3% (roughly 1/40th) of the time. Client statistics confirm the speedup.

Your mileage may vary, of course, but always look out for CONVERT_IMPLICIT!


Thursday, December 5, 2013

Adding Existing Queries and Miscellaneous Files to Source Control in SSMS with MSSCCI

Wow, that title is a mouthful, isn't it? It makes me cringe a little bit that I had to write it!

So, you know what SSMS is, right? SQL Server Management Studio. That's where you write most of your queries. If you've used this product extensively you know that managing your queries can be painful -- you probably rename your query 50 times with some combination of the date or keywords or something similar.
  • MyQuery.sql
  • MyQuery 2013.12.05.sql
  • MyQuery By Week.sql
But if you use TFS (Microsoft's Team Foundation Server), you can quite happily add query projects to source control just like any other development project. 

This post is not about how to set up your source control -- there is a great blog post called Use Team Foundation Server (TFS) as your Source Control in SSMS by Dan Thompson that explains it quite well. It uses an interface called MSSCCI (I'm guessing Microsoft Source Code Control Interface?).

This post is actually about one major oversight in the interface: you can't add existing items, such as SQL files from other locations or query result files that you'd like to keep track of. If you right click on Queries or Miscellaneous you'll notice that you only have the option to create new items. Bummer, right? 

There is a solution -- two, actually. You can either edit the project files manually (not described here) or you can take the easy way out and simply drag and drop your files from a Windows Explorer window.


Awesome, right? Now you have a fairly complete source control solution and you can manage your SQL files and the related miscellaneous files!

Tuesday, December 3, 2013

Transform a Date Range into a Time Series

It's been a few weeks since I last posted anything because I've relocated to a new company. The transition was smooth but I haven't had time to post. Now, however, after having dived right in, I'm back up and running and writing big queries for big reports.

Yesterday I received a data request asking for a time series dataset, but our management system only stores data in ranges. For any given range of time X, they wanted to see the data for that time period broken into a time series -- in this case, by month. Conceptually, it's similar to an UNPIVOT statement in that it takes two columns (a start and an end date) and turns them into X rows, one per month.

Here's an example of what the range looks like (table A):

BEGINDATE ENDDATE DATA
2013-01-01 2013-04-04 8

And here's what the output should look like (table Z):

FIRSTDAYOFMONTH DATA
2013-01-01 8
2013-02-01 8
2013-03-01 8
2013-04-01 8

Brilliant. So the result looks somewhat like a table that has been unpivoted, but after careful analysis, it's not the same transformation. What really has to happen is that, given the number of months in the date range, that's how many rows need to be created.

There are a number of facilities for performing this action in T-SQL, given that T-SQL has many iterative features (cursors, while loops, et cetera) so this problem could easily be solved that way. However, it can also be solved in a single query given that SQL Server includes a number of set-based features that can perform this task. You have to understand the problem and use the right tool for the job.

What is the right tool for the job? Well, after more careful analysis, we realize that the core of this problem is creating a variable number of rows in the output. Are there any tools in SQL Server that will do such a thing? Yes! JOIN keywords are perfectly capable of turning one row into many in a one-to-many relationship. On the left hand side of the join we need a "one" table and on the right hand side of the join we need a "many" table. We have the "one" table -- table A. But we need the "many" table -- table B.

There are several ways to create table B, but the way I prefer is to simply use the ROW_NUMBER window function. Just somehow manage to get a bunch of rows, then number them. Here's a fairly elegant way to do this:

 
 SELECT TOP(1000)
  ROW_NUMBER() OVER( ORDER BY ( SELECT 0 ) ) - 1 AS NUMBER
 FROM
  sys.columns a
 CROSS JOIN 
  sys.columns b
 CROSS JOIN 
  sys.columns c
 CROSS JOIN 
  sys.columns d

It works by creating a huge number of rows and then numbering them. In other databases, such as PostgreSQL, you could theoretically use a function like generate_series(). Or you could create a table valued function in SQL Server that does the same thing. It makes no difference. What does make a difference, however, is the TOP(1000) qualifier. This should be the maximum number of time units you ever expect to see. 1000 months is about 83 years, but 1000 seconds is about 17 minutes.

So now we have table B. All you have to do now is join them together. But wait, table B  returns a bunch of numbers, not months!

The trick, then, is to join on numbers, not months. Does SQL Server have a function for converting months to numbers? In fact it does. DATEDIFF will handle this quite easily.

 SELECT 
  DATEDIFF( MONTH, '2013-01-01', '2013-04-01' )

The result, as you would expect, is 3. There are three months between January and April (and there would be 0 months between January and January of the same year).

So now, you can join on numbers (pseudo-SQL below):

 
 ...

 FROM
  DATA D
 JOIN
  NUMBERS N
 ON
  N.NUMBER = DATEDIFF( MONTH, '2013-01-01', '2013-04-01' )

The problem here, of course, is that the equality join matches only a single number, so you end up with one row. What you ultimately want is a range join, meaning the join can occur on a range of values -- in this case, any number between 0 and DATEDIFF(...), inclusive. The lower bound of 0 is already implied because ROW_NUMBER starts at 1 and we subtracted 1, leaving 0 as the starting point.

 
 ...

 FROM
  DATA D
 JOIN
  NUMBERS N
 ON
  N.NUMBER <= DATEDIFF( MONTH, '2013-01-01', '2013-04-01' )

So now you understand the concept if I've explained myself correctly. You can composite the joins however you want. I prefer to use a CTE to make the code look cleaner, but you could just as easily use any form of derived query (CTE, subquery, table valued function). Here's my final product:

-- create a CTE (common table expression - a "derived query") for the numbers table
-- NOTE: the numbers table will never change unless you need more than 1000 units of time 
-- (i.e., 1000 months) 
WITH 
 NUMBERS
AS
(
 SELECT TOP(1000)
  ROW_NUMBER() OVER( ORDER BY ( SELECT 0 ) ) - 1 AS number
 FROM
  sys.columns a
 CROSS JOIN 
  sys.columns b
 CROSS JOIN 
  sys.columns c
 CROSS JOIN 
  sys.columns d 
),

-- create a CTE (common table expression - a "derived query") for the data
 DATA
AS
(
 SELECT
  '2013-01-01' AS BEGINDATE,
  '2013-04-01' AS ENDDATE,
  8 AS DATA          
)

-- work with the derived queries as if they were tables (whose aliases are NUMBERS and DATA)
SELECT

 -- Step forward N months from the begin date; because BEGINDATE falls on the 1st, the
 -- result is already the first day of each month (SQL Server 2012's new date functions
 -- make a general first-day-of-month calculation easier)
 CAST( DATEADD( MONTH, N.NUMBER, D.BEGINDATE ) AS DATE ) AS FIRSTDAYOFMONTH,
 D.DATA
FROM
 DATA D
JOIN   
 NUMBERS N

-- this is where the magic happens ... join on the numbers table which is simply a sequence
-- of numbers from 0-999 with the difference between the dates in terms of months.
-- So, the difference between January and March is (3-1) = 2, so the join will occur 
-- between 0 and 2, that is 0, 1, and 2 -- thereby creating 3 rows, one for each month
ON 
 N.NUMBER <= DATEDIFF( MONTH, D.BEGINDATE, D.ENDDATE )

-- this simply excludes ranges whose end date is in the future; remove this filter
-- if you want to project months into the future
WHERE
 D.ENDDATE < GETDATE()
ORDER  BY
 FIRSTDAYOFMONTH

Update: Some ideas for this post came from SQL how to convert row with date range to many rows with each date.

Monday, November 4, 2013

SQL Server 2012 SP1 Needs NetFx3 (.NET Framework)

Today I imaged a brand new copy of Windows 8.1 and started installing software. First up, SQL Server 2012 SP1. Halfway through the installation, I received a typically cryptic Microsoft error message:

Microsoft SQL Server 2012 Service Pack 1 Setup

The following error has occurred:

Error while enabling Windows feature : NetFx3, Error Code : -2146498298 , Please try enabling Windows Feature : NetFx3 from Windows management tools and then run setup again. For more information on how to enable Windows features , see http://go.microsoft.com/fwlink/?linkid=227143
Wow, that was particularly bad. It almost seems like it was written by a reptilian. No matter, the message provided almost nothing useful unless one knows how to decode internal Microsoft lingo.

This one says "You need to install .NET framework 3.5, which comes with Windows, but isn't installed by default."

In Windows 8.1, press the Windows key (on your keyboard - the one that looks like a flag between the Ctrl and Alt keys), type the word "features" and after the search is done, you will see "Turn Windows Features On or Off". Click that.

Next, scroll through the list (actually, it's at the very top). Click ".NET Framework 3.5 (includes .NET 2.0 and 3.0)". Then click OK.



After the installation is complete, you should be able to simply continue your SQL Server installation without a problem. Unfortunately, after the feature installation, you need to restart your SQL Server installation (sad face). Note that SQL Server Setup will let you continue -- but the important features you are looking for (the database and management tools) will not be installed and you will need to uninstall and reinstall to get back to what you want. Best to just cancel it and start over!


Thursday, October 17, 2013

Forensic Query Pathology 201: Exceptionally Bad Query Plans

This one is rather difficult to explain without a diagram, so here it is: 


Here is a SELECT query. You can see that it gets a fair number of rows from the base tables/clustered indexes -- approximately 6 million. But then something odd happens. The filter takes out approximately 5 million rows and leaves the query with only 1 row. You are probably saying to yourself, "That's perfectly normal. You've clearly written a WHERE clause that filters out all the rows."

That's reasonable.

Next, the same SELECT query wrapped by an INSERT DML statement:


Interesting. So the query optimizer believes we are inserting a single row into this table, so it chooses nested loop joins for the foreign keys -- and that makes a lot of sense because with only one row, the best join would be a nested loop.

All's well that ends well, right? This INSERT statement takes only four minutes to run because the SELECT grabs 6 million rows, filters it down, and inserts only a single row.

No.

Here's why. It turns out the original query plan was bunk: the Filter completely messed up its estimate. It should have estimated roughly 6 million rows, because that's how many actually passed through the filter. And since 6 million rows in any sort of nested loop join end up multiplying with the number of rows in the joined tables (at least in this case), we actually had something on the order of 3.6 trillion rows to loop through ... So in fact, the query was taking dozens of hours to insert those records.

I am really beginning to hate the SQL Server query optimizer.

So now you're asking yourself whether updating the statistics helped, and the answer is no. We have so many hundreds of millions of rows in the base tables that inserting a fraction of a percent per day during ETL makes the stats engine underestimate the new rows to the point where the estimate is effectively zero. This is a big problem for Microsoft's query planner and an even bigger problem for us. The actual solution was to break the query apart and materialize the intermediate result (in this case into a temp table); the estimate for the insert was then 100% accurate and the query planner selected the proper merge joins/hash joins.
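Here is the general shape of that fix as a minimal sketch -- the table and column names are invented, not the real ones from our ETL:

 -- Materialize the troublesome SELECT into a temp table first; the temp table
 -- carries its own accurate statistics and row count
 SELECT
  s.KeyCol,
  s.Amount
 INTO
  #staged
 FROM
  dbo.BigTable1 AS s
 JOIN
  dbo.BigTable2 AS b
 ON
  b.KeyCol = s.KeyCol;

 -- The INSERT now sees the true row count and picks a sane join strategy
 INSERT INTO dbo.TargetTable ( KeyCol, Amount )
 SELECT
  KeyCol,
  Amount
 FROM
  #staged;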




Wednesday, October 16, 2013

Installing BIDS 2008 R2 with Visual Studio 2008 -- SP1 is definitely Required

A few days ago the hard drive in my work laptop crashed. Yay!

So I went to MicroCenter and got myself a Samsung 840EVO. Yay! (No, the company didn't pay for it because they're cheap like that -- but $100 for my sanity is an easy price to pay -- the time I saved with an SSD install and the peppiness of the system would easily repay itself manifold in terms of my pay, but try to convince a centralized IT department of that).

I downloaded Visual Studio 2008 from MSDN. Then I went to install SQL Server 2008 R2 -- at least the BIDS (Business Intelligence Development Studio, later known as SQL Server Data Tools in SQL Server 2012) portion of it (I have to work on older reports sometimes). But BIDS wouldn't install because it didn't like that Visual Studio 2008 SP1 wasn't installed. Huh? Didn't I get SP1 with my MSDN download? The answer is no!

I poked around for a while and realized after some knuckle dragging that Windows Update doesn't update VS2008 and MSDN doesn't offer a slipstreamed SP1 version. You actually have to manually download VS 2008 SP1 and manually install it. The horror!

Here's what the error messages looked like:

Previous releases of Microsoft Visual Studio 2008 Failed
A previous release of Microsoft Visual Studio 2008 is installed on this computer. Upgrade Microsoft Visual Studio 2008 to the SP1 before installing SQL Server 2008. Visual Studio SP1 can be found at http://go.microsoft.com/fwlink?LinkID=159627.
Installing Service Pack 1 for Visual Studio 2008

Now that I've installed SP1, all is well. But keep in mind it's not an automatic process so you have to do it manually. PS, uninstalling Visual Studio does not fix the problem as this MSKB article describes. The only way to fix it is to install a working copy of VS and then install SP1 on top of it and then install BIDS.

Oh -- and spend the money and get yourself an SSD. For this article I uninstalled and reinstalled VS2008 3 times (including the service pack) in less than an hour! You'll never go back to a spindle again for your primary OS drive -- I guarantee it.

Friday, September 6, 2013

Random Surrogate Keys vs Sequential Surrogate Keys

I was just reading through (for the first time in history) a pretty good Microsoft training book (Querying Microsoft SQL Server 2012) and on page 43 I read something new to me. "Consider nonsequential keys, such as random ones generated with NEWID..."

Huh? Keys are always sequential, right? Shouldn't they be? We've all been taught that since day one.

But here's a benchmark:

http://blog.kejser.org/2011/10/05/boosting-insert-speed-by-generating-scalable-keys/

Shocking -- I know!

Anyway, it has to do with the way SQL Server handles indexes (and of course, you'll almost always have an index on your primary key, which is almost always a surrogate key). If you're inserting new rows into a sequential index (which is most likely the case), you're going to add a new page to the end of the index ... over ... and over ... and over ... and over. Why? Because the last page is going to fill up every few inserts or so. That's cool -- except it's not, because that last page is always going to be latched ... so presumably there's no parallelism going on there. And if you're running an I/O subsystem with more than one disk (which most enterprise applications are, in one form or another), then you're not going to benefit from those multiple disks.

On the other hand, if you're inserting rows with a random identifier as the key, the insertion point is also random and therefore 1) multiple processes/threads can write to (and read from) the table at the same time because there is no contention for a single page, 2) page locks/latches won't pile up on one hot spot and interfere with each other, and 3) the I/O subsystem is not constantly blocked waiting for the last page to be unlatched.
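As a minimal sketch of the two designs (table names invented; I haven't benchmarked these particular tables myself):

 -- Sequential surrogate key: every insert lands on the last page of the clustered index
 CREATE TABLE dbo.OrdersSequential
 (
  OrderID bigint IDENTITY(1,1) NOT NULL,
  Payload varchar(100) NOT NULL,
  CONSTRAINT PK_OrdersSequential PRIMARY KEY CLUSTERED ( OrderID )
 );

 -- Random surrogate key: inserts are spread across the whole index, at the cost of
 -- a wider key and more fragmentation
 CREATE TABLE dbo.OrdersRandom
 (
  OrderID uniqueidentifier NOT NULL DEFAULT NEWID(),
  Payload varchar(100) NOT NULL,
  CONSTRAINT PK_OrdersRandom PRIMARY KEY CLUSTERED ( OrderID )
 );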

So I learned something today and it completely shocked my way of thinking about SQL Server. The question now becomes this: how do bulk inserts (say from SSIS) deal with this phenomenon? Will a bulk insert into a table with a random key perform better than one with a sequential key? Next time ...

Wednesday, August 28, 2013

A Really Good Definition of Database Selectivity

The Problem

Selectivity in a database is a truly important concept -- fundamental, to some degree. But it's also very abstract, weird, and hard to grasp. Of course, once you grasp it, you've got it forever. Database selectivity is like learning to ride a bicycle for the first time, or learning to tie shoelaces, or even learning to walk. So why is it that so many tutorials and blogs and websites completely skip over it or -- even worse -- give some arcane, jargon-filled technical definition? This article is here to help you understand database selectivity from many points of view and with many examples. This article is about one concept and one concept only.

The Definition

Honestly, I'm not even sure a definition is going to be helpful. Words are abstract representations in the first place and using them to define esoteric concepts can cause panic in some learners (like myself). The reason technical definitions are confusing is because words have no inherent meaning on their own -- they, like data in a database, are relational. They are related to memory and conceptual structures in your brain which are connected to the memory and conceptual structures of other words in your brain. If those memories and conceptual structures and connections don't already exist, then you'll have to build them before you can understand a definition.

Look! It's a kitty cat! Meow!
A definition is like a puzzle: if you've got 99% of the picture finished and you are just missing the last piece, great -- you can visualize the big picture and putting the last piece in is easy. 

But what happens when you're first starting out with a bunch of puzzle pieces that have no relationship because they're all jumbled up? Well, you have to spend a lot of time fitting them together, making sure the colors and the shapes line up. It takes time, effort, and perhaps patience to overcome the initial frustration.

Enough of the puzzle metaphor. Instead of one definition, I'm going to give you as many as I can find. 
  1. The degree to which one value can be differentiated within a wider group of similar values.

    This is fairly esoteric. I think this definition makes my head hurt a bit. First of all, what sort of degrees are we talking about? It's a polysemous word. Polysemous is just an adjective that means "a word with many meanings, some of which might be unrelated." Differentiated? Wider group? Without a frame of reference, this one is very confusing.
  2. In SQL, the term selectivity is used when discussing database indexes. The selectivity of a database index is a numeric value that is calculated using a specific formula. That formula actually uses the cardinality value to calculate the selectivity. This means that the selectivity is calculated using the cardinality – so the terms selectivity and cardinality are very much related to each other. Here is the formula used to calculated selectivity: Selectivity of index = cardinality/(number of rows) * 100%

    I actually like this definition quite a bit because it's a mathematical formula. Unfortunately, it requires an understanding of cardinality prior to understanding selectivity. That makes it difficult if you don't have a grasp on cardinality, which is another fundamental concept in databases.
  3. Index selectivity is a number the determines the effectiveness of index. Best possible selectivity is when all table records have a different value for the columns in index (this is typical for primary keys and unique constraints).

    This is another highly technical definition. You need to know all kinds of concepts before you can get anything out of this one. Not only that, but it's badly phrased and written in a non-standard dialect of English. 
I could probably go on for quite some time with definitions from around the interwebs, but frankly I'm tired already. Google has failed me! Either that, or the definitions out there kinda suck.

So here's my definition of database selectivity: In any given set of data, selectivity refers to the percentage of how many unique values exist compared to how many pieces of data exist. 

Here's another way to say it: Selectivity is the ratio between unique values and the number of pieces of data (in any given data set).  

Finally, here's the formula without using other terms you might not be familiar with yet:
Selectivity = number of unique (or distinct) values / total number of values 

Attributes of Selectivity

It really refers to a number, sometimes expressed as a ratio or a percentage. This number is produced based on a formula. The formula is really simple and you should memorize it thoroughly.

High selectivity means the data have a lot of unique values and the number is closer to 1 or 100%. The adjective selective refers to high selectivity in a set of data. You might also see someone write about data being very selective.

Low selectivity means that the data have very few unique values and the number is closer to 0 or 0%. 

Examples


Example Using a Table

It might be helpful to understand what unique or distinct values are. If you examine the table on the right, you'll see three columns worth of data. A child's name, what kind of pet he or she has, and the town he or she lives in. 

How many different children are there? There are 8 different children but there are only 7 names. Notice how John is repeated. It doesn't mean they are the same child, just that they have the same name. Again, there are 8 children, but only 7 distinct names.

How many different kinds of pets are there? There are 3 kinds of pets: Dog, Cat, Fish. However, there are 8 pets in total; several of them just happen to be the same type. In other words, there are 8 pets and only 3 distinct types of pets.

How many different towns are there? In this case, there is only one town: Boise. This problem is a bit different: there is only one town, but there are 8 rows of data.  Selectivity refers to the number of rows or values, not to the number of objects being represented. There is 1 distinct town name listed here, but there are 8 rows. 

Using the definitions and information provided above, can you answer these three questions?
  1. Is the child's name column highly or lowly selective and why?
  2. Is the pet's type column highly or lowly selective and why?
  3. Is the town's name column highly or lowly selective and why? 
Answers: 
  1. Highly selective -- 7 distinct names out of 8 rows gives 7/8 = 87.5%, which is close to 100% (the percentage in these cases being the number of distinct values divided by the total number of values in a column).
  2. Somewhat selective -- 3 distinct pet types out of 8 rows gives 3/8 = 37.5%, which is near the middle, or 50%.
  3. Lowly or not selective -- 1 distinct town out of 8 rows gives 1/8 = 12.5%, which is much closer to 0%. 
Are you starting to see the kind of relationship that defines selectivity? 

Example Using Circles

To the left we have two circles.

The big circle represents all your values.

The little circle represents only the distinct or unique values.

In this particular circle, what is the ratio or percentage of distinct values to all your values? Is it highly selective or lowly selective? Is it closer to 100% or closer to 0%? This diagram represents low selectivity because the percentage is closer to 0%.

In the next set of circles, what is the ratio or percentage? Is this highly or lowly selective? This diagram represents high selectivity because the percentage is closer to 100%.

Example using a Database Query

I'm going to give you a query that you can use to figure out the selectivity of a column. Just replace the column name and the table name with any column or table in your database.

SELECT 
    COUNT( DISTINCT ColName ) AS DistinctValues,
    COUNT( ColName ) AS TotalNumberOfValues,
    -- multiply by 100.0 before dividing so SQL Server doesn't perform integer
    -- division (which would round every result down to 0 or 100)
    COUNT( DISTINCT ColName ) * 100.0 / COUNT( ColName ) AS Selectivity
FROM 
    TableName;

I ran this query on two different columns on the same table in my database. On the primary key (which by its very nature has 100% distinct values) and on a date column.

Looking at the results, you can see that one has a very high selectivity (100%) and the other has a very low selectivity (close to 0%). Can you explain that? If you can, then you are probably getting the hang of selectivity.

Here's another question: which one of the results represents the primary key of the table and why?

Why Does it Matter?

The reason any of this matters is because understanding the selectivity of your columns and being able to calculate it will make your ability to build indexes much more fact- and information-based. You will no longer simply throw an index on a table because you hope it will help the performance. You will use good judgement and knowledge. (Use the force, Luke!)

The fact of the matter is that the higher the selectivity on an index (or in other words, the more unique values it has), the better that index will work for filtering data and joining tables. The reason that's true is the subject of another article altogether, so I will leave that for another day. In the meantime, try to understand selectivity and return to this article and the many more advanced articles on the interwebs to try to deeply ingrain this concept in your databasing soul! 


Tuesday, August 27, 2013

SSRS and the Fountain of Useless Errors: Conversion from type 'CalculatedFieldWrapperImpl' to type 'Integer' is not valid

Hey, I'm no novice. I've been doing this computerish stuff for a good long while. But Microsoft SQL Server Reporting Services (SSRS) stumps me on a regular basis. I use BIDS (Business Intelligence Design Studio) and SSDT (SQL Server Data Tools) on a regular basis -- weekly, sometimes daily. It seems like it gives the worst, most cryptic, most useless error messages of all time. Well, besides the error messages that come from Linux, of course. 

Here's the doozie I got today:

The Sort expression for the grouping 'FUBAR' contains an error: Conversion from type 'CalculatedFieldWrapperImpl' to type 'Integer' is not valid.


So what had happened was this ... I had built a query (a view actually) in SQL Server and added it to the report and used it in a matrix. Then I decided I needed more columns, so I modified the view and then modified the data set. Then I tried to use the new columns to sort a group. When I tried to preview, I got this gloriously useless message.

I tried a lot. Googling, Binging, whatever else one does in these situations. I took a coffee break and thought about it. I then realized that perhaps since I changed the source of the data that the data cache had gone bad. So I found the file report.rdl.data in my project folder and deleted it. 

That did the trick.

Remember kids, anytime you change the underlying data source, delete the cache from BIDS or SSDT. It will cure a huge number of problems in SSRS.

Thursday, August 22, 2013

SQL Server Database Mail: Make sure Database Mail Procedures, Profiles, and Operators Exist

If you'd like to send an email from SQL Server, you can do it using database mail, presuming that database mail is installed.

If you'd like to send a notification to an operator from a stored procedure or any other T-SQL code, you should use the procedure sp_notify_operator.
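For reference, a call looks roughly like this -- the profile and operator names below are placeholders that must already exist on your server:

 EXEC msdb.dbo.sp_notify_operator
  @profile_name = N'Default Mail Profile', -- an existing Database Mail profile
  @name         = N'DBA Team',             -- an existing SQL Server Agent operator
  @subject      = N'Nightly ETL finished',
  @body         = N'The nightly ETL job completed successfully.';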

But what happens if database mail isn't installed? You will get an error that the procedure doesn't exist. You could try to catch that error with a BEGIN TRY...BEGIN CATCH block, but you could also test to see if the procedure exists.
 
 IF EXISTS
 (
  SELECT
       *
  FROM
       msdb.sys.objects
  WHERE
       type = 'P'
       AND name = 'sp_notify_operator' 
 )
 BEGIN

  -- do something here, such as calling the procedure

 END

You could also check to see if the email profile or the operator exists.

 -- Make sure the profile exists
 IF NOT EXISTS
 ( 
  SELECT
      *
  FROM
   msdb.dbo.sysmail_profile
  WHERE
   name = @email_profile_name
 )
 BEGIN
 
  RAISERROR( 'That database mail profile does not exist.', 16, 1 );
  RETURN;

 END

 -- Make sure the operator exists
 IF NOT EXISTS
 ( 
  SELECT
      *
  FROM
      msdb.dbo.sysoperators
  WHERE
   NAME = @operator_name 
 )
 BEGIN
 
  RAISERROR( 'That operator name does not exist.', 16, 1 );
  RETURN; 

 END

Wednesday, August 21, 2013

SQL Server and the Quest for the Intermediate Materialization Query Hint

The Adventure

I feel like I'm Indiana Jones and I'm hunting for a lost treasure: intermediate materialization. It seems like I do that a lot in SQL Server -- I'm probably on my 5th or 6th sequel by now (get it? SQL? SEQUEL? sequel?). Most of the time there is some kind of awful kluge to work around the SQL Server team's stubbornness. They wrote SQL Server that way, and it's just fine the way it is!

An example that's unrelated: paging results. In the old days, paging results was a nightmare that had to be done with the TOP keyword, sorting, and several subqueries. I shudder for those days. Then the window functions came and ROW_NUMBER() saved the day. Finally, in SQL Server 2012, Microsoft's dev team gave in and we got the OFFSET-FETCH keywords. (Funny, since MySQL and PostgreSQL have been making our lives easier with such keywords as LIMIT and OFFSET for ... decades?)

But one thing that doesn't seem to be on the table yet is (dun dun duuuun!) intermediate materialization. Sure, Microsoft has said they'll [try to think about maybe] support a query hint in the future for intermediate materialization, but that request was made in 2006 (!!!!!) and still hasn't arrived yet.

The Problem

Let's say you're diving into execution plans pretty deeply. You find there are some problems inherent to SQL Server's query optimizer -- it just doesn't do what you want it to do. 

You find that you can run some part of your query in a new connection window (that will later become a subquery). It runs in 3 seconds instead of 90 minutes. Voila! You have the answer. All you have to do now is to integrate the subquery into the main query. 

The problem starts after you stick your standalone query into a subquery in your view and you join it to the main query. Then you run it and you realize that it's now running at 90 minutes again! How frustrating, but why?

The reason, you discover after analyzing the query plan, is that the query optimizer has rewritten your subquery into a series of joins, undoing all the hard work you did by creating a subquery in the first place (and taking the query from 3 seconds back to 90 minutes in the process).

You know there are such things as query hints, so you go off looking for a query hint to solve this problem. The problem is, you don't know what this problem is called. You search for subquery optimization in SQL Server and stop SQL Server from optimizing my subquery and SQL Server is rewriting subqueries and on and on until your fingers get numb. Then you discover that subqueries are sometimes called virtual queries so you search for SQL Server rewrites virtual queries. At some point you stumble upon some articles relating to materialized queries and you remember that you've heard those words before but you don't know what they mean because your database professor covered that topic on the Monday morning after the big football game. 

What is Materialization?

In databases, materialization is the process of gathering data from physical sources and physically storing the result in the form of a relation (a table) -- a snapshot of the data needed at that moment in time. A query or view whose results have been computed and stored this way is referred to as materialized. Wikipedia has a good explanation of a materialized view.

It seems abstract now, but it's really a simple concept. It's like arrays or objects. Eventually you'll just get it. I suggest you try really hard to understand it now so that the learning will come faster because this is a really useful concept in your database life. It's like playing Super Mario Brothers. You just couldn't get past level 8-4 the night you tried for 6 hours, but the next morning after a good night's sleep, you got through it in one try.

What is Intermediate Materialization? 

Intermediate materialization is when query results must be stored temporarily before processing the remainder of the query. In other words, SQL Server has to stop joining pre-existing data sources, create a temporary table, and then join to that temporary table. 

Let me ask you a question: have you ever created a temporary table from a SELECT statement and then joined that temporary table to another table? If your answer is yes, then you have just carried out the poor man's version of intermediate materialization.

The problem is that SQL Server does not support intermediate materialization ... and in fact, if other bloggers are to be believed, the SQL Server dev team is actively trying to make it harder for us to use intermediate results in a single query. [aside: I have a theory that the reason Microsoft doesn't want to add this query hint is because it would create a non-relational phase in the logical processing steps of executing queries.]

Solution

The old-fashioned way to do this was to use TOP 100 PERCENT, but evidently that stopped working in SQL Server 2005. Adam Machanic suggests using TOP ( 2147483647 ) instead, but that suggestion was made for SQL Server 2005, and who knows whether it still works in 2008 R2 or 2012?

Update 2013.08.28: Yes, it does seem to work in 2008 R2. Your mileage may vary. 
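Here is a minimal sketch of the trick applied to a derived table. The table and column names are invented, and the optimizer is free to undo the effect, so treat it as a heuristic rather than a guarantee:

 SELECT
  m.SomeKey,
  t.SomeValue
 FROM
  dbo.MainTable AS m
 JOIN
 (
  -- The TOP + ORDER BY pair encourages SQL Server to evaluate this derived
  -- table on its own instead of rewriting it into the outer query
  SELECT TOP ( 2147483647 )
   SomeKey,
   SomeValue
  FROM
   dbo.ExpensiveSubquery
  ORDER BY
   SomeKey
 ) AS t
 ON
  t.SomeKey = m.SomeKey;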

Another user on stackoverflow suggests using the CASE statement (and might I suggest that you should probably use it on your join keys) and I've had some luck with that in the past. Keep in mind, however, that if you're joining millions of rows that each one of them has to be cast and processed as an expression and that could slow your query down in its own right.

Example

I had a query that was taking 80 minutes. I tracked its duration over time and it was slowly increasing. I optimized the subquery so that it took 5 minutes but in the entire query the subquery was being optimized as a series of joins. After tricking SQL Server into intermediate materialization, I was able to produce the query results consistently in about 6 minutes. This chart demonstrates the massive savings I was able to get by using intermediate materialization. Your mileage may vary!

Query Plans Gone Bad: The Case of the Nested Loop and the Trillion Operations

At work, we have at least three huge ETL (extract, transform, load) jobs. They usually run within a normal 2-3 hour time frame. Yet every once in a while there's a crazy spike and those spikes ruin my day because the ETL jobs are laughably complex and there are downstream reports. Plus these ETL jobs run daily, so they can't pile up ...

What's causing those spikes? The short answer is bad execution plans and bad statistics.

You see, a query (a very simple INSERT query with a single joined table) was holding up one ETL job. The query normally runs in two minutes, but this time it had been running for 8 hours overnight. I started getting "over average" alerts around 6 am. To the left you can see that the 4-core CPU had been pegged until I killed the query. There was almost no disk I/O and there was nothing unusual about the memory or the network. Just a solidly pegged CPU for 8 hours. Boy, that CPU must have been hot.

So I checked out the query plan in the procedure cache. Below you can see the result:



Do you see where it shows a Nested Loops (Inner Join) to the right? That's the one join in this query. Do you see where the two tables feeding it have numbers beside them? Those are the number of rows that each table fed into the loop. 1,285,470 and 1,285,471 ... Well, do you know how a nested loop works? Essentially the formula for the number of iterations is number of rows in table A times number of rows in table B. Quick math problem: what's 1,285,470 times 1,285,471? The answer is 1,652,434,406,370 or 1.652 trillion.

OUCH! SQL Server actually made the incredibly bad decision to choose a plan that iterated 1.652 trillion times! I know this production server's 4-core processor is fast, but no wonder this query was taking upwards of 8 hours.

So why did SQL Server choose this plan? Possibly bad statistics. Here's the relevant portion of the estimated execution plan:



If you look at the right hand table's properties, Estimated Number of Rows = 1. So, 1 times 1 million is not so bad especially for a nested loop. That would have been reasonable assuming there was actually one row in the right hand table, but there wasn't. Since the estimated plan (which is based off statistics) decided there was only one row in the table on the right hand side of the join, the query planner chose a nested loop instead of a hash join.

Normally, I could stop the ETL job, update stats, and then continue on my merry way with the rest of the ETL job. I had no such luck this morning, unfortunately. I ran exec sp_updatestats on both our data warehouse and our staging database, but for some reason I didn't get any better query plans -- perhaps a full statistics update of the affected tables would have worked better, but I didn't have time and I knew a quicker solution. Also, I didn't want to flush the procedure cache (a.k.a., the plan cache) because this is a production server.

How frustrating! SQL Server does this to me all the time, by the way. I wish Microsoft would change the way statistics work: if statistics are not available or are out of date (which is clearly the case here), then do a quick table scan to estimate how many rows are in the table! If it's more than some threshold, don't choose the nested loop! It seems like statistics default to 1 row when they are bad or missing. Some of my tables have 770 million rows in them! Defaulting to 1 row every time a statistic goes bad is catastrophic.

My solution: I'm forcing the query to use a hash join with the hash join query hint: INNER HASH JOIN. Some database administrators might balk at the idea of using query hints, but I say that due to flaws in SQL Server's statistics system and query planner, query hints are an absolute necessity. Incidentally, this is the 3rd time this week that I've been bitten by bad query plans and it's only Wednesday.
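As a rough illustration -- the table and column names are invented, not the ones from our ETL -- the hint goes right in the join clause:

 INSERT INTO dbo.FactTable ( KeyCol, Amount )
 SELECT
  s.KeyCol,
  s.Amount
 FROM
  dbo.StagingTable AS s
 -- HASH forces the physical join type; note that a join hint also fixes the
 -- join order as written, so only use it when you know your row counts
 INNER HASH JOIN
  dbo.DimensionTable AS d
 ON
  d.KeyCol = s.KeyCol;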

Outcome: The query that didn't finish in 8 hours last night with a nested loop join finished in 2 minutes with a forced hash join. I know that I will never have fewer than 1 million rows in either table so I know that the hash join is the only physical join that makes sense, regardless of SQL Server's bad statistics.

Tuesday, August 20, 2013

JOINS and Venn Diagrams

I've been brushing up on probability, set theory, and statistics lately -- there's a lot of need for probability and stats at work and the set theory comes along with certification training and the math review.

Venn diagrams (an example of which is pictured to the left) are a fairly common way to represent sets and are used a lot in set theory -- and not coincidentally -- in database systems. (Relational databases are based on set theory!)

Today while I was helping out on Database Administrators Stack Exchange, I found a good link from Coding Horror that shows a set of Venn diagrams representing how JOINs work in SQL. The Venn diagrams themselves don't include the mathematical symbols from set theory, but they are useful nonetheless.

Introduction and Welcome!

My Quick Biography

As a professional and academic, I am a data scientist with an interest in the languages, linguistics, and cultures of ancient civilizations.

As a data scientist, I am well versed in many databases but I am most broadly versed in Microsoft SQL Server, PostgreSQL, and Microsoft Access. I have a lot of interest in data architecture, data analysis, data visualization, big data, and business intelligence. I have been working with databases for about 15 years now but I'm just beginning Microsoft's certification process and aspire to be a Microsoft MVP within 4 years.

As an academic, I have a masters degree in Classical Studies from the University of New Mexico. I am well versed (as there is no such thing as fluency in unspoken languages) in Latin, Ancient Greek, German, and Spanish, and general Linguistics. I have some experience with Sanskrit and Old Irish. I taught Latin and Greek Mythology for four years at the university level and I copy-edited Disce! An Introductory Latin Course (volumes 1 & 2 and the workbooks) for Pearson. I run a free online Latin dictionary called Numen - The Latin Lexicon.

Goal of this Blog

I'm a technologically savvy guy but I've always felt that I consume far more content on the internet than I produce. When I was an instructor at UNM, I received great joy and satisfaction from watching people learn new skills. My goal with this blog is to recapture those feelings and to give back to the communities of both data scientists and Classicists. I realize that the Venn diagram's intersection between the two is quite small, but hopefully I can produce content that interests both groups -- and delights those few individuals who fall into that small slice.

I hope to post once per day mostly on the things that I'm learning. More realistically, I intend to post once to twice per week. 

So thanks for stopping by and I hope you'll find your way here again someday!