
Wednesday, August 28, 2013

A Really Good Definition of Database Selectivity

The Problem

Selectivity in a database is a truly important concept -- fundamental, even. But it's also abstract and hard to grasp at first. Of course, once you grasp it, you've got it forever. Database selectivity is like learning to ride a bicycle, tie your shoelaces, or even walk. So why do so many tutorials, blogs, and websites completely skip over it or -- even worse -- give some arcane, jargon-filled definition? This article is here to help you understand database selectivity from many points of view and with many examples. It is about one concept and one concept only.

The Definition

Honestly, I'm not even sure a definition is going to be helpful. Words are abstract representations in the first place, and using them to define esoteric concepts can cause panic in some learners (like me). The reason technical definitions are confusing is that words have no inherent meaning on their own -- they, like data in a database, are relational. They connect to memories and conceptual structures in your brain, which in turn connect to the memories and conceptual structures of other words. If those structures and connections don't already exist, you'll have to build them before a definition can make sense.

A definition is like a puzzle: if you've got 99% of the picture finished and you are just missing the last piece, great -- you can visualize the big picture and putting the last piece in is easy. 

But what happens when you're first starting out with a bunch of puzzle pieces that have no relationship because they're all jumbled up? Well, you have to spend a lot of time fitting them together, making sure the colors and the shapes line up. It takes time, effort, and perhaps patience to overcome the initial frustration.

Enough of the puzzle metaphor. Instead of one definition, I'm going to give you as many as I can find. 
  1. The degree to which one value can be differentiated within a wider group of similar values.

    This is fairly esoteric. I think this definition makes my head hurt a bit. First of all, what sort of degree are we talking about? Degree is a polysemous word (polysemous is just an adjective describing a word with many meanings, some of which might be unrelated). Differentiated? Wider group? Without a frame of reference, this one is very confusing.
  2. In SQL, the term selectivity is used when discussing database indexes. The selectivity of a database index is a numeric value that is calculated using a specific formula. That formula actually uses the cardinality value to calculate the selectivity. This means that the selectivity is calculated using the cardinality – so the terms selectivity and cardinality are very much related to each other. Here is the formula used to calculate selectivity: Selectivity of index = cardinality/(number of rows) * 100%

    I actually like this definition quite a bit because it's a mathematical formula. Unfortunately, it requires an understanding of cardinality prior to understanding selectivity. That makes it difficult if you don't have a grasp on cardinality, which is another fundamental concept in databases.
  3. Index selectivity is a number the determines the effectiveness of index. Best possible selectivity is when all table records have a different value for the columns in index (this is typical for primary keys and unique constraints).

    This is another highly technical definition. You need to know all kinds of concepts before you can get anything out of this one. Not only that, but it's badly phrased and written in a non-standard dialect of English. 
I could probably go on for quite some time with definitions from around the interwebs, but frankly I'm tired already. Google has failed me! Either that, or the definitions out there kinda suck.

So here's my definition of database selectivity: In any given set of data, selectivity refers to how many unique values exist compared to how many pieces of data exist in total, expressed as a percentage.

Here's another way to say it: Selectivity is the ratio of unique values to the total number of pieces of data (in any given data set).

Finally, here's the formula without using other terms you might not be familiar with yet:
Selectivity = number of unique (or distinct) values / total number of values 
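
For example, suppose a column holds 1,000 values and only 20 of them are distinct (numbers made up purely for illustration):

Selectivity = 20 / 1,000 = 0.02, or 2%

If all 1,000 values were distinct, the selectivity would be 1,000 / 1,000 = 1, or 100% -- the highest possible.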

Attributes of Selectivity

Selectivity really refers to a number, sometimes expressed as a ratio or a percentage. That number comes from a formula. The formula is really simple, and you should memorize it thoroughly.

High selectivity means the data have a lot of unique values and the number is closer to 1 or 100%. The adjective selective refers to high selectivity in a set of data. You might also see someone write about data being very selective.

Low selectivity means that the data have very few unique values and the number is closer to 0 or 0%. 

Examples


Example Using a Table

It might be helpful to understand what unique or distinct values are. If you examine the table on the right, you'll see three columns' worth of data: a child's name, what kind of pet he or she has, and the town he or she lives in.

How many different children are there? There are 8 different children, but there are only 7 names. Notice how John is repeated. It doesn't mean they are the same child, just that they have the same name. Again, there are 8 children but only 7 distinct names.

How many different kinds of pets are there? There are 3 kinds of pets: Dog, Cat, Fish. However, there are 8 pets total; several of them just happen to share a type. In other words, there are 8 pets and only 3 distinct types of pet.

How many different towns are there? In this case, there is only one: Boise. This one is a bit different because selectivity refers to the number of rows or values, not to the number of real-world objects being represented. There is 1 distinct town name listed here, but there are 8 rows.
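
If you'd like to follow along without the picture, here is a rough sketch of a table like the one described above -- every name except John, and the exact pet assignments, are made up for illustration -- along with a query that calculates the selectivity of each column:

-- A hypothetical version of the table described above:
-- 8 rows, 7 distinct names, 3 distinct pet types, 1 distinct town.
CREATE TABLE #Children
(
    ChildName varchar(20),
    PetType   varchar(20),
    Town      varchar(20)
);

INSERT INTO #Children ( ChildName, PetType, Town )
VALUES
    ( 'John',  'Dog',  'Boise' ),
    ( 'John',  'Cat',  'Boise' ),
    ( 'Mary',  'Dog',  'Boise' ),
    ( 'Sue',   'Fish', 'Boise' ),
    ( 'Bob',   'Cat',  'Boise' ),
    ( 'Alice', 'Dog',  'Boise' ),
    ( 'Tim',   'Fish', 'Boise' ),
    ( 'Jane',  'Dog',  'Boise' );

-- Selectivity = distinct values / total values, shown here as a percentage.
SELECT
    COUNT( DISTINCT ChildName ) * 100.0 / COUNT( * ) AS NameSelectivity, -- 7 / 8 = 87.5%
    COUNT( DISTINCT PetType )   * 100.0 / COUNT( * ) AS PetSelectivity,  -- 3 / 8 = 37.5%
    COUNT( DISTINCT Town )      * 100.0 / COUNT( * ) AS TownSelectivity  -- 1 / 8 = 12.5%
FROM
    #Children;

DROP TABLE #Children;

Run it and compare the three numbers with your answers to the questions below.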

Using the definitions and information provided above, can you answer these three questions?
  1. Is the child's name column highly or lowly selective and why?
  2. Is the pet's type column highly or lowly selective and why?
  3. Is the town's name column highly or lowly selective and why? 
Answers: 
  1. Highly selective -- because the percentage is closer to 100% (percentage in these cases referring to the number of distinct values divided by the total number of values in a column).
  2. Somewhat selective -- because the percentage is closer to the middle or 50%.
  3. Lowly or not selective -- because the percentage is closer to 0%. 
Are you starting to see the kind of relationship that defines selectivity? 

Example Using Circles

To the left we have two circles.

The big circle represents all your values.

The little circle represents only the distinct or unique values.

In this particular diagram, what is the ratio or percentage of distinct values to all your values? Is it highly selective or lowly selective? Is it closer to 100% or closer to 0%? This diagram represents low selectivity because the percentage is closer to 0%.

In the next set of circles, what is the ratio or percentage? Is this highly or lowly selective? This diagram represents high selectivity because the percentage is closer to 100%.

Example using a Database Query

I'm going to give you a query that you can use to figure out the selectivity of a column. Just replace the column name and the table name with any column or table in your database.

SELECT 
    COUNT( DISTINCT ColName ) AS DistinctValues,
    COUNT( ColName ) AS TotalNumberOfValues,
    -- Multiplying by 100.0 before dividing avoids integer division, which would round everything to 0 or 100.
    COUNT( DISTINCT ColName ) * 100.0 / COUNT( ColName ) AS Selectivity
FROM 
    TableName;

I ran this query on two different columns of the same table in my database: on the primary key (which by its very nature is 100% distinct) and on a date column.

Looking at the results, you can see that one has a very high selectivity (100%) and the other has a very low selectivity (close to 0%). Can you explain that? If you can, then you are probably getting the hang of selectivity.

Here's another question: which one of the results represents the primary key of the table and why?

Why Does it Matter?

The reason any of this matters is that understanding the selectivity of your columns, and being able to calculate it, lets you base your index-building decisions on facts and information. You will no longer throw an index on a table simply because you hope it will help performance; you will use good judgment and knowledge. (Use the force, Luke!)

The fact of the matter is that the higher the selectivity on an index (or in other words, the more unique values it has), the better that index will work for filtering data and joining tables. The reason that's true is the subject of another article altogether, so I will leave that for another day. In the meantime, try to understand selectivity and return to this article and the many more advanced articles on the interwebs to try to deeply ingrain this concept in your databasing soul! 


Tuesday, August 27, 2013

SSRS and the Fountain of Useless Errors: Conversion from type 'CalculatedFieldWrapperImpl' to type 'Integer' is not valid

Hey, I'm no novice. I've been doing this computerish stuff for a good long while. But Microsoft SQL Server Reporting Services (SSRS) stumps me on a regular basis. I use BIDS (Business Intelligence Development Studio) and SSDT (SQL Server Data Tools) weekly, sometimes daily, and SSRS seems to give the worst, most cryptic, most useless error messages of all time. Well, besides the error messages that come from Linux, of course.

Here's the doozie I got today:

The Sort expression for the grouping 'FUBAR' contains an error: Conversion from type 'CalculatedFieldWrapperImpl' to type 'Integer' is not valid.


So what had happened was this ... I had built a query (a view actually) in SQL Server and added it to the report and used it in a matrix. Then I decided I needed more columns, so I modified the view and then modified the data set. Then I tried to use the new columns to sort a group. When I tried to preview, I got this gloriously useless message.

I tried a lot: Googling, Binging, whatever else one does in these situations. I took a coffee break and thought about it. Then I realized that, since I had changed the source of the data, the data cache had probably gone bad. So I found the file report.rdl.data in my project folder and deleted it.

That did the trick.

Remember kids, anytime you change the underlying data source, delete the cache from BIDS or SSDT. It will cure a huge number of problems in SSRS.

Thursday, August 22, 2013

SQL Server Database Mail: Make sure Database Mail Procedures, Profiles, and Operators Exist

If you'd like to send an email from SQL Server, you can do it using database mail, presuming that database mail is installed.

If you'd like to send a notification to an operator from a stored procedure or any other T-SQL code, you should use the procedure sp_notify_operator.

But what happens if database mail isn't installed? You will get an error that the procedure doesn't exist. You could try to catch that error with a TRY...CATCH block, but you could also test whether the procedure exists first.
 
IF EXISTS
(
    SELECT
        *
    FROM
        msdb.sys.objects
    WHERE
        type = 'P'
        AND name = 'sp_notify_operator'
)
BEGIN

    -- do something here, such as calling the procedure

END
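
For example, the body of that block might call the procedure. Here is a minimal sketch; the profile name, operator name, subject, and body are all placeholders you would replace with your own values:

EXEC msdb.dbo.sp_notify_operator
    @profile_name = N'MyMailProfile', -- placeholder: an existing Database Mail profile
    @name         = N'MyOperator',    -- placeholder: an existing operator
    @subject      = N'Nightly job status',
    @body         = N'The nightly job finished successfully.';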

You could also check to see if the email profile or the operator exists.

-- Make sure the profile exists
IF NOT EXISTS
(
    SELECT
        *
    FROM
        msdb.dbo.sysmail_profile
    WHERE
        name = @email_profile_name
)
BEGIN

    RAISERROR( 'That database mail profile does not exist.', 16, 1 );
    RETURN;

END

-- Make sure the operator exists
IF NOT EXISTS
(
    SELECT
        *
    FROM
        msdb.dbo.sysoperators
    WHERE
        name = @operator_name
)
BEGIN

    RAISERROR( 'That operator name does not exist.', 16, 1 );
    RETURN;

END

Wednesday, August 21, 2013

SQL Server and the Quest for the Intermediate Materialization Query Hint

The Adventure

I feel like I'm Indiana Jones and I'm hunting for a lost treasure: intermediate materialization. It seems like I do that a lot in SQL Server -- I'm probably on my 5th or 6th sequel by now (get it? SQL? SEQUEL? sequel?). Most of the time there is some kind of awful kluge to work around the SQL Server team's stubbornness. They wrote SQL Server that way, and it's just fine the way it is!

An example that's unrelated: paging results. In the old days, paging results was a nightmare that had to be done with the TOP keyword, sorting, and several subqueries. I shudder to think of those days. Then the window functions came and ROW_NUMBER() saved the day. Finally, in SQL Server 2012, Microsoft's dev team gave in and we got the OFFSET-FETCH keywords. (Funny, since MySQL and PostgreSQL have been making our lives easier with such keywords as LIMIT and OFFSET for ... decades?)

But one thing that doesn't seem to be on the table yet is (dun dun duuuun!) intermediate materialization. Sure, Microsoft has said they'll [try to think about maybe] support a query hint for intermediate materialization in the future, but that request was made in 2006 (!!!!!) and the hint still hasn't arrived.

The Problem

Let's say you're diving into execution plans pretty deeply. You find there are some problems inherent to SQL Server's query optimizer -- it just doesn't do what you want it to do. 

You find that you can run some part of your query in a new connection window (that will later become a subquery). It runs in 3 seconds instead of 90 minutes. Voila! You have the answer. All you have to do now is to integrate the subquery into the main query. 

The problem starts after you stick your standalone query into a subquery in your view and join it to the main query. Then you run it and realize that it's taking 90 minutes again! How frustrating, but why?

The reason, you discover after analyzing the query plan, is that the query optimizer has rewritten your subquery into a series of joins, undoing all the hard work you did by creating a subquery in the first place (and making your query run 10000% longer in the process).

You know there are such things as query hints, so you go off looking for a query hint to solve this problem. The problem is, you don't know what this problem is called. You search for subquery optimization in SQL Server and stop SQL Server from optimizing my subquery and SQL Server is rewriting subqueries and on and on until your fingers get numb. Then you discover that subqueries are sometimes called virtual queries so you search for SQL Server rewrites virtual queries. At some point you stumble upon some articles relating to materialized queries and you remember that you've heard those words before but you don't know what they mean because your database professor covered that topic on the Monday morning after the big football game. 

What is Materialization?

In databases, materialization is the process of gathering data from physical sources and producing a result in the form of a relation (a table). That result is a snapshot of the data needed at that moment in time. When a query or view's results are actually computed and stored, rather than merely defined, those results are said to be materialized. Wikipedia has a good explanation of a materialized view.

It seems abstract now, but it's really a simple concept. It's like arrays or objects. Eventually you'll just get it. I suggest you try really hard to understand it now so that the learning will come faster because this is a really useful concept in your database life. It's like playing Super Mario Brothers. You just couldn't get past level 8-4 the night you tried for 6 hours, but the next morning after a good night's sleep, you got through it in one try.

What is Intermediate Materialization? 

Intermediate materialization is when part of a query's results must be stored temporarily before the remainder of the query is processed. In other words, instead of joining the pre-existing data sources directly, SQL Server builds a temporary result set first and then joins to that.

Let me ask you a question: Have you ever created a temporary table from a SELECT statement and then joined that temporary table to another table? If your answer is yes, then you have just carried out the poor man's version of intermediate materialization.
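
In case you haven't, here is a minimal sketch of that pattern. The tables (dbo.BigTable, dbo.OtherTable) and columns are made up for illustration:

-- Step 1: materialize the expensive part of the query into a temp table.
SELECT
    bt.KeyColumn,
    SUM( bt.Amount ) AS TotalAmount
INTO #Intermediate
FROM
    dbo.BigTable AS bt
GROUP BY
    bt.KeyColumn;

-- Step 2: join the rest of the query to the temp table.
-- The optimizer cannot rewrite step 1 into the join below because it has already run.
SELECT
    o.KeyColumn,
    o.SomeColumn,
    i.TotalAmount
FROM
    dbo.OtherTable AS o
    INNER JOIN #Intermediate AS i ON i.KeyColumn = o.KeyColumn;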

The problem is that SQL Server does not support forcing intermediate materialization within a single query ... And in fact, if other bloggers are to be believed, the SQL Server dev team is actively trying to make it harder for us to use intermediate results in a single query. [aside: I have a theory that the reason Microsoft doesn't want to add this query hint is that it would create a non-relational phase in the logical processing steps of executing queries.]

Solution

The old-fashioned way to do this was to use TOP 100 PERCENT, but evidently that trick went away in SQL Server 2005. Adam Machanic suggests using TOP ( 2147483647 ) instead, but that advice was for SQL Server 2005, and who knows if it still works in 2008 R2 or 2012?

Update 2013.08.28: Yes, it does seem to work in 2008 R2. Your mileage may vary. 

Another user on Stack Overflow suggests using a CASE expression (and might I suggest that you apply it to your join keys), and I've had some luck with that in the past. Keep in mind, however, that if you're joining millions of rows, each one of them has to be evaluated as an expression, and that could slow your query down in its own right.
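
Here is roughly what those two tricks look like, using the same made-up tables and columns as before. There is no guarantee that any particular version of SQL Server won't see straight through them:

-- Trick 1: TOP ( 2147483647 ) inside the derived table tends to discourage
-- the optimizer from flattening the subquery into the outer query.
SELECT
    o.KeyColumn,
    o.SomeColumn,
    sub.TotalAmount
FROM
    dbo.OtherTable AS o
    INNER JOIN
    (
        SELECT TOP ( 2147483647 )
            bt.KeyColumn,
            SUM( bt.Amount ) AS TotalAmount
        FROM
            dbo.BigTable AS bt
        GROUP BY
            bt.KeyColumn
        ORDER BY
            bt.KeyColumn
    ) AS sub ON sub.KeyColumn = o.KeyColumn;

-- Trick 2: wrap the join key in a do-nothing CASE expression, which makes it
-- harder for the optimizer to push the join into the subquery, at the cost of
-- evaluating the expression for every row:
--     ON CASE WHEN sub.KeyColumn IS NOT NULL THEN sub.KeyColumn END = o.KeyColumn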

Example

I had a query that was taking 80 minutes, and when I tracked its duration over time, it was slowly increasing. I optimized the subquery so that it took 5 minutes on its own, but inside the full query the optimizer rewrote the subquery into a series of joins. After tricking SQL Server into intermediate materialization, I was able to produce the query results consistently in about 6 minutes. This chart demonstrates the massive savings I was able to get by using intermediate materialization. Your mileage may vary!

Query Plans Gone Bad: The Case of the Nested Loop and the Trillion Operations

At work, we have at least three huge ETL (extract, transform, load) jobs. They usually run within a normal 2-3 hour time frame. Yet every once in a while there's a crazy spike, and those spikes ruin my day because the ETL jobs are laughably complex and there are downstream reports. Plus these ETL jobs run daily, so they can't be allowed to pile up ...

What's causing those spikes? The short answer is bad execution plans and bad statistics.

You see, a query (a very simple INSERT query with a single joined table) was holding up one ETL job. The query normally runs in two minutes, but this time it had been running for 8 hours overnight. I started getting "over average" alerts around 6 am. To the left you can see that the 4-core CPU had been pegged until I killed the query. There was almost no disk I/O and there was nothing unusual about the memory or the network. Just a solidly pegged CPU for 8 hours. Boy, that CPU must have been hot.

So I checked out the query plan in the procedure cache. Below you can see the result:



Do you see where it shows a Nested Loops (Inner Join) to the right? That's the one join in this query. Do you see where the two tables feeding it have numbers beside them? Those are the number of rows that each table fed into the loop. 1,285,470 and 1,285,471 ... Well, do you know how a nested loop works? Essentially the formula for the number of iterations is number of rows in table A times number of rows in table B. Quick math problem: what's 1,285,470 times 1,285,471? The answer is 1,652,434,406,370 or 1.652 trillion.

OUCH! SQL Server actually made the incredibly bad decision to choose a plan that iterated 1.652 trillion times! I know this production server's 4-core processor is fast, but no wonder this query was taking upwards of 8 hours.

So why did SQL Server choose this plan? Possibly bad statistics. Here's the relevant portion of the estimated execution plan:



If you look at the right-hand table's properties, Estimated Number of Rows = 1. So, 1 times roughly 1.3 million is not so bad, especially for a nested loop. That would have been reasonable if there had actually been one row in the right-hand table, but there wasn't. Since the estimated plan (which is based on statistics) decided there was only one row on the right-hand side of the join, the query planner chose a nested loop instead of a hash join.

Normally, I could stop the ETL job, update stats, and then continue on my merry way with the rest of the ETL job. I had no such luck this morning, unfortunately. I ran exec sp_updatestats on both our data warehouse and our staging database, but for some reason I didn't get any better query plans -- perhaps a full statistics update of the affected tables would have worked better, but I didn't have time and I knew a quicker solution. Also, I didn't want to flush the procedure cache (a.k.a., the plan cache) because this is a production server.
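
For reference, the database-wide sampled refresh I ran and the per-table full-scan alternative look like this (dbo.FactTable is a placeholder for whichever table has the stale statistics):

-- Sampled statistics refresh across the whole database:
EXEC sp_updatestats;

-- Full-scan refresh of a single table's statistics (slower, but more accurate):
UPDATE STATISTICS dbo.FactTable WITH FULLSCAN;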

How frustrating! SQL Server does this to me all the time, by the way. I wish Microsoft would change the way statistics work: if statistics are not available or are out of date (which is clearly the case here), then do a quick table scan to estimate how many rows are in the table! If it's more than some threshold, don't choose the nested loop! It seems like statistics default to 1 row when they are bad or missing. Some of my tables have 770 million rows in them! Defaulting to 1 row every time a statistic goes bad is catastrophic.

My solution: I'm forcing the query to use a hash join with the hash join hint: INNER HASH JOIN. Some database administrators might balk at the idea of using query hints, but I say that due to flaws in SQL Server's statistics system and query planner, query hints are an absolute necessity. Incidentally, this is the 3rd time this week that I've been bitten by bad query plans and it's only Wednesday.
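
If you haven't used a join hint before, the syntax is just the word HASH between INNER and JOIN. The tables and columns below are placeholders standing in for the real query:

SELECT
    s.BusinessKey,
    d.SurrogateKey
FROM
    dbo.Staging AS s
    INNER HASH JOIN dbo.DimCustomer AS d ON d.BusinessKey = s.BusinessKey;

-- Alternatively, leave the join alone and put the hint at the end of the query:
-- OPTION ( HASH JOIN );

Keep in mind that specifying a join hint in the FROM clause also forces the join order as written for the whole query, which is one more reason to use hints sparingly.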

Outcome: The query that didn't finish in 8 hours last night with a nested loop join finished in 2 minutes with a forced hash join. I know that I will never have fewer than 1 million rows in either table so I know that the hash join is the only physical join that makes sense, regardless of SQL Server's bad statistics.

Tuesday, August 20, 2013

JOINS and Venn Diagrams

I've been brushing up on probability, set theory, and statistics lately -- there's a lot of need for probability and stats at work and the set theory comes along with certification training and the math review.

Venn diagrams (an example of which is pictured to the left) are a fairly common way to represent sets and are used a lot in set theory and, not coincidentally, in database systems. (Relational databases are based on set theory!)

Today, while I was helping out on Database Administrators Stack Exchange, I found a good link from Coding Horror that shows a bunch of Venn diagrams representing how JOINs work in SQL. The Venn diagrams themselves don't include the mathematical symbols from set theory, but they are useful nonetheless.

Introduction and Welcome!

My Quick Biography

As a professional, I am a data scientist; as an academic, I am interested in the languages, linguistics, and cultures of ancient civilizations.

As a data scientist, I am well versed in many databases, but I am most deeply versed in Microsoft SQL Server, PostgreSQL, and Microsoft Access. I have a lot of interest in data architecture, data analysis, data visualization, big data, and business intelligence. I have been working with databases for about 15 years now, but I'm just beginning Microsoft's certification process and aspire to be a Microsoft MVP within 4 years.

As an academic, I have a masters degree in Classical Studies from the University of New Mexico. I am well versed (as there is no such thing as fluency in unspoken languages) in Latin, Ancient Greek, German, and Spanish, and general Linguistics. I have some experience with Sanskrit and Old Irish. I taught Latin and Greek Mythology for four years at the university level and I copy-edited Disce! An Introductory Latin Course (volumes 1 & 2 and the workbooks) for Pearson. I run a free online Latin dictionary called Numen - The Latin Lexicon.

Goal of this Blog

I'm a technologically savvy guy, but I've always felt that I consume far more content on the internet than I produce. When I was an instructor at UNM, I received great joy and satisfaction watching people learn new skills. My goal with this blog is to recapture those feelings and to give back to both the data science community and the Classics community. I realize that the intersection of those two groups on a Venn diagram is quite small, but hopefully I can produce content that interests both groups, and I especially hope it delights the people who fall into that small slice.

I hope to post once per day, mostly on the things I'm learning. More realistically, I intend to post once or twice per week.

So thanks for stopping by and I hope you'll find your way here again someday!