SyntaxHighlighter

Wednesday, August 28, 2013

A Really Good Definition of Database Selectivity

The Problem

Selectivity in a database is a truly important concept -- fundamental, to some degree. But it's also very abstract, weird, hard to understand, and hard to grasp. Of course, once you grasp it, you've got it forever. Database selectivity is like learning to ride a bicycle for the first time or learning to tie shoelaces or learning to walk even. So why is it that so many tutorials and blogs and websites completely skip over this or -- even worse -- give some arcane technically jargon-filled definition? This article is to help you understand database selectivity from many points of view and with many examples. This article is about one concept and one concept only.

The Definition

Honestly, I'm not even sure a definition is going to be helpful. Words are abstract representations in the first place and using them to define esoteric concepts can cause panic in some learners (like myself). The reason technical definitions are confusing is because words have no inherent meaning on their own -- they, like data in a database, are relational. They are related to memory and conceptual structures in your brain which are connected to the memory and conceptual structures of other words in your brain. If those memories and conceptual structures and connections don't already exist, then you'll have to build them before you can understand a definition.

Look! It's a kitty cat! Meow!
A definition is like a puzzle: if you've got 99% of the picture finished and you are just missing the last piece, great -- you can visualize the big picture and putting the last piece in is easy. 

But what happens when you're first starting out with a bunch of puzzle pieces that have no relationship because they're all jumbled up? Well, you have to spend a lot of time fitting them together, making sure the colors and the shapes line up. It takes time, effort, and perhaps patience to overcome the initial frustration.

Enough of the puzzle metaphor. Instead of one definition, I'm going to give you as many as I can find. 
  1. The degree to which one value can be differentiated within a wider group of similar values.

    This is fairly esoteric. I think this definition makes my head hurt a bit. First of all, what sort of degrees are we talking about? It's a polysemous word. Polysemous is just an adjective that means "a word with many meanings, some of which might be unrelated." Differentiated? Wider group? Without a frame of reference, this one is very confusing.
  2. In SQL, the term selectivity is used when discussing database indexes. The selectivity of a database index is a numeric value that is calculated using a specific formula. That formula actually uses the cardinality value to calculate the selectivity. This means that the selectivity is calculated using the cardinality – so the terms selectivity and cardinality are very much related to each other. Here is the formula used to calculated selectivity: Selectivity of index = cardinality/(number of rows) * 100%

    I actually like this definition quite a bit because it's a mathematical formula. Unfortunately, it requires an understanding of cardinality prior to understanding selectivity. That makes it difficult if you don't have a grasp on cardinality, which is another fundamental concept in databases.
  3. Index selectivity is a number the determines the effectiveness of index. Best possible selectivity is when all table records have a different value for the columns in index (this is typical for primary keys and unique constraints).

    This is another highly technical definition. You need to know all kinds of concepts before you can get anything out of this one. Not only that, but it's badly phrased and written in a non-standard dialect of English. 
I could probably go on for quite some time with definitions from around the interwebs, but frankly I'm tired already. Google has failed me! Either that, or the definitions out there kinda suck.

So here's my definition of database selectivity: In any given set of data, selectivity refers to the percentage of how many unique values exist compared to how many pieces of data exist. 

Here's another way to say it: Selectivity is the ratio between unique values and the number of pieces of data (in any given data set).  

Finally, here's the formula without using other terms you might not be familiar with yet:
Selectivity = number of unique (or distinct) values / total number of values 

Attributes of Selectivity

It really refers to a number, sometimes expressed as a ratio or a percentage. This number is produced based on a formula. The formula is really simple and you should memorize it thoroughly.

High selectivity means the data have a lot of unique values and the number is closer to 1 or 100%. The adjective selective refers to high selectivity in a set of data. You might also see someone write about data being very selective

Low selectivity means that the data have very few unique values and the number is closer to 0 or 0%. 

Examples


Example Using a Table

It might be helpful to understand what unique or distinct values are. If you examine the table on the right, you'll see three columns worth of data. A child's name, what kind of pet he or she has, and the town he or she lives in. 

How many different children are there? There are 8 different children but there are only 7 names. Notice how John is repeated. It doesn't mean they are they same child, just that they have the same name. Again there are 8 children, but only 7 distinct names.

How many different kinds of pets are there? There are 3 kinds of pets: Dog, Cat, Fish. However, there are 8 pets total, they just happen to be the same type. In other words, there are 8 pets and only 3 distinct types of pets. 

How many different towns are there? In this case, there is only one town: Boise. This problem is a bit different: there is only one town, but there are 8 rows of data.  Selectivity refers to the number of rows or values, not to the number of objects being represented. There is 1 distinct town name listed here, but there are 8 rows. 

Using the definitions and information provided above, can you answer these three questions?
  1. Is the child's name column highly or lowly selective and why?
  2. Is the pet's type column highly or lowly selective and why?
  3. Is the town's name column highly or lowly selective and why? 
Answers: 
  1. Highly selective -- because the percentage is closer to 100% (percentage in these cases referring to the number of distinct values divided by the total numbers of values in a column).
  2. Somewhat selective -- because the percentage is closer to the middle or 50%.
  3. Lowly or not selective -- because the percentage is closer to 0%. 
Are you starting to see the kind of relationship that defines selectivity? 

Example Using Circles

To the left we have two circles.

The big circle represents all your values.

The little circle represents only the distinct or unique values.

In this particular circle, what is the ratio or percentage of distinct values to all your values? Is it highly selective or lowly selective? Is it closer to 100% or closer to 0%? This diagram represents low selectivity because the percentage is closer to 0%

In the next set of circles, what is the ratio or percentage? Is this highly or lowly selective? This diagram represents high selectivity because the percentage is closer to 100%.

Example using a Database Query

I'm going to give you a query that you can use to figure out the selectivity of a column. Just replace the column name and the table name with any column or table in your database.

SELECT 
    COUNT( DISTINCT ColName ) AS DistinctValues,
    COUNT( ColName ) AS TotalNumberOfValues,
    COUNT( DISTINCT ColName ) / COUNT( ColName ) * 100.0 AS Selectivity
FROM 
    TableName;

I ran this query on two different columns on the same table in my database. On the primary key (which by its very nature has 100% distinct values) and on a date column.

Looking at the results, you can see that one has a very high selectivity (100%) and the other has very low selectivity (0%). Can you explain that? If you can, then you are probably getting the hang of selectivity.

Here's another question: which one of the results represents the primary key of the table and why?

Why Does it Matter?

The reason any of this matters is because understanding the selectivity of your columns and being able to calculate it will make your ability to build indexes much more fact- and information-based. You will no longer simply throw an index on a table because you hope it will help the performance. You will use good judgement and knowledge. (Use the force, Luke!)

The fact of the matter is that the higher the selectivity on an index (or in other words, the more unique values it has), the better that index will work for filtering data and joining tables. The reason that's true is the subject of another article altogether, so I will leave that for another day. In the meantime, try to understand selectivity and return to this article and the many more advanced articles on the interwebs to try to deeply ingrain this concept in your databasing soul! 


No comments:

Post a Comment