
Tuesday, December 31, 2013

My New Years Resolution: Learn About Massively Distributed Database Systems

Database scalability seems to be a problem for many database administrators and data managers. The problem is not "how do I do it" or "what is it" but rather "what do I do with it?" I've worked with the smallest databases on many different traditional RDBMS platforms and I've worked with some pretty huge databases. My last job had several terabyte databases sitting on a single database server -- no clustering. I currently work in an environment with several small databases sitting on an over-sized cluster. Frankly, I think they have it backwards. The big environment should have been clustered and the small environment doesn't need to be. But I digress. The point is, I don't think most database architects know what to do with scalable systems because they simply don't have enough data in their small to medium shops. But that will change -- and soon!

In order to grow and to perhaps work in a bigger shop that has the need for massively distributed data, we have to think bigger and learn about the world outside of our small ponds. Not only that, but our small ponds will be getting bigger. The data landscape is changing quickly.

I'm thinking about this today because I read an article about InfiniSQL (actually, more of a press release) from its principal developer. Some of the concepts didn't make a whole lot of sense to me because I've never worked in an environment that distributes data to multiple geographical locations. Thinking back on the articles I've read recently, it seems like distribution of data is going to be the big problem to solve in the future and it seems like an important niche to dive into. After all, last year's contender for word-of-the-year was Big Data.

But Big Data is conceptual in nature and very few people understand what it really is. Big Data is not a terabyte or two terabytes in a database cluster -- not even ten or twenty or fifty or a hundred. Big Data is more like a hundred terabytes per day. The actual storage systems are dozens to hundreds of petabytes in size. Big Data was actualized by massive players -- Yahoo and Google and maybe even Pixar. I found this out by reading Big Data: A Revolution That Will Transform How We Live, Work, and Think.

What I would like to talk about here is not Big Data. Big Data is going to get bigger -- that's true. Next decade's databases will probably be 10-100 times bigger than the ones we have now, even for small to medium-sized firms. Since storage is so cheap, we've all begun to hoard data. But in its most fundamental form, data -- and especially Big Data -- is less than useless!

The time for some is now -- and the time for the rest of us is soon -- when we will actually have to start leveraging our treasure mounds of data for something beyond simple reports. The new buzzwords are Data Visualization, Data Mining, and the like. We don't just need to store the data, we need to extract useful information from it.

Sure, Data Visualization and Data Mining have been around for a while as concepts, but they haven't really entered the lexicon of the common database administrator. The reason is that DV and DM have been so difficult to accomplish without a degree in higher mathematics -- seriously! Have you ever wanted to compare the graphs of two time periods to detect their "sameness" or "differentness?" There's a reason we draw graphs and give them to analysts to interpret -- because the human brain can still do so much more with two simple graphs than a team of computer scientists can. The algorithms and mathematics are out there, for the most part, but where data, computer algorithms, and higher mathematics intersect is the trivium (the crossroads) where only a few have been able to tread successfully.

Those people, with that trinity of skills, are the ones who make $400-$500k annually as developers. They make so much money and live such charmed lives because conceptually it's such an exquisitely difficult nirvana to reach. It's hard enough learning how to program, even harder to understand the intricacies of data, and harder still to master the highest orders of mathematics. Put them together and you have a requirement of twenty years of intense college education peppered with summers of internships and late, lonely nights of coding and experimentation. Unfortunately, twenty years in this industry is a lifetime, and everything will have changed by then.

That's why it's important to keep up with the current trends in the data sciences. Sure, there are a few geniuses here and there who have the right mix of skills to command $500k (in a non-executive role), but they are few and far between. For the rest of the world, there are the data scientists who will earn $150k, $125k, $100k, $75k per year. They are not necessarily the ones who develop the database software or the analytics software (although there will be some small-scale development for those who can push themselves). Rather, they will be the ones who learn to listen to the tracks to hear the oncoming trains. In other words, they will be the ones who learn to leverage existing software and push their small shops to the bleeding edge by testing prototype systems in strange and new combinations to see what works and what doesn't.

So here's my New Year's resolution. 2014 will be the year in which I push my skills into new territories. I don't just want to read about Big Data, Data Visualization, Data Mining, Statistical Analysis, Distributed Systems, etc. I want to learn their intricacies and nuances and I want to listen to what the data science community is saying about them. I want to learn about horizontal scaling, vertical scaling, large scale clustering, and anything having to do with storing and mining that data for useful information. Perhaps next year's resolution will be to come up with a problem to solve. Until then!

Friday, December 20, 2013

CONVERT_IMPLICIT in a JOIN or Subquery

Implicit conversion can be a big problem in SQL Server, slowing your queries down by orders of magnitude -- even with relatively low row counts.

First of all, what is implicit conversion? When two values are compared in an expression, their data types should match. If they don't, SQL Server will implicitly (without being told) convert one data type into the other so that they do. Once they match, SQL Server can make the comparison.
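Here's a tiny sketch of what that looks like in practice. The table and column names are hypothetical (they aren't from the query discussed below); assume AccountCode is a varchar(10) column with an index on it:

 -- The N'...' literal is nvarchar, which has higher data type precedence than varchar,
 -- so SQL Server converts the AccountCode column on every row before it can compare --
 -- that conversion is the CONVERT_IMPLICIT, and it can prevent an index seek.
 -- (Using a plain 'ABC123' varchar literal would avoid the conversion entirely.)
 SELECT
  CustomerId
 FROM
  dbo.Customers
 WHERE
  AccountCode = N'ABC123'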

Implicit conversion can be identified and corrected in many cases, and it's most easily done in the query itself with explicit CAST and CONVERT expressions. But what happens when you join two tables together on keys of different data types?

When sets are overlapped (with a JOIN, or with a subquery in an IN operator), each set has a key which correlates them. If those keys are exactly the same data type (not merely compatible data types), then indexes which exist on the keys can be used directly to speed up the correlation. However, if the two data types are not the same, SQL Server must implicitly convert one of them, and it can no longer use those indexes effectively. So basically, all that work you did to make those indexes finely tuned and optimized gets thrown out the window, because they will never be used for seeks.
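To make that concrete, here's a hedged sketch with made-up tables whose keys deliberately don't match (mirroring the nvarchar(15) versus char(10) mismatch discussed below):

 -- Hypothetical schema for illustration only
 CREATE TABLE dbo.Orders
 (
  OrderKey char(10) NOT NULL PRIMARY KEY
 )

 CREATE TABLE dbo.Shipments
 (
  OrderKey nvarchar(15) NOT NULL,
  ShippedOn date NOT NULL
 )

 CREATE INDEX IX_Shipments_OrderKey ON dbo.Shipments ( OrderKey )

 -- Because the keys are char(10) and nvarchar(15), SQL Server must wrap one side in
 -- CONVERT_IMPLICIT for every row it touches, and the index on the converted column
 -- can no longer be used for a straightforward seek
 SELECT
  S.ShippedOn
 FROM
  dbo.Orders O
 JOIN
  dbo.Shipments S
 ON
  S.OrderKey = O.OrderKey

In the execution plans below, that per-row conversion is exactly what shows up as a Compute Scalar operator.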

Look at this snippet of an execution plan:


Above, an implicit conversion is happening inside the Compute Scalar operators in this execution plan. You can see that almost half a million rows are being scanned from Table3. Notice that the compile time is almost 4 seconds. Client statistics over 10 runs revealed that this query takes an average of 2892ms to run.

COMPUTE SCALAR is actually converting nvarchar(15) in Table3 to char(10), and it's doing it 500,000 times! Is all that work really necessary?! Not at all.

Look what happens when I change the data type of the base table Table3 to char(10). Note that SQL Server will complain about this conversion because it could result in a loss of data, but in my case, I verified that it would not have.
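For reference, the change itself is a one-liner. This is only a sketch -- the column name is hypothetical because the real one is anonymized here -- and you should confirm the existing values actually fit in char(10) (and contain no Unicode-only characters) before running it:

 -- Check the data first, e.g. SELECT MAX( LEN( OrderKey ) ) FROM dbo.Table3
 -- SQL Server will complain, as noted above, if the conversion could lose data.
 -- Keep the column's existing NULL / NOT NULL setting so it isn't changed accidentally.
 ALTER TABLE dbo.Table3
 ALTER COLUMN OrderKey char(10) NOT NULL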


In the second plan, first notice that the anonymizer has changed Table3 to Table4. More importantly, there is no implicit conversion, and the indexes can be -- and in fact are -- used for seeks. Instead of scanning 500,000 rows, you now see it seeking 112 rows and 7 rows. A tiny change has radically altered the dynamics of this query. Notice the other stats too, like the fact that it ran in roughly 1/40th (about 3%) of the time. Client statistics confirm the speedup.
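By the way, if you want to gather your own before-and-after numbers without the Client Statistics window, a quick and dirty option is SET STATISTICS TIME (and STATISTICS IO for the read counts); the results land in the Messages tab:

 SET STATISTICS TIME ON
 SET STATISTICS IO ON

 -- run the query in question here, before and after the data type change,
 -- and compare the elapsed times and logical reads

 SET STATISTICS TIME OFF
 SET STATISTICS IO OFF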

Your mileage may vary, of course, but always look out for CONVERT_IMPLICIT!
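And if you're wondering whether CONVERT_IMPLICIT is lurking elsewhere in your workload, one rough approach -- just a sketch, and not a cheap query, so be careful where you run it -- is to search the plan cache:

 -- Find cached plans whose XML mentions CONVERT_IMPLICIT
 SELECT TOP( 20 )
  ST.[text] AS QUERY_TEXT,
  QP.query_plan
 FROM
  sys.dm_exec_cached_plans CP
 CROSS APPLY
  sys.dm_exec_sql_text( CP.plan_handle ) ST
 CROSS APPLY
  sys.dm_exec_query_plan( CP.plan_handle ) QP
 WHERE
  CAST( QP.query_plan AS nvarchar(max) ) LIKE N'%CONVERT_IMPLICIT%'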


Thursday, December 5, 2013

Adding Existing Queries and Miscellaneous Files to Source Control in SSMS with MSSCCI

Wow, that title is a mouthful, isn't it? It makes me cringe a little bit that I had to write it!

So, you know what SSMS is, right? SQL Server Management Studio. That's where you write most of your queries. If you've used this product extensively you know that managing your queries can be painful -- you probably rename your query 50 times with some combination of the date or keywords or something similar.
  • MyQuery.sql
  • MyQuery 2013.12.05.sql
  • MyQuery By Week.sql
But if you use TFS (Microsoft's Team Foundation Server), you can quite happily add query projects to source control just like any other development project. 

This post is not about how to set up your source control -- there is a great blog post called Use Team Foundation Server (TFS) as your Source Control in SSMS by Dan Thompson that explains it quite well. It uses an interface called MSSCCI (Microsoft Source Code Control Interface).

This post is actually about one major oversight in the interface: you can't add existing items, such as SQL files from other locations or query result files that you'd like to keep track of. If you right click on Queries or Miscellaneous you'll notice that you only have the option to create new items. Bummer, right? 

There is a solution -- two, actually. You can either edit the project files manually (not described here) or take the easy way out and simply drag and drop your files from a Windows Explorer window.


Awesome, right? Now you have a fairly complete source control solution and you can manage your SQL files and the related miscellaneous files!

Tuesday, December 3, 2013

Transform a Date Range into a Time Series

It's been a few weeks since I last posted anything because I've relocated to a new company. The transition was smooth but I haven't had time to post. Now, however, after having dived right in, I'm back up and running and writing big queries for big reports.

Yesterday I received a data request asking for a time series dataset, but our management system only stores data in ranges. For any given range of time X, they wanted to see the data for that time period broken into a time series -- in this case, by month. Conceptually, it's similar to an UNPIVOT statement in that it takes two columns (a start and an end date) and turns them into X number of months as rows.

Here's an example of what the range looks like (table A):

BEGINDATE   ENDDATE     DATA
2013-01-01  2013-04-04  8

And here's what the output should look like (table Z):

FIRSTDAYOFMONTH  DATA
2013-01-01       8
2013-02-01       8
2013-03-01       8
2013-04-01       8

Brilliant. So the result looks somewhat like a table that has been unpivoted, but after careful analysis, it's not the same transformation. What really has to happen is this: however many months fall within the date range, that's how many rows need to be created.

There are a number of facilities for performing this action in T-SQL. T-SQL has many iterative features (cursors, WHILE loops, et cetera), so this problem could easily be solved that way. However, it can also be solved in a single query, because SQL Server includes a number of set-based features that can perform this task. You have to understand the problem and use the right tool for the job.
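For comparison's sake, here's a rough sketch of the iterative route (the variable names are my own). It gets the job done, but the rest of this post builds the set-based version instead:

 -- Walk from BEGINDATE to ENDDATE one month at a time, adding a row per month
 DECLARE @BEGINDATE date = '2013-01-01'
 DECLARE @ENDDATE   date = '2013-04-01'
 DECLARE @DATA      int  = 8

 DECLARE @RESULTS TABLE ( FIRSTDAYOFMONTH date, DATA int )

 WHILE @BEGINDATE <= @ENDDATE
 BEGIN
  INSERT INTO @RESULTS ( FIRSTDAYOFMONTH, DATA ) VALUES ( @BEGINDATE, @DATA )
  SET @BEGINDATE = DATEADD( MONTH, 1, @BEGINDATE )
 END

 SELECT FIRSTDAYOFMONTH, DATA FROM @RESULTS ORDER BY FIRSTDAYOFMONTH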

What is the right tool for the job? Well, after more careful analysis, we realize that the core of this problem is creating a variable number of rows of output. Are there any tools in SQL Server that will do such a thing? Yes! A JOIN is perfectly capable of turning one row into many in a one-to-many relationship. On the left-hand side of the join we need a "one" table and on the right-hand side of the join we need a "many" table. We have the "one" table -- table A. But we need the "many" table -- table B.

There are several ways to create table B, but the way I prefer is to simply use the ROW_NUMBER window function. Just somehow manage to get a bunch of rows, then number them. Here's a fairly elegant way to do this:

 
 SELECT TOP(1000)
  ROW_NUMBER() OVER( ORDER BY ( SELECT 0 ) ) - 1 AS NUMBER
 FROM
  sys.columns a
 CROSS JOIN 
  sys.columns b
 CROSS JOIN 
  sys.columns c
 CROSS JOIN 
  sys.columns d

It works by creating a huge number of rows and then numbering them. In other databases, such as PostgreSQL, you could theoretically use a function like generate_series(). Or you could create a table valued function in SQL Server that does the same thing. It makes no difference. What does make a difference, however, is the TOP(1000) qualifier. This should be the maximum number of time units you ever expect to see. 1000 months is about 83 years, but 1000 seconds is about 17 minutes.
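If you'd rather wrap that in a table valued function as mentioned above, a minimal inline version might look like this (the name dbo.fn_Numbers is made up):

 -- Returns @COUNT sequential numbers starting at 0; two cross joins of sys.columns
 -- usually provide far more than enough rows for a sequence of 1000
 CREATE FUNCTION dbo.fn_Numbers ( @COUNT int )
 RETURNS TABLE
 AS
 RETURN
 (
  SELECT TOP( @COUNT )
   ROW_NUMBER() OVER( ORDER BY ( SELECT 0 ) ) - 1 AS NUMBER
  FROM
   sys.columns a
  CROSS JOIN
   sys.columns b
 )

Then the NUMBERS table in the queries below could simply be replaced with dbo.fn_Numbers( 1000 ).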

So now we have table B. All you have to do now is join them together. But wait, table B returns a bunch of numbers, not months!

The trick, then, is to join on numbers, not months. Does SQL Server have a function for converting months to numbers? In fact it does. DATEDIFF will handle this quite easily.

 SELECT 
  DATEDIFF( MONTH, '2013-01-01', '2013-04-01' )

The result, as you would expect, is 3. There are three months between January and April (and there would be 0 months between January and January of the same year).
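Going the other direction -- from a number back to a date -- is just as easy with DATEADD, and that's exactly what the SELECT list of the final query below does:

 SELECT
  DATEADD( MONTH, 2, '2013-01-01' )  -- returns 2013-03-01 (as a datetime; the final query casts it to DATE)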

So now, you can join on numbers (pseudo-SQL below):

 
 ...

 FROM
  DATA D
 JOIN
  NUMBERS N
 ON
  N.NUMBER = DATEDIFF( MONTH, '2013-01-01', '2013-04-01' )

The problem here, of course, is that only the matching values join together and you end up with a single row. What you ultimately want is a range join, meaning the join can occur over a range of values -- in this case, any number between 0 and DATEDIFF(...), inclusive. Zero is the starting point because ROW_NUMBER starts at 1 and we subtracted 1 from it.

 
 ...

 FROM
  DATA D
 JOIN
  NUMBERS N
 ON
  N.NUMBER <= DATEDIFF( MONTH, '2013-01-01', '2013-04-01' )

So now you understand the concept, if I've explained myself correctly. You can compose the joins however you want. I prefer to use a CTE to make the code look cleaner, but you could just as easily use any form of derived query (CTE, subquery, table valued function). Here's my final product:

-- create a CTE (common table expression - a "derived query") for the numbers table
-- NOTE: the numbers table will never change unless you need more than 1000 units of time 
-- (i.e., 1000 months) 
WITH 
 NUMBERS
AS
(
 SELECT TOP(1000)
  ROW_NUMBER() OVER( ORDER BY ( SELECT 0 ) ) - 1 AS number
 FROM
  sys.columns a
 CROSS JOIN 
  sys.columns b
 CROSS JOIN 
  sys.columns c
 CROSS JOIN 
  sys.columns d 
),

-- create a CTE (common table expression - a "derived query") for the data
 DATA
AS
(
 SELECT
  '2013-01-01' AS BEGINDATE,
  '2013-04-01' AS ENDDATE,
  8 AS DATA          
)

-- work with the derived queries as if they were tables (whose aliases are NUMBERS and DATA)
SELECT

 -- Get the first day of each month by adding N.NUMBER months to BEGINDATE, which is
 -- already the first day of a month (SQL Server 2012's DATEFROMPARTS and EOMONTH
 -- functions can simplify this kind of date math)
 CAST( DATEADD( MONTH, N.NUMBER, D.BEGINDATE ) AS DATE ) AS FIRSTDAYOFMONTH,
 D.DATA
FROM
 DATA D
JOIN   
 NUMBERS N

-- this is where the magic happens ... join on the numbers table which is simply a sequence
-- of numbers from 0-999 with the difference between the dates in terms of months.
-- So, the difference between January and March is (3-1) = 2, so the join will occur 
-- between 0 and 2, that is 0, 1, and 2 -- thereby creating 3 rows, one for each month
ON 
 N.NUMBER <= DATEDIFF( MONTH, D.BEGINDATE, D.ENDDATE )

-- this is just limiting the query so that you won't see months after the current
-- month; remove this for projections into the future
WHERE
 D.ENDDATE < GETDATE()
ORDER BY
 FIRSTDAYOFMONTH

Update: Some ideas for this post came from SQL how to convert row with date range to many rows with each date.