Tuesday, December 31, 2013

My New Years Resolution: Learn About Massively Distributed Database Systems

Database scalability seems to be a problem for many database administrators and data managers. The problem is not "how do I do it" or "what is it" but rather "what do I do with it?" I've worked with the smallest databases in many different traditional RDBMS platforms and I've worked with some pretty huge databases. My last job had several terabyte databases sitting in a single database server -- no clustering. I currently work in an environment with several small databases sitting on an over-sized cluster. Frankly, I think they have it backwards. The big environment should have been clustered and the small environment doesn't need to be. But I digress. The point is, I don't think most database architects know what to do with scalable systems because, frankly, they don't have enough data in their small to medium shops. But that will change -- and soon!

In order to grow and to perhaps work in a bigger shop that has the need for massively distributed data, we have to think bigger and learn about the world outside of our small ponds. Not only that, but our small ponds will be getting bigger. The data landscape is changing quickly.

I'm thinking about this today because I read an article about InfiniSQL (actually, more of a press release) from its principal developer. Some of the concepts didn't make a whole lot of sense to me because I've never worked in an environment that distributes data to multiple geographical locations. Thinking back on the articles I've read recently, it seems like distribution of data is going to be the big problem to solve in the future and it seems like an important niche to dive into. After all, last year's contender for word-of-the-year was Big Data.

But Big Data is conceptual in nature and very few people understand what it really is. Big Data is not a terabyte or two terabytes in a database cluster -- not even ten or twenty or fifty or a hundred. Big Data is more like a hundred terabytes per day. The actual storage systems are dozens to hundreds of petabytes in size. Big Data was actualized by massive players -- Yahoo and Google and maybe even Pixar. I found this out by reading Big Data: A Revolution That Will Transform How We Live, Work, and Think.

What I would like to talk about here is not Big Data. Big Data is going to get bigger -- that's true. Next decade's databases will probably be 10-100 times bigger than the ones we have now, even for small to medium sized firms. Since storage is so cheap we've all begun to hoard data. But in its most fundamental format data -- and especially Big Data -- is less than useless!

The time for some is now -- and the time for the rest of us is soon -- when we will actually have to start leveraging our treasure mounds of data for something beyond simple reports. The new buzzwords are Data Visualization, Data Mining, and the like. We don't just need to store the data, we need to extract useful information from it.

Sure, Data Visualization and Data Mining have been around for a while as concepts but they haven't really entered the lexicon of the common database administrator. The reason is because DV and DM have been so difficult to accomplish without a degree in higher mathematics -- seriously! Have you ever wanted to compare the graphs of two time periods to detect their "sameness" or "differentness?" There's a reason we draw graphs and give them to analysts to interpret -- because the human brain can do so much more with two simple graphs today than a team of computer scientists. The algorithms and mathematics are out there, for the most part, but where data, computer algorithms, and higher mathematics intersect is the trivium (the crossroads) where only a few have been able to tread successfully.

Those people, with that trinity of accessible skills, are the ones who make $400-$500k annually as developers. They make so much money and live such charmed lives because conceptually it's such an exquisitely difficult nirvana to reach. It's hard enough learning how to program, and even harder to understand the intricacies of data, and harder still to master the highest orders of mathematics. Put them together and you have a requirement of 20 years of intense college education peppered with summers of internships and late and lonely nights of coding and experimentation. Unfortunately, twenty years in this industry is a lifetime and everything will have changed in 20 years.

That's why it's important to keep up with the current trends in the data sciences. Sure, there are a few geniuses here and there who have the right mix of skills to command $500k (in a non-executive role) but they are far and few between. For the rest of the world, there are the data scientists who will earn $150k, $125k, $100k, $75k per year. They are not the ones who develop the database software or the analytics software necessarily (albeit there will be some small-scale development for those who can push themselves). Rather, they will be the ones who learn to listen to the tracks to hear the oncoming trains. In other words, they will be the ones who learn to leverage existing software and push their small shops to the bleeding edge by testing prototype systems in strange and new combinations to see what works and what doesn't.

So here's my New Year's resolution. 2014 will be the year in which I push my skills into new territories. I don't just want to read about Big Data, Data Visualization, Data Mining, Statistical Analysis, Distributed Systems, etc. I want to learn their intricacies and nuances and I want to listen to what the data science community is saying about them. I want to learn about horizontal scaling, vertical scaling, large scale clustering, and anything having to do with storing and mining that data for useful information. Perhaps next year's resolution will be to come up with a problem to solve. Until then!

No comments:

Post a Comment