The opinions and positions expressed are my own and do not necessarily reflect those of my employer
Image by Délirante bestiole [Lumpen river] via Flickr
Data Management is an area that I work in and follow with a passion. “Big Data” is really the bleeding edge of this, focusing on the cloud and the requirements for the high end of scale, performance and data volume.
I have had another long hiatus from blogging, again due to my availability. This time I have been working hard with Red Rock to take their RockSolid Product and DBA Out-Task services to the next level. Internally we have done some restructuring to give RockSolid the right model to take it forward over the next few years.
Externally, we have launched new web sites, product editions, service offerings and pricing models. It has been a busy few months! Take a look:
Cassandra is one of the most interesting NoSQL platforms at the moment. And by most interesting what I really mean is the most clearly justifiable. Some NoSQL platforms offer new data models, improved query interfaces and/or good single node performance through relaxed consistency models. As a database guy, however, I still find the justification for throwing out the RDBMS baby and bathwater difficult at this point, as NoSQL platforms tend to be highly focused on one aspect of data management and very immature in all other areas. Cassandra is somewhat different, as it is more mature in a number of key areas (albeit still immature in others). These are areas that can make Cassandra more justifiable for the right project when compared with a more traditional RDBMS-based solution, because Cassandra’s primary capabilities can’t easily be replicated on those traditional mainstream platforms.
I work in all markets of the database industry, from web & startup through the largest and most established enterprises. And to be completely honest, the name Ingres has not come up in conversation very much at all. 10 years ago maybe more often, but recently not all that much. But Ingres has been quietly ticking away. Despite being largely off the radar, they still have a sizable and loyal customer base, global offices and a focused & dedicated management team. And importantly they have an open source business model which actually appears to be working.
I have noticed a definite increase in NoSQL buzz over the last few months. This is partly confirmed by Google Trends, a service that shows how search topics rank over time:
The last couple of months have seen a dramatic rise in both the number of searches and the number of news items relating to NoSQL.
But the traditionalists need not fret just yet. Interest in NoSQL is still but a blip on the data management radar, as demonstrated by this comparison between NoSQL and MySQL search rankings:
Image by Aranda\Lasch via Flickr
One of my favorite terms at the moment is “Big Data”. While all terms are by nature subjective, in this post I will try to explain what Big Data means to me.
Big Data is the “modern scale” at which we are defining our data usage challenges. Big Data begins at the point where we need to seriously start thinking about the technologies used to drive our information needs.
While Big Data as a term seems to refer to volume, this isn’t the case. Many existing technologies have little problem physically handling large volumes (TB or PB) of data. Instead, the Big Data challenges result from the combination of volume and our usage demands on that data. And those usage demands are nearly always tied to timeliness.
Big Data is therefore the push to utilize “modern” volumes of data within “modern” timeframes. The exact definitions are of course relative & constantly changing; however, right now this is somewhere along the path towards the end goal. This is of course the ability to handle an unlimited volume of data, processing all requests in real time.
More than at any point in the past, data related technologies are the focus of research & innovation. But Big Data challenges won’t be solved anytime soon by a single approach. Keeping in mind all the different platforms that Big Data is having an impact on (web, cloud, enterprise, mobile) combined with all the Big Data domain challenges (transaction processing, analytics, data mining, visualization) as well as many of the Big Data characteristic requirements (volume, timeliness, availability, consistency), it is easy to see how no single technology will provide a cover-all solution for the eclectic mix of needs. Instead, a broad set of technologies, each focused on meeting a specific set of needs, is improving our ability to manage data at scale.
A few common areas of innovation that I describe as technologies relevant to Big Data include: MPP Analytics, Cloud Data Services, Hadoop & Map/Reduce (and associated technologies such as HBase, Pig & Hive), In-Memory Databases, some Distributed NoSQL databases and some Distributed Transaction Processing databases.
Someone asked me if Big Data was just tools to “try and sell them more relevant crap they don’t want”. While up-sell & targeted advertising are two major uses of Big Data technologies, I hope that my work and that of others in this field does result in achievements more significant than just these.
When describing the point of Big Data I like to think about how the Internet has changed my life in general. By having unlimited & timely access to information we are now better informed in all areas of our existence than ever before. However, we are now facing the problem that there is fast becoming too much data for us to digest in its raw form. To move forward in our understanding we will need to rely on technology to provide timely, summarized & relevant data across all aspects of our lives. This is what those working in Big Data are setting out to achieve.
Image by scottpowerz via Flickr
Next year will be the start of much more difficult times for the existing MPP start-ups and early-stage companies (including Greenplum, Vertica, Netezza, Xtreme Data, Kognitio, Aster Data etc.). This is because Microsoft's introduction of an MPP solution (Madison, now known as Parallel Data Server) marks the start of the commoditization of the technology and market. To understand this you need to understand the sales process for MPP. It goes something like:
CIO: We need a data warehouse, what platform should we use?
DBA: We are an [Oracle | SQL Server] shop so use that.
Some time later…
CIO: Our data warehouse is very slow and people are complaining.
DBA: The server is too small as you have loaded much more data than planned. We need a bigger box.
Some time later…
CIO: Our data warehouse is slow again.
DBA: I know but we have the biggest box we can get and we have tuned everything and I am out of ideas.
CIO: Our data warehouse is slow.
Consultant: Yes, of course it is. You need to use an MPP platform.
CIO: We are an [Oracle | SQL Server] shop, so do these vendors have a solution?
Consultant: [Yes but it will cost you | No].
CIO: What about [SQL Server | Oracle]?
Consultant: [No | Yes but it will cost you].
CIO: What about Teradata?
Consultant: Yes but it will cost you.
CIO: Oh. Any other options?
Consultant: Yes, there are a bunch of start-ups selling MPP solutions.
CIO: Which one is best?
Consultant: They are all good but all slightly different.
CIO: Ok, make a short list and we will do a proof of concept to see which platform does what we want at the price we want.
Some months later…
CIO: Congratulations [Vertica | Greenplum | Netezza | Aster Data | Kognitio], you have won our business.