Cassandra is one of the most interesting NoSQL platforms at the moment. And by most interesting what I really mean is the most clearly justifiable. Some NoSQL platforms offer new data models, improved query interfaces and/or good single node performance through relaxed consistency models. As a database guy however, the justification for throwing out the RDBMS baby and bathwater is still difficult at this point as NoSQL platforms tend to be highly focused on one aspect of data management, and very immature in all other areas. Cassandra is somewhat different as it is more mature in a number of key areas (albeit still immature in others), areas that can make Cassandra more justifiable for the right project when compared with a more traditional RDBMS based solution. This is because Cassandra’s primary capabilities can’t easily be replicated on those traditional mainstream platforms.
Cassandra’s primary focus is on scalability. More specifically that is scalability combined with reasonable functionality and performance & availability when at scale. While some other platforms are trying to bolt on scalability/availability to their functionality rich data engines, Cassandra already has proven real life examples running 150 node clusters. Notable uses of Cassandra include Digg, Facebook, Twitter, Reddit & Rackspace. And the feedback from these sites is very good; commonly Cassandra has been expressed as the hands down winner for transaction processing performance at scale.
One of the key contributors to Cassandra has been Jonathan Ellis and until recently he has been working on Cassandra while employed by RackSpace. But, I was pleased to hear that Jonathan, and business partner Matt Pfeil, have taken the step of setting up their own Cassandra focused company, Riptano.
Riptano are providing the commercialized support services around the open source Cassandra that are necessary for the platform to survive and grow. While such services may be less important for adoption from the techie rich Web 2.0 crowd, for any platform to become mainstream there needs to be an escalation path for companies uninterested or unable to tinker with the code themselves. Riptano provides those services which can allow Cassandra use to start to grow further.
Just as importantly, this move gives representation to Cassandra and provides an entity whose best interests will be served through advocacy of the platform. While Jonathan and others had been doing a fine job of this to date personally, another corporation investing commercial dollars into advocacy will be important to ensure Cassandra’s message isn’t drowned out by more highly funded alternatives.
Riptano has received some early funding from RackSpace and I believe already has a few customers signed for their support services. Best luck Jonathan & Matt.
While I have written about NoSQL generating a lot of buzz recently I have also written that when compared with the activity that is occurring day in, day out on relational databases it is very minor. I would suggest those working with NoSQL databases are still a fraction of a percent of those working with more traditional relational databases.
“When his company first started making MongoDB available for free downloads last year, they numbered a few hundred a month. But traffic has rapidly built up to a level of 30,000 downloads a month, he said.”
This high number piqued my interest so I quickly did a little checking on the MongoDB site. I couldn’t find any download stats but I did notice some stats relating to the number of people who had signed up to the support forums:
The mongodb-user Google group has 1682 members
The mongodb-announce group, the “This group is for releases and important updates to MongoDB that anyone running MongoDB in production should subscribe to” has 53 members.
The MongoDB site blog has 916 subscribers in Google Reader
I could be wrong and the figure might be accurate, but perhaps this may have actually been page views on the MongoDB site rather than software downloads? If it is accurate then I will take my hat off to 10gen, they have come much further than I thought.
I am one of the biggest proponents of “the right tool for the right job”, and I think NoSQL databases can be the right tool in a lot of cases. But we need to keep our heads about us also. We still have a very long way to go before any of this NoSQL stuff is considered mildly mainstream.
BTW, I will ping Charles Babcock for comment.
* Edit: I have removed the MySQL stuff. I understand MySQL has 70,000 downloads a day for comparison.
I have noticed a definite increase in NoSQL buzz over the last few months. This is partly confirmed by Google Trends, a service that shows data relating to how search topics rank:
The last couple of months have seen a dramatic rise in both the number of searches and the number of news items relating to NoSQL.
But the traditionalists need not fret yet; interest in NoSQL is still but a blip on the data management radar, as demonstrated by this comparison between NoSQL and MySQL search rankings:
It will be interesting to see how the dynamics of this change throughout 2010 though.
I have noticed a sharp change of focus in venture funding for data orientated companies over the last six months. Many VCs have lost some interest in funding data start ups that are doing anything around relational data management. Instead the interest is in NoSQL technologies, from key/value stores through to Hadoop based data management layers.
I am highly supportive of the development, and therefore the funding, of a more diverse set of big data technologies than those based on the relational model alone. However I also advise caution not to throw the baby out with the bathwater. Relational data management technologies continue to be a focus of innovation. There are companies working on game changing steps forward which have relational underpinnings.
The relational model is going to continue to be the underlying model of most of the world’s structured data for the foreseeable future. Many opportunities for innovation exist and will continue to exist around this fundamental model into the future.
A mindset that relational is yesterday’s technology and non-relational is tomorrow’s defies conventional wisdom and will lead to great opportunities being missed.
One of my favorite terms at the moment is “Big Data”. While all terms are by nature subjective, in this post I will try and explain what Big Data means to me.
So what is Big Data?
Big Data is the “modern scale” at which we are defining our data usage challenges. Big Data begins at the point where we need to seriously start thinking about the technologies used to drive our information needs.
While Big Data as a term seems to refer to volume this isn’t the case. Many existing technologies have little problem physically handling large volumes (TB or PB) of data. Instead the Big Data challenges result out of the combination of volume and our usage demands from that data. And those usage demands are nearly always tied to timeliness.
Big Data is therefore the push to utilize “modern” volumes of data within “modern” timeframes. The exact definitions are of course relative & constantly changing, however right now this is somewhere along the path towards the end goal. This is of course the ability to handle an unlimited volume of data, processing all requests in real time.
So what are Big Data technologies?
More than at any point in the past, data related technologies are the focus of research & innovation. But Big Data challenges won’t be solved anytime soon by a single approach. Keeping in mind all the different platforms that Big Data is having an impact on (web, cloud, enterprise, mobile) combined with all the Big Data domain challenges (transaction processing, analytics, data mining, visualization) as well as many of the Big Data characteristic requirements (volume, timeliness, availability, consistency), it is easy to see how no single technology will provide a cover-all solution for the eclectic mix of needs. Instead a broad set of technologies that are each focused on meeting a specific set of needs are improving our ability to manage data at scale.
A few common areas of innovation that I describe as Big Data technologies include: MPP Analytics, Cloud Data Services, Hadoop & Map/Reduce (and associated technologies such as HBase, Pig & Hive), In-Memory Databases, some Distributed NoSQL databases and some Distributed Transaction Processing databases.
So what is the point of Big Data?
Someone asked me if Big Data was just tools to “try and sell them more relevant crap they don’t want”. While up-sell & targeted advertising are two major uses of Big Data technologies, I hope that my work and that of others in this field does result in achievements more significant than just these.
When describing the point of Big Data I like to think about how the Internet has changed my life in general. By having unlimited & timely access to information we are now better informed in all areas of our existence than ever before. However, we are now facing the problem that there is fast becoming too much data for us to digest in its raw form. To move forward in our understanding we will need to rely on technology to provide timely, summarized & relevant data across all aspects of our lives. This is what those working in Big Data are setting out to achieve.
Last week I spent some time speaking with Kevin Weil, head of analytics at Twitter. Twitter, from a technology perspective, has had a bit of a hard time due to their stability issues in their early days. Kevin was keen to point out that he feels this was due to the incomparable growth Twitter was experiencing at the time and their constant struggle to keep up. Kevin was also keen to show that Twitter prides themselves on striving for engineering excellence, the creation & contribution to new technologies and generally assisting in pushing the boundaries forward. Our conversation naturally centered on analytics at Twitter.
Twitter, like many web 2.0 apps, started life as a MySQL based RDBMS application. Today, Twitter is still using MySQL for much of their online operational functionality (although this is likely to change in the near future – think distributed), but on the analytics side of things Twitter has spent the last 6 months moving away from running SQL queries against MySQL data marts. This was because their need for timely data was becoming a struggle with MySQL, particularly when dealing with very large data volumes and complicated queries. For Web 2.0 the ability to understand, quantify and make timely predictions from user behavior is very much their life blood. When Kevin arrived at Twitter 6 months ago he was tasked with changing the way Twitter analyzed their data. Now the bulk of their analytics is executed using a Hadoop platform with Pig as the “querying language”.
Hadoop is a distributed shared-nothing cluster which locates data throughout the cluster using a virtualized file system. What has made Hadoop particularly popular for large scale deployment is the comparative ease of writing distributed functions through a process known as map/reduce. Map/reduce hides much of the complexity of running distributed functions, even when running over a very large number of nodes. This allows the developer to focus on their “application logic” rather than worrying about specifics of the execution process (Hadoop handles distribution of execution, node failures, etc). But in saying this, expressing complicated application logic directly in map/reduce functions can become quite laborious as many pipelined map/reduce functions may be required to take raw data through to a useful processed result. Because of this complexity several higher level scripting languages have appeared to abstract this.
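To make the map/reduce idea a little more concrete, here is a minimal single-process sketch in Python of the classic word count. It is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are made up for illustration. On a real cluster the framework would run the map and reduce calls on many nodes and handle the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit (key, value) pairs for one input record (here, one line of text)."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key. On a cluster the framework does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values seen for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence, in one process (illustration only)."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    lines = ["pig hive pig", "hive hadoop"]
    print(run_job(lines))
```

The developer only writes the two small functions that carry the application logic. Anything more involved than a count usually means chaining several such jobs together, which is where the laborious part comes in.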
Pig is one such scripting language for Hadoop. Pig takes the developer’s requirement expressed in the script and produces the underlying map-reduce jobs that are executed on Hadoop. This abstraction is incredibly important as without it the complexity of expressing difficult analytical ‘queries’ directly in map/reduce would be highly time consuming & error prone. This can be thought of as being similar to the way SQL is a higher level abstraction language that hides all the query plan routines (written in C) that operate on the data in a traditional RDBMS. Of course abstraction provides increased efficiency in creating analytical routines, but comes at a performance cost. Kevin quantified his experience: he found that a Pig script is typically 5% of the code of native map/reduce, written in about 5% of the time. However, queries typically take between 110-150% of the time to execute that a native map/reduce job would have taken. But of course, if there is a routine that is highly performance sensitive they still have the option to hand-code the native map/reduce functions directly.
Ok, so why use Hadoop and Pig instead of a more traditional approach like an MPP RDBMS? Kevin explained that there were a few reasons for this. Firstly Twitter, like many Web 2.0 companies, is committed to open source and likes to use software that has a low entry cost but also allows them to contribute to the code base. Kevin mentioned that Twitter did look at some of the open source MPP RDBMS platforms but were less than convinced of their ability to scale to meet their needs at the time. And the second reason is exactly that, scale. Twitter is understandably coy on their exact numbers, but they have hundreds of Terabytes of data (but less than a Petabyte) and one could assume that to get reasonable performance they are running Hadoop on a few dozen nodes (this is a guess, Twitter didn’t say). As they grow, analytics will become more important to their business and this may expand to hundreds (or thousands) of nodes. A “few hundred” nodes is right on the upper limit of what is possible today with the world’s most advanced MPP RDBMSs. Hadoop clusters, on the other hand, grow well into the hundreds and even the thousands of nodes (e.g. at Google, Facebook etc).
So Hadoop was the platform choice, but why Pig? There are other “analytical” scripting languages that sit over Hadoop, notably Hive which was popularized by Facebook (Pig was popularized by Yahoo). On discussing the merits of Pig vs Hive it became apparent that Hive was more in tune with a traditional approach (“database like”). Hive requires data to be mapped to a given structure and the queries (using a SQL like derivative) are submitted against that schema. Pig on the other hand is less prescriptive in terms of schema and individual queries can define the structure of the data for that execution. In addition, Pig is more of a “procedural” language allowing the complicated data flow process to be more easily controlled and understood by the developers.
So, as mentioned, Hadoop is a batch based job processing platform. Jobs (in this case map/reduce jobs generated from the Pig queries) are submitted and results are returned sometime in the future. Exactly when in the future varies from a few minutes (e.g. they run jobs hourly which only take a few minutes to run) through to many hours for jobs that run over much larger sets of data. This leaves a gap in “near real-time” analytics between the lightweight queries they can run on the transactional system and the more intense Hadoop based analytics. This has been a space that Twitter has been investigating solutions to fill. This space will be used for things like improved abuse detection, issue analysis and so on. Twitter is currently considering their data platform options here including Cassandra, HBase and may even decide to use a closed source MPP solution to fill this need (I can’t say what, sorry) due to the lack of suitable open source MPP alternatives.
For more technical info on Twitter’s use of Hadoop and Pig you can check out Kevin’s slide deck from the recent NoSQL East conference.
As I have mentioned before, the MPP data warehouse space is quite full with many new companies appearing over the last few years. The trick for the newer entrants of course, is to differentiate themselves from the herd to overcome their lack of history and experience.
Aster Data has started to do this with the release of their v4.0 platform. They are now promoting their focus as being on “Big Data Applications” rather than the more generic Big Data Warehousing. This seems to have entailed a rethink about how they were positioning their in-database Map/Reduce functionality (which was obtuse in definition for me at least) and they are now marketing their in-engine code executing capabilities in a much clearer way. That is, to allow the push down of application logic into the MPP environment making Aster Data an MPP Data Application Platform rather than just an MPP Database Platform. While this may largely just be a change in marketing and semantics (and a new logo), I do think this helps to make Aster stand out and offers them a more unique go to market.
I have yet to look into the details of this, but in theory at least moving higher level application components down into the MPP environment would seem beneficial from a performance and robustness perspective. Interestingly, Teradata has recently been working with SAS to move parts of their analytics stack down into Teradata’s stack.
I was speaking with Michael Stonebraker this morning. I mentioned that lately many have been referencing comments he has made over the last couple of years. And I also mentioned that many had interpreted them as implying that the RDBMS is “doomed”. Mike has been saying the same thing for years, but the current NoSQL movement seems to have picked up on this and is highlighting that one of the RDBMS's own pioneers is predicting its demise.
I asked Mike to clarify this. My interpretation of his response is as follows. I understand that he doesn’t believe the relational database itself is doomed. Instead the current general purpose implementations, or “elephants” using his words, were out of date. By moving away from a historical GP function into something more specific in focus, either in transaction processing or analytics, you can easily get 50x performance improvement over GP RDBMS. This doesn’t necessarily mean moving away from the “relational” nature, but instead changing some core design principles for how a RDBMS is implemented. It is this improvement factor that will see “new” specialist platforms overtake “old” general purpose platforms. That is gradually, over time. However Mike also mentioned the relational data model doesn’t make sense in a number of disciplines, particularly in sciences, and alternative modeling paradigms will offer benefits to this market (hence his focus on SciDB). So while relational is a valid data model, other data models are also needed.
I have a similar position to Mike, but perhaps with a few differences.
- Firstly I agree with the mantra that current GP RDBMS platforms provide only a “middle of the road” capability, and we have gone too far in using a GP RDBMS for everything. However I do believe there is a long term future for the GP RDBMS. A general purpose application requirement will continue to be well suited for a general purpose platform. With a specialist only approach, a general purpose requirement may need both a specialist OLTP platform and a specialist Analytics platform to provide the same capability.
- I agree that with an extreme requirement, either analytics or transaction processing, a specialist platform is well suited. But I don’t see the choices of just MPP or memory resident RDBMS as being a broad enough set. Apps that use a db just as a persistence cache will benefit from a high performing, scalable database platform with much tighter integration with the object model. I am not sure any of the current NoSQL platforms have it quite right yet, but when these guys eventually get together with the database guys and work on these things together they may get there.
- I don’t think a 50x performance speed up on its own is enough to drive change in OLTP. I have written before how difficult it is to get into this market and how tight Oracle, Microsoft & IBM have this sewn up. But I don’t believe it is impossible, I think you just need to bring slam dunks on multiple fronts (performance just being one of them).
Anyway I feel like I am a bit of a broken record at the moment. I have been addressing the “is the RDBMS doomed” question a couple of times a day for some time. Time to focus on something else for a bit.
I haven’t blogged in over a month now. This is for a number of reasons. Firstly I have been flat out with various activities. This included a trip to VLDB in Lyon mid month. Secondly, a lot of the companies I have spoken with this month aren’t ready to speak publicly, hence no blog posts resulting from these sorts of discussions.
However there has been a whiff of a change in the air in terms of focus that is interesting and worth highlighting. After years of lots of innovation around data analytics, OLTP is starting to make a comeback in terms of reclaiming some of the limelight. Much more on this between now and the end of the year, but a couple of things to watch:
I was fortunate enough to speak with Marcin Zukowski earlier about VectorWise. If you missed it, VectorWise came out of stealth mode a day or two ago. They have announced a joint partnership with Ingres and essentially are claiming impressive analytic RDBMS performance gains on conventional hardware.
To start with, a key message that I think needs to be communicated here is that this is not a product announcement. Ingres and VectorWise have announced a partnership in which they of course plan to build products together, today those products are still in the works.
VectorWise is a spin out of CWI based on research that was undertaken by Marcin and others, research that centered on MonetDB. Explaining the essence of VectorWise is difficult because it is largely internal DBMS data storage & processing logic, but I will have a go.
The modern RDBMS is based around design principles that stem from general purpose OLTP roots and historical hardware architectures (this is partially true even for some of the newest analytic platforms). These design principles in a nutshell focus on the fact that disk is slow & CPU is fast. Data is seeked or partially scanned off disk and cached. Row-by-row (tuple-by-tuple) operators process that data, passing the outcome of each operator to the next as part of a query’s execution plan until ultimately producing the result.
Traditionally I/O is the main bottleneck, so to make the database faster you add more I/O bandwidth. Today, disk requirements may be up to 100x the actual capacity needs, so many disks are necessary to achieve the I/O bandwidth to provide performance for an analytical RDBMS implementation. Even though the RDBMSs may parallelize query operators across cores, this typically works by partitioning data between cores, yet each is still processing on a tuple-by-tuple basis.
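The tuple-by-tuple style described above can be sketched with Python generators, where each operator pulls one row at a time from the operator beneath it. This is only an illustration of the control flow, not how any particular RDBMS is written; the operator names and the `ORDERS` table are hypothetical.

```python
def scan(table):
    """Leaf operator: hand rows to the parent one at a time."""
    for row in table:
        yield row


def select(rows, predicate):
    """Filter operator: pass through only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows, columns):
    """Projection operator: keep only the requested columns of each row."""
    for row in rows:
        yield tuple(row[c] for c in columns)


def total(rows, index=0):
    """Aggregate operator: sum one field across all incoming rows."""
    result = 0
    for row in rows:
        result += row[index]
    return result


# Hypothetical sample data, invented for this sketch.
ORDERS = [
    {"region": "EU", "amount": 10},
    {"region": "US", "amount": 25},
    {"region": "EU", "amount": 5},
]


def eu_total(table):
    """Roughly: SELECT SUM(amount) FROM orders WHERE region = 'EU'."""
    plan = project(select(scan(table), lambda r: r["region"] == "EU"), ["amount"])
    return total(plan)


if __name__ == "__main__":
    print(eu_total(ORDERS))
```

Every row pays for a chain of function calls as it climbs the plan. That per-row overhead is small when the engine spends most of its time waiting on disk, and much more visible when it does not.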
Conventional wisdom? Well maybe. You see disk is only really “slow” when it is doing random seeks. Give a disk something sequential to do on the other hand and things are very different. Modern disks are able to sequentially scan in the range of 150MB per second. An array of 10 disks should therefore be able to return sequentially read data in the range of 1GB per second.
When it comes to databases, column based storage has been found to effectively structure data for a) high levels of compression and b) sequential access. VectorWise makes use of both of these technologies to help it achieve high levels of sequential I/O. The problem now however is that disk may no longer be the bottleneck. While we can get 1GB a second sequentially off disk relatively easily & cheaply, processing tuple-by-tuple at this rate is very difficult. As it turns out, an RDBMS may only achieve a data processing rate of 50MB a second per CPU core. This makes the CPU processing limitations a big bottleneck for analytics data sets; assuming the above figures we would need over 20 cores to keep up with 10 disks (and of course CPU cores don’t scale linearly).
If we step out of the database world for the moment into the world of high end computer games, or high end scientific processing, we find their use of current CPU technology is much more advanced than what we are used to. They are using new CPU extensions (MMX, SSE, SSE2, SSE4.2 etc) to parallelize & pipeline computation within a CPU’s core, meaning they are processing orders of magnitude more instructions per core than what a traditional RDBMS typically has been able to. The exact details are too low level to discuss here (many of the research papers are available online) but it is fair to say, modern CPU architectures contain advanced features that to date haven’t effectively been exploited by database vendors.
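The back-of-envelope arithmetic behind that core count is easy to check. The sketch below uses only the rates quoted in this post and assumes perfect linear scaling across cores, which real hardware will not deliver, so treat the results as lower bounds. The helper name `cores_needed` is made up.

```python
import math

DISK_SCAN_MB_S = 150   # sequential scan rate per disk, as quoted in the post
DISKS = 10
ARRAY_MB_S = 1000      # the post's rounded "1GB per second" figure for the array
CORE_MB_S = 50         # tuple-by-tuple processing rate per core, as quoted


def cores_needed(io_mb_s, per_core_mb_s):
    """Cores required to consume io_mb_s, assuming perfect linear scaling."""
    return math.ceil(io_mb_s / per_core_mb_s)


if __name__ == "__main__":
    raw_array_mb_s = DISK_SCAN_MB_S * DISKS
    print("raw array bandwidth (MB/s):", raw_array_mb_s)
    print("cores for the rounded 1GB/s figure:", cores_needed(ARRAY_MB_S, CORE_MB_S))
    print("cores for the raw array bandwidth:", cores_needed(raw_array_mb_s, CORE_MB_S))
```

With the rounded 1GB a second figure the floor is 20 cores, and with the unrounded 10 x 150MB it is 30, before allowing for the fact that cores do not scale linearly.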
Enter VectorWise. Their aim is to marry storage technologies which allow high levels of sequential I/O to occur with query processing logic which is designed for modern CPU architectures. Rather than process tuple-by-tuple they are processing “vectors”, groups of tuples, leveraging modern CPU extensions and high levels of on-chip cache to allow the CPU to carry out higher data processing throughput. The result is instead of the 50MB a second in a tuple-by-tuple approach, VectorWise are able to achieve processing rates in the range of 500MB-1GB a second per core in some situations. This means processing rates of 8GB a second or more could be possible with relatively low end hardware.
“In some situations” is the key point to stress here; this obviously isn’t a blanket gain that applies to all analytic data sets, workloads and query requirements. Just what those situations are will be the key to their technology’s success: how well it actually applies to real world data sets and queries. I wouldn’t expect to see too many specific examples on this until a product beta appears. But the theory is VectorWise can offer high levels of processing capabilities with existing mainstream hardware. At this point VectorWise isn’t even focusing on MPP; instead they are single node focused. If their scalability claims pan out you can imagine how this could allow a single node solution to be competitive with existing low to mid scale MPP solutions that are based on a more conventional query processing architecture.
This isn’t VectorWise’s only trick up their sleeve. They are also leveraging research around column based storage, compression, piggy-backed (shared) scans and so on. Much of the research that has been adopted by VectorWise is referenced from their web site.
So VectorWise have impressive technology, so why then partner with Ingres rather than a larger vendor (or going at it alone)? Marcin offers a few reasons. Firstly, as academics they feel strongly that open source is cool so this path was greatly preferred over a relationship with a non-open vendor. Secondly Ingres will allow them to deliver their technology in an uncompromised fashion. Marcin mentioned that if they had partnered with one of the big three vendors, that vendor’s existing product strategies and investments would have likely meant their ideas could have only been implemented in partial form. Ingres on the other hand is going to allow them more of a green field. And of course, a partnership with Ingres makes sense from a go to market perspective as Ingres already has a worldwide reputation, a global customer base, sales & marketing capabilities etc.
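As a rough illustration of the vector-at-a-time idea, the Python sketch below has each operator work on a small batch of column values per call rather than a single row. This is not VectorWise code, and plain Python will not use SIMD instructions or show the cache effects described above. It only shows how the per-call overhead gets spread over many values. `VECTOR_SIZE` and the function names are hypothetical.

```python
from itertools import islice

VECTOR_SIZE = 4  # tiny on purpose; a real engine would size vectors to fit CPU cache


def scan_vectors(column, size=VECTOR_SIZE):
    """Leaf operator: yield the column as fixed-size vectors instead of single values."""
    it = iter(column)
    while True:
        vector = list(islice(it, size))
        if not vector:
            return
        yield vector


def select_gt(vectors, threshold):
    """Filter operator: one call handles a whole vector of values."""
    for vector in vectors:
        kept = [v for v in vector if v > threshold]
        if kept:
            yield kept


def sum_vectors(vectors):
    """Aggregate operator: sum each vector in a tight loop, then combine the partial sums."""
    return sum(sum(vector) for vector in vectors)


def total_above(column, threshold):
    """Roughly: SELECT SUM(x) FROM t WHERE x > threshold, processed a vector at a time."""
    return sum_vectors(select_gt(scan_vectors(column), threshold))


if __name__ == "__main__":
    print(total_above([1, 5, 10, 3, 8, 2, 7], 4))
```

The operators are called once per vector instead of once per row, and the inner loops are simple enough that a compiled implementation could hand them to the CPU extensions mentioned earlier.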
Marcin confirmed that Ingres have an exclusive license to their technology, and first option to acquire them for a certain period of time. This allows Ingres to really invest in the relationship without the fear of the carpet being pulled out from under them.
VectorWise clearly are applying innovative research to analytical RDBMS requirements. But as interesting as the technology sounds, the proof of the pudding will be how well these design principles translate to real-world analytical processing requirements in mainstream product form. This remains to be seen, but Ingres and their community clearly has high hopes.
VectorWise is clearly differentiated when compared with a traditional mainstream RDBMS running on mainstream hardware. However in this current market we have lots of different approaches to the problems described. Kickfire for example use their own SQL Chip processor to increase data processing rates and other appliance vendors are using FPGAs etc for similar purposes. The comparison of these different approaches and the relative effectiveness of each approach still need to be examined, however a mainstream hardware approach has obvious benefits.
Tony Bain is an expat Kiwi, Father, Entrepreneur, Angel Investor, Blogger, and occasional Writer for Read Write Web. He is a Director for RockSolid SQL and the founder of Tony Bain Group.