May 03, 2010

Riptano for Cassandra


Cassandra is one of the most interesting NoSQL platforms at the moment.  And by most interesting, what I really mean is the most clearly justifiable.  Some NoSQL platforms offer new data models, improved query interfaces and/or good single node performance through relaxed consistency models.  As a database guy, however, I still find it difficult to justify throwing out the RDBMS baby with the bathwater, as NoSQL platforms tend to be highly focused on one aspect of data management and very immature in all other areas.  Cassandra is somewhat different, as it is more mature in a number of key areas (albeit still immature in others).  These are areas that can make Cassandra more justifiable for the right project when compared with a more traditional RDBMS-based solution, because Cassandra’s primary capabilities can’t easily be replicated on those traditional mainstream platforms.

Cassandra’s primary focus is on scalability.  More specifically, that is scalability combined with reasonable functionality, performance and availability at scale.  While some other platforms are trying to bolt scalability and availability onto their functionality-rich data engines, Cassandra already has proven real-life examples running 150 node clusters.  Notable users of Cassandra include Digg, Facebook, Twitter, Reddit & Rackspace.  And the feedback from these sites is very good; Cassandra is commonly described as the hands-down winner for transaction processing performance at scale.

One of the key contributors to Cassandra has been Jonathan Ellis, who until recently worked on Cassandra while employed by Rackspace.  I was pleased to hear that Jonathan and business partner Matt Pfeil have taken the step of setting up their own Cassandra-focused company, Riptano.

Riptano are providing the commercial support services around the open source Cassandra that are necessary for the platform to survive and grow.  While such services may be less important for adoption by the techie-rich Web 2.0 crowd, for any platform to become mainstream there needs to be an escalation path for companies uninterested in, or unable to, tinker with the code themselves.  Riptano provides those services, which can allow Cassandra usage to grow further.

Just as importantly, this move gives representation to Cassandra and provides an entity whose best interests will be served through advocacy of the platform.  While Jonathan and others had been doing a fine job of this to date personally, another corporation investing commercial dollars into advocacy will be important to ensure Cassandra’s message isn’t drowned out by more highly funded alternatives.

Riptano has received some early funding from Rackspace and I believe already has a few customers signed for their support services.  Best of luck, Jonathan & Matt.


May 01, 2010

Ingres Vectorwise smokes it!

I work in all markets of the database industry, from web & startup through the largest and most established enterprises.  And to be completely honest, the name Ingres has not come up in conversation very much at all.  10 years ago maybe more often, but recently not all that much.  But Ingres has been quietly ticking away.  Despite being largely off the radar, they still have a sizable and loyal customer base, global offices and a focused & dedicated management team.  And importantly they have an open source business model which actually appears to be working.

I wrote last year that their "behind the scenes" status had the potential to change.  Ingres had been very clever and worked out a partnership with Peter Boncz’s Vectorwise.  And that relationship was promising big things for data analytics from a price/performance perspective.  But at the time it was all promise, and little in the way of substance had been produced.

But that has been changing.  A month or two back Ingres somewhat quietly launched their Beta program for the Ingres Vectorwise technology.  This technology, if you have not read about it before, combines an analytical column store and “vectorized processing” to give much greater throughput rates than previously possible on your existing hardware (Vectorwise is a single node solution, i.e. not MPP).
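To give a feel for what "column store plus vectorized processing" means in general terms, here is a toy sketch in Python.  It is not Vectorwise's implementation: the table, the column names and the `vector_size` parameter are all made up for illustration, and a real engine would run tight compiled loops over cache-sized vectors rather than Python code.

```python
from array import array

# Hypothetical order data as a row store: (order_id, price, quantity).
rows = [(1, 10.0, 2), (2, 5.5, 4), (3, 8.0, 1), (4, 12.25, 3)]

# The same data as a column store: one contiguous array per attribute.
prices = array("d", [r[1] for r in rows])
quantities = array("l", [r[2] for r in rows])


def revenue_row_at_a_time(rows):
    """Classic tuple-at-a-time evaluation: touch every whole row in turn."""
    total = 0.0
    for _, price, qty in rows:
        total += price * qty
    return total


def revenue_vectorized(prices, quantities, vector_size=2):
    """Process the two needed columns in small batches ("vectors")."""
    total = 0.0
    for start in range(0, len(prices), vector_size):
        p = prices[start:start + vector_size]
        q = quantities[start:start + vector_size]
        # One operation applied to a whole vector of values at once.
        total += sum(a * b for a, b in zip(p, q))
    return total
```

Both functions compute the same answer; the point of the second shape is that only the columns a query needs are read, and the per-value overhead is amortized across each batch.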

And I have started hearing feedback, and it is good.  Very good.  While Ingres Vectorwise isn’t fully baked yet, I have heard it is producing astounding performance results in early testing.  In one case I heard of a <10TB real-life production comparison test in which Ingres Vectorwise smoked everything else they had tested.  And they have tested a lot of different market-leading analytical platforms.

So I think this is the start of an Ingres comeback.  Certainly anyone looking at <10TB analytical platforms will be getting the recommendation from me that they at least look at Ingres Vectorwise.  I am looking forward to seeing what 2010/2011 brings for them.


April 26, 2010

MongoDB 30,000 downloads a month?

[Image: MongoDB logo, by Cesar Rodas via Flickr]

While I have written about NoSQL generating a lot of buzz recently, I have also written that it is very minor when compared with the activity that occurs day in, day out on relational databases.  I would suggest those working with NoSQL databases are still a fraction of a percent of those working with more traditional relational databases.

Which is why I was surprised to read recently, over on Intelligent Enterprise’s blog, an interview with 10gen founder Dwight Merriman:

“When his company first started making MongoDB available for free downloads last year, they numbered a few hundred a month. But traffic has rapidly built up to a level of 30,000 downloads a month, he said.”

This high number piqued my interest, so I quickly did a little checking on the MongoDB site.  I couldn’t find any download stats, but I did notice some stats relating to the number of people who had signed up to the support forums:

  • The mongodb-user Google group has 1682 members
  • The mongodb-announce group (“This group is for releases and important updates to MongoDB that anyone running MongoDB in production should subscribe to”) has 53 members
  • The MongoDB site blog has 916 subscribers in Google Reader

I could be wrong and the figure might be accurate, but perhaps this may have actually been page views on the MongoDB site rather than software downloads?  If it is accurate then I will take my hat off to 10gen; they have come much further than I thought.

I am one of the biggest proponents of “the right tool for the right job”, and I think NoSQL databases can be the right tool in a lot of cases.  But we need to keep our heads about us also.  We still have a very long way to go before any of this NoSQL stuff is considered mildly mainstream.

BTW, I will ping Charles Babcock for comment.

* Edit: I have removed the MySQL stuff.  I understand MySQL has 70,000 downloads a day for comparison.

April 14, 2010

NoSQL Buzz

I have noticed a definite increase in NoSQL buzz over the last few months.  This is partly confirmed by Google Trends, a service that shows how search topics rank:

[Figure: Google Trends chart for NoSQL]

The last couple of months have seen a dramatic rise in both the number of searches and the number of news items relating to NoSQL.

But the traditionalists need not fret yet.  Interest in NoSQL is still but a blip on the data management radar, as demonstrated by this comparison between NoSQL and MySQL search rankings:

[Figure: Google Trends chart comparing NoSQL and MySQL]

It will be interesting to see how the dynamics of this change throughout 2010 though.


March 09, 2010

Investment in Relational?

I have noticed a sharp change of focus in venture funding for data-orientated companies over the last six months.  Many VCs have lost some interest in funding data start-ups that are doing anything around relational data management.  Instead the interest is in NoSQL technologies, from key/value stores through to Hadoop-based data management layers.

I am highly supportive of the development, and therefore the funding, of a more diverse set of big data technologies than those based on the relational model alone.  However, I also advise caution so as not to throw the baby out with the bathwater.  Relational data management technologies continue to be a focus of innovation.  There are companies working on game-changing steps forward which have relational underpinnings.

The relational model is going to continue to be the underlying model of most of the world’s structured data for the foreseeable future.  Many opportunities for innovation exist, and will continue to exist, around this fundamental model.

A mindset that relational is yesterday’s technology and non-relational is tomorrow’s defies conventional wisdom and will lead to great opportunities being missed.

January 31, 2010

What is Big Data?

[Image: “Exhibit: Aggregations”, by Aranda\Lasch via Flickr]

One of my favorite terms at the moment is “Big Data”.  While all terms are by nature subjective, in this post I will try and explain what Big Data means to me.

So what is Big Data?

Big Data is the “modern scale” at which we are defining our data usage challenges.  Big Data begins at the point where we need to seriously start thinking about the technologies used to drive our information needs.

While Big Data as a term seems to refer to volume alone, this isn’t the case.  Many existing technologies have little problem physically handling large volumes (TB or PB) of data.  Instead, the Big Data challenges result from the combination of volume and the usage demands we place on that data.  And those usage demands are nearly always tied to timeliness.

Big Data is therefore the push to utilize “modern” volumes of data within “modern” timeframes.  The exact definitions are of course relative & constantly changing; right now we are somewhere along the path towards the end goal, which is of course the ability to handle an unlimited volume of data while processing all requests in real time.

So what are Big Data technologies?

More than at any point in the past, data related technologies are the focus of research & innovation.  But Big Data challenges won’t be solved anytime soon by a single approach.  Keeping in mind all the different platforms that Big Data is having an impact on (web, cloud, enterprise, mobile) combined with all the Big Data domain challenges (transaction processing, analytics, data mining, visualization) as well as many of the Big Data characteristic requirements (volume, timeliness, availability, consistency), it is easy to see how no single technology will provide a cover-all solution for the eclectic mix of needs.  Instead, a broad set of technologies, each focused on meeting a specific set of needs, is improving our ability to manage data at scale.

A few common areas of innovation that I describe as Big Data technologies include: MPP Analytics, Cloud Data Services, Hadoop & Map/Reduce (and associated technologies such as HBase, Pig & Hive), In-Memory Databases, some Distributed NoSQL databases and some Distributed Transaction Processing databases.

So what is the point of Big Data?

Someone asked me if Big Data was just tools to “try and sell them more relevant crap they don’t want”.  While up-sell & targeted advertising are two major uses of Big Data technologies, I hope that my work and that of others in this field does result in achievements more significant than just these.

When describing the point of Big Data I like to think about how the Internet has changed my life in general.  By having unlimited & timely access to information we are now better informed in all areas of our existence than ever before.  However, we are now facing the problem that there is fast becoming too much data for us to digest in its raw form.  To move forward in our understanding we will need to rely on technology to provide timely, summarized & relevant data across all aspects of our lives.  This is what those working in Big Data are setting out to achieve.



December 15, 2009

The Commoditization of MPP

[Image: Neatly Stacked Servers, by scottpowerz via Flickr]

Next year will be the start of much more difficult times for the existing MPP start-ups and early stage companies (including Greenplum, Vertica, Netezza, Xtreme Data, Kognitio, Aster Data etc).  This is because Microsoft introducing an MPP solution (Madison, now known as Parallel Data Warehouse) is the start of the commoditization of the technology and market.  To understand this you need to understand the sales process for MPP.  It goes something like:

CIO: We need a data warehouse, what platform should we use?
DBA: We are an [Oracle | SQL Server] shop so use that.
CIO: Ok.

Some time later….

CIO: Our data warehouse is very slow and people are complaining.
DBA: The server is too small as you have loaded much more data than planned. We need a bigger box.
CIO: Ok.

Some time later….

CIO: Our data warehouse is slow again
DBA: I know but we have the biggest box we can get and we have tuned everything and I am out of ideas.

CIO: Our data warehouse is slow
Consultant: Yes of course it is, you need to use an MPP platform
CIO: We are an [Oracle | SQL Server ] shop so do these vendors have a solution?
Consultant: [Yes but it will cost you | No].
CIO: What about [SQL Server | Oracle ]?
Consultant: [No | Yes but it will cost you].
CIO: What about Teradata?
Consultant: Yes but it will cost you.
CIO: Oh.  Any other options?
Consultant: Yes there are a bunch of start ups selling MPP solutions.
CIO: Which one is best?
Consultant: They are all good but all slightly different.
CIO: Ok, make a short list and we will do a proof of concept to see which platform does what we want at the price we want.

Some months later.

CIO: Congratulations [Vertica | Greenplum | Netezza | Aster Data | Kognitio ] you have won our business.

You see, the problem with this process for the existing MPP vendors is that much of the trickle-down business that reaches them now is going to be caught higher up, by the sheer fact that Microsoft has MPP.  This must be a big worry, and I think we will see some consolidation of MPP vendors before 2012.


Relaxed raises funds - CouchDB

[Image: Apache CouchDB, via Wikipedia]

I just noticed Damien Katz (the original author of CouchDB) has raised $2m in early funding for his CouchDB-related company Relaxed.  I spoke with Damien a couple of months back and have a fair idea of what they have in mind (it is exciting, btw) and wish them well on this endeavor.

End is in sight for Oracle & Sun

[Image: Sun Rack Mount Servers, via Wikipedia]

Oracle has published their promises which have reportedly gone a long way to appeasing the EU, so the likely outcome is the takeover of Sun will be approved in January.

My own personal opinion has been that the anti-competitive argument really didn’t hold much water.  Reading Oracle’s promises, none appear very extreme (they largely agree to maintain the status quo), which would lead you to question why it has taken so long to sort out.  But importantly for getting this resolved, they are a concession by Oracle and a win for the EU.

Hopefully the mop up can begin shortly.

The full Oracle press release is here and Curt Monash’s related post is here.

1. Continued Availability of Storage Engine APIs. Oracle shall maintain and periodically enhance MySQL’s Pluggable Storage Engine Architecture to allow users the flexibility to choose from a portfolio of native and third party supplied storage engines.

MySQL’s Pluggable Storage Engine Architecture shall mean MySQL’s current practice of using, publicly-available, documented application programming interfaces to allow storage engine vendors to “plug” into the MySQL database server. Documentation shall be consistent with the documentation currently provided by Sun.

2. Non-assertion. As copyright holder, Oracle will change Sun’s current policy and shall not assert or threaten to assert against anyone that a third party vendor’s implementations of storage engines must be released under the GPL because they have implemented the application programming interfaces available as part of MySQL’s Pluggable Storage Engine Architecture.

A commercial license will not be required by Oracle from third party storage engine vendors in order to implement the application programming interfaces available as part of MySQL’s Pluggable Storage Engine Architecture.

Oracle shall reproduce this commitment in contractual commitments to storage vendors who at present have a commercial license with Sun.

3. License commitment. Upon termination of their current MySQL OEM Agreement, Oracle shall offer storage vendors who at present have a commercial license with Sun an extension of their Agreement on the same terms and conditions for a term not exceeding December 10, 2014.

Oracle shall reproduce this commitment in contractual commitments to storage vendors who at present have a commercial license with Sun.

4. Commitment to enhance MySQL in the future under the GPL. Oracle shall continue to enhance MySQL and make subsequent versions of MySQL, including Version 6, available under the GPL. Oracle will not release any new, enhanced version of MySQL Enterprise Edition without contemporaneously releasing a new, also enhanced version of MySQL Community Edition licensed under the GPL. Oracle shall continue to make the source code of all versions of MySQL Community Edition publicly available at no charge.

5. Support not mandatory. Customers will not be required to purchase support services from Oracle as a condition to obtaining a commercial license to MySQL.

6. Increase spending on MySQL research and development. Oracle commits to make available appropriate funding for the MySQL continued development (GPL version and commercial version). During each of the next three years, Oracle will spend more on research and development (R&D) for the MySQL Global Business Unit than Sun spent in its most recent fiscal year (USD 24 million) preceding the closing of the transaction.

7. MySQL Customer Advisory Board. No later than six months after the anniversary of the closing, Oracle will create and fund a customer advisory board, including in particular end users and embedded customers, to provide guidance and feedback on MySQL development priorities and other issues of importance to MySQL customers.

8. MySQL Storage Engine Vendor Advisory Board. No later than six months after the anniversary of the closing, Oracle will create and fund a storage engine vendor advisory board, to provide guidance and feedback on MySQL development priorities and other issues of importance to MySQL storage engine vendors.

9. MySQL Reference Manual. Oracle will continue to maintain, update and make available for download at no charge a MySQL Reference Manual similar in quality to that currently made available by Sun.

10. Preserve Customer Choice for Support. Oracle will ensure that end-user and embedded customers paying for MySQL support subscriptions will be able to renew their subscriptions on an annual or multi-year basis, according to the customer’s preference.

It may be premature to assume this is now done and dusted, but for Oracle to publish this I presume they have had it okayed by the EU first (there have also been reports that the EU has responded positively to this agreement).


Is Cassandra winning the NoSQL race?

Cassandra is fast emerging as one of the key NoSQL databases.  While we often express that the point of NoSQL is to offer more choice than an “RDBMS” hammer for every nail, there are practical reasons why a small number of stack technologies gain dominance and others circle on the sidelines.


Cassandra has already ticked many of the boxes needed to shoot it into the stratosphere as a widely used, default database platform.  Especially so in the web world where high scalability, high availability, open source and being proven by a bigger fish all matter.  Specifically Cassandra has:
  • The ability to scale across many nodes
  • The ability to scale to many hundreds of gigabytes of data
  • High availability: losing a node doesn’t take down the cluster, and nodes can be provisioned online with data distributed (and copied) automatically.  It is also decentralized (every node is the same as every other, with no single point of failure).
  • Bigtable-like “Column Families” (more advanced schema control than a DHT)
  • Dynamo-like eventual consistency (not a plus, but a trade-off required for scalability), log-based recovery and the ability to write either asynchronously or synchronously
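
To make the eventual consistency trade-off in that last bullet a little more concrete, here is a toy Python sketch of the Dynamo-style quorum idea: with N replicas, a write that waits for W of them and a read that consults R of them are guaranteed to overlap only when W + R > N.  This is a simplified illustration under my own assumptions, not Cassandra's API or implementation; the `Replica` class and the `write`, `read` and `demo` helpers are hypothetical names, and the "worst case" replica choice is contrived to make the effect deterministic.

```python
class Replica:
    """One copy of a single value, tagged with a version number."""

    def __init__(self):
        self.version = 0
        self.value = None


def write(replicas, value, version, w):
    # Synchronously update only the first w replicas.  In a real system
    # the remaining replicas would catch up later, asynchronously.
    for replica in replicas[:w]:
        replica.version = version
        replica.value = value


def read(replicas, r):
    # Worst case for the reader: consult the r replicas least likely
    # to have seen the latest write, then keep the newest version found.
    consulted = replicas[-r:]
    newest = max(consulted, key=lambda rep: rep.version)
    return newest.value


def quorum_overlaps(n, w, r):
    """True when every read set must share a replica with every write set."""
    return w + r > n


def demo(n, w, r):
    replicas = [Replica() for _ in range(n)]
    write(replicas, "old", 1, n)   # every replica starts with the old value
    write(replicas, "new", 2, w)   # the update reaches only w replicas
    return read(replicas, r)
```

With three replicas, `demo(3, 2, 2)` returns the new value because the read and write sets must share a replica, while `demo(3, 1, 1)` can return the old one.  The smaller W and R are, the less waiting each request does, and the more likely a read is to be stale until the remaining replicas catch up.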

Cassandra, if you’re not familiar, was built originally by Facebook as an internal database system to help them scale to their massive data demands.  It was then thrown over the wall and made open source, where the community picked it up and ran with it.  Cassandra is capable of supporting transaction processing workloads at large scale and has found favor at Rackspace, Twitter, Digg and others.

Interestingly, I understand Facebook forked the code and have continued to develop their own internal version independently of the open source version.  The open source Cassandra is now largely developed by Rackspace, where they have 3 people working full time led by Jonathan Ellis, and by Digg (plus the community at large).  The reasons behind the fork aren’t entirely clear, but one may assume that Facebook were happy to share their work with the community but don’t have the time or interest to manage the ongoing development of an open source project.

Scale is the primary reason why you would choose a platform like Cassandra.  Traditional RDBMSs start to struggle when you want to go beyond 1 node, and big clusters are currently only really possible using expensive shared disk technology or when targeting specialized analytical workloads (MPP RDBMS).  I understand Facebook is running a 150 node Cassandra cluster, and others have 30+ node clusters in production also.

What Cassandra is majorly lacking right now (apart from secondary indexes, which I think they are working on) is the backing of a commercial vendor providing product support (Rackspace are not doing this).  But I am sure this will be addressed in the near future, with either Rackspace spinning something up or someone like Cloudera adding it to their responsibilities.

© Tony Bain