One of my favorite terms at the moment is “Big Data”. While all terms are by nature subjective, in this post I will try to explain what Big Data means to me.
Big Data is the “modern scale” at which we are defining our data usage challenges. Big Data begins at the point where we need to seriously start thinking about the technologies used to drive our information needs.
While Big Data as a term seems to refer to volume, this isn’t really the case. Many existing technologies have little problem physically handling large volumes (TB or PB) of data. Instead, the Big Data challenges result from the combination of volume and our usage demands on that data. And those usage demands are nearly always tied to timeliness.
Big Data is therefore the push to utilize “modern” volumes of data within “modern” timeframes. The exact definitions are of course relative and constantly changing; right now we are somewhere along the path towards the end goal, which is of course the ability to handle an unlimited volume of data, processing all requests in real time.
More than at any point in the past, data related technologies are the focus of research & innovation. But Big Data challenges won’t be solved anytime soon by a single approach. Keeping in mind all the different platforms that Big Data is having an impact on (web, cloud, enterprise, mobile), combined with all the Big Data domain challenges (transaction processing, analytics, data mining, visualization) as well as many of the Big Data characteristic requirements (volume, timeliness, availability, consistency), it is easy to see how no single technology will provide a cover-all solution for this eclectic mix of needs. Instead, a broad set of technologies, each focused on meeting a specific set of needs, are improving our ability to manage data at scale.
A few common areas of innovation that I describe as technologies relevant to Big Data include: MPP Analytics, Cloud Data Services, Hadoop & Map/Reduce (and associated technologies such as HBase, Pig & Hive), In-Memory Databases, some Distributed NoSQL databases and some Distributed Transaction Processing databases.
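As a concrete illustration of one item on that list, the Map/Reduce pattern behind Hadoop can be sketched in a few lines of plain Python. This is just a toy word count showing the three phases; a real framework distributes each phase across a cluster:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values per key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data is big", "data at scale"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

The value of the model is that the map and reduce functions are pure and independent per key, which is what lets Hadoop run them in parallel over very large inputs.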
Someone asked me if Big Data was just tools to “try and sell them more relevant crap they don’t want”. While up-sell & targeted advertising are two major uses of Big Data technologies, I hope that my work and others’ in this field results in achievements more significant than just these.
When describing the point of Big Data I like to think about how the Internet has changed my life in general. By having unlimited & timely access to information, we are now better informed in all areas of our existence than ever before. However, we now face the problem that there is fast becoming too much data for us to digest in its raw form. To move forward in our understanding we will need to rely on technology to provide timely, summarized & relevant data across all aspects of our lives. This is what those working in Big Data are setting out to achieve.
Next year will be the start of much more difficult times for the existing MPP start-ups/early stage companies (including Greenplum, Vertica, Netezza, Xtreme Data, Kognitio, Aster Data etc.). This is because Microsoft introducing an MPP solution (Madison, now known as Parallel Data Server) is the start of the commoditization of the technology and the market. To understand this you need to understand the sales process for MPP. It goes something like:
CIO: We need a data warehouse, what platform should we use?
DBA: We are an [Oracle | SQL Server] shop so use that.
Some time later… CIO: Our data warehouse is very slow and people are complaining.
DBA: The server is too small as you have loaded much more data than planned. We need a bigger box.
Some time later… CIO: Our data warehouse is slow again.
DBA: I know but we have the biggest box we can get and we have tuned everything and I am out of ideas.
CIO: Our data warehouse is slow
Consultant: Yes of course it is, you need to use an MPP platform
CIO: We are an [Oracle | SQL Server ] shop so do these vendors have a solution?
Consultant: [Yes but it will cost you | No].
CIO: What about [SQL Server | Oracle ]?
Consultant: [No | Yes but it will cost you].
CIO: What about Teradata?
Consultant: Yes but it will cost you.
CIO: Oh. Any other options?
Consultant: Yes there are a bunch of start ups selling MPP solutions.
CIO: Which one is best?
Consultant: They are all good but all slightly different.
CIO: Ok, make a short list and we will do a proof of concept to see which platform does what we want at the price we want.
Some months later… CIO: Congratulations [Vertica | Greenplum | Netezza | Aster Data | Kognitio ], you have won our business.
Oracle has published its promises, which have reportedly gone a long way toward appeasing the EU, so the likely outcome is that the takeover of Sun will be approved in January.
1. Continued Availability of Storage Engine APIs. Oracle shall maintain and periodically enhance MySQL’s Pluggable Storage Engine Architecture to allow users the flexibility to choose from a portfolio of native and third party supplied storage engines.
MySQL’s Pluggable Storage Engine Architecture shall mean MySQL’s current practice of using publicly available, documented application programming interfaces to allow storage engine vendors to “plug” into the MySQL database server. Documentation shall be consistent with the documentation currently provided by Sun.
2. Non-assertion. As copyright holder, Oracle will change Sun’s current policy and shall not assert or threaten to assert against anyone that a third party vendor’s implementations of storage engines must be released under the GPL because they have implemented the application programming interfaces available as part of MySQL’s Pluggable Storage Engine Architecture.
A commercial license will not be required by Oracle from third party storage engine vendors in order to implement the application programming interfaces available as part of MySQL's Pluggable Storage Engine Architecture.
Oracle shall reproduce this commitment in contractual commitments to storage vendors who at present have a commercial license with Sun.
3. License commitment. Upon termination of their current MySQL OEM Agreement, Oracle shall offer storage vendors who at present have a commercial license with Sun an extension of their Agreement on the same terms and conditions for a term not exceeding December 10, 2014.
Oracle shall reproduce this commitment in contractual commitments to storage vendors who at present have a commercial license with Sun.
4. Commitment to enhance MySQL in the future under the GPL. Oracle shall continue to enhance MySQL and make subsequent versions of MySQL, including Version 6, available under the GPL. Oracle will not release any new, enhanced version of MySQL Enterprise Edition without contemporaneously releasing a new, also enhanced version of MySQL Community Edition licensed under the GPL. Oracle shall continue to make the source code of all versions of MySQL Community Edition publicly available at no charge.
5. Support not mandatory. Customers will not be required to purchase support services from Oracle as a condition to obtaining a commercial license to MySQL.
6. Increase spending on MySQL research and development. Oracle commits to make available appropriate funding for the MySQL continued development (GPL version and commercial version). During each of the next three years, Oracle will spend more on research and development (R&D) for the MySQL Global Business Unit than Sun spent in its most recent fiscal year (USD 24 million) preceding the closing of the transaction.
7. MySQL Customer Advisory Board. No later than six months after the anniversary of the closing, Oracle will create and fund a customer advisory board, including in particular end users and embedded customers, to provide guidance and feedback on MySQL development priorities and other issues of importance to MySQL customers.
8. MySQL Storage Engine Vendor Advisory Board. No later than six months after the anniversary of the closing, Oracle will create and fund a storage engine vendor advisory board, to provide guidance and feedback on MySQL development priorities and other issues of importance to MySQL storage engine vendors.
9. MySQL Reference Manual. Oracle will continue to maintain, update and make available for download at no charge a MySQL Reference Manual similar in quality to that currently made available by Sun.
10. Preserve Customer Choice for Support. Oracle will ensure that end-user and embedded customers paying for MySQL support subscriptions will be able to renew their subscriptions on an annual or multi-year basis, according to the customer’s preference.
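For readers unfamiliar with the pluggable storage engine architecture that several of these commitments revolve around: MySQL's real API is a C/C++ handler interface, but the "plug in" idea can be sketched conceptually in Python. All names below are invented for illustration and are not MySQL's actual API:

```python
from abc import ABC, abstractmethod

class StorageEngine(ABC):
    """The contract every engine must implement to 'plug in' to the server."""
    @abstractmethod
    def write(self, key, row): ...
    @abstractmethod
    def read(self, key): ...

class InMemoryEngine(StorageEngine):
    """A trivial engine; a real one would manage pages, logs and locks."""
    def __init__(self):
        self._rows = {}
    def write(self, key, row):
        self._rows[key] = row
    def read(self, key):
        return self._rows.get(key)

ENGINES = {}  # the server's registry of available engines

def register_engine(name, engine_cls):
    """Third-party vendors register their engine against the public interface."""
    ENGINES[name] = engine_cls

register_engine("memory", InMemoryEngine)

# A table is then created "with" an engine, as in CREATE TABLE ... ENGINE=...
table = ENGINES["memory"]()
table.write(1, ("alice",))
print(table.read(1))  # ('alice',)
```

The commercial point of the commitments is exactly this boundary: vendors implement only the published interface, so (per items 1 and 2) they should not need a GPL release or a commercial license just for plugging in.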
It may be premature to assume this is now done and dusted, but for Oracle to publish this I presume they have had it okayed by the EU first (there are also reports that the EU has responded positively to this agreement).
VoltDB will be an interesting case to watch. VoltDB (Vertica’s “sister”) is a lightweight DBMS optimized for large-scale transaction processing. I don’t know which bits of the architecture they are okay with people talking about yet, so I won’t go into detail on that. But regardless of the technology, VoltDB should be watched because of their transaction processing focus. Many analytics DBMS vendors have entered the market over the last few years, but few transaction processing alternatives have set up shop recently. This is for a few reasons, one major one being that the transaction processing market is such a tough nut to crack.
It sounds as if VoltDB has been bootstrapped with funding help coming from a company involved in the stock market. Certain areas of FSI obviously have “niches” that require high-end distributed transaction processing, which is precisely where I am sure they will find their early traction. But what will be interesting is whether they can break out of this niche and start to engage the wider ISV community. The go-to-market will be much different, and much more difficult, than what they have seen with Vertica. But with luminaries like Stonebraker leading the way, who knows, they may make a dent. The funny thing with Michael Stonebraker is that most of the companies or institutions he is involved with that I speak to say he is spending most of his time on “their” project. I am actually starting to doubt there is one Michael Stonebraker and suspect cloning may somehow be involved…
However, on a different note, it seems part of IBM’s strategy for gaining customers from Oracle is to make DB2 more compatible with Oracle. They say imitation is the greatest form of flattery, so I am not sure if IBM is paying Oracle a huge compliment here. But more seriously, my concern about this strategy is that I believe Oracle is very much aware of, and in control of, their wins & losses, and can put in preventative measures when they so desire to block any major hemorrhaging. IBM, I don't think you want to put too much focus on chasing Oracle's cast-offs. DB2 is also good in its own right, and you need to do a better job of showcasing the platform to ISVs if you want to retain your pride of place.
Although, at least this may allow ISVs to more easily support DB2 alongside Oracle.
FYI: the thoughts here have been gathered from conversations with several individuals, including an interesting conversation yesterday. As these conversations were off the record I won’t name names, but thanks to those people.
To me, Aster is more aggressively driving their platform into green fields, trying to leverage their technology to find new customers and new markets. Greenplum, on the other hand, is more ‘steady as she goes’, focusing on a more traditional and conservative enterprise data warehousing market (while still innovating ahead of the general purpose behemoths). The risks are on both sides. When trying to define a new market you risk not finding one, or finding one that is too small or “niche” to support your business. With the conservative approach you risk being lumped in with everyone else, and in data warehousing ‘everyone else’ is now quite a long list.
I was speaking with Michael Stonebraker this morning. I mentioned that lately many have been referencing comments he has made over the last couple of years, and that many had interpreted them as implying the RDBMS is “doomed”. Mike has been saying the same thing for years, but the current NoSQL movement seems to have picked up on this, highlighting that one of the RDBMS's own pioneers is predicting its demise.
I asked Mike to clarify this. My interpretation of his response is as follows. I understand that he doesn’t believe the relational database itself is doomed. Instead, the current general purpose implementations, or “elephants” in his words, are out of date. By moving away from a historical GP function into something more specific in focus, either transaction processing or analytics, you can easily get a 50x performance improvement over a GP RDBMS. This doesn’t necessarily mean moving away from the “relational” nature, but instead changing some core design principles for how an RDBMS is implemented. It is this improvement factor that will see “new” specialist platforms overtake “old” general purpose platforms; that is, gradually, over time. However, Mike also mentioned that the relational data model doesn’t make sense in a number of disciplines, particularly in the sciences, and that alternative modeling paradigms will offer benefits to this market (hence his focus on SciDB). So while relational is a valid data model, other data models are also needed.
I have a similar position to Mike, but perhaps with a few differences.
- Firstly, I agree with the mantra that current GP RDBMS platforms provide only a “middle of the road” capability, and that we have gone too far in using a GP RDBMS for everything. However, I do believe there is a long-term future for the GP RDBMS. A general purpose application requirement will continue to be well suited to a general purpose platform. With a specialist-only approach, a general purpose requirement may need both a specialist OLTP platform and a specialist analytics platform to provide the same capability.
- I agree that with an extreme requirement, either analytics or transaction processing, a specialist platform is well suited. But I don’t see the choice of just MPP or memory-resident RDBMS as being a broad enough set. Apps that use a database just as a persistence cache will benefit from a high-performing, scalable database platform with much tighter integration with the object model. I am not sure any of the current NoSQL platforms have it quite right yet, but when these guys eventually get together with the database guys and work on these things together they may get there.
- I don’t think a 50x performance speed-up on its own is enough to drive change in OLTP. I have written before about how difficult it is to get into this market and how tightly Oracle, Microsoft & IBM have it sewn up. But I don’t believe it is impossible; I think you just need to bring slam dunks on multiple fronts (performance being just one of them).
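To illustrate the persistence-cache point above: an app that just wants its object graph back, without a relational mapping layer in between, is doing something like the following toy sketch. This is pure illustration of the pattern, not any particular product; a real platform would add durability, eviction and distribution:

```python
import pickle

class ObjectStore:
    """Toy persistence cache: whole object graphs in, whole object graphs out.
    No tables, no SQL, no object-relational mapping step."""
    def __init__(self):
        self._data = {}
    def put(self, oid, obj):
        # Serialize the object graph directly.
        self._data[oid] = pickle.dumps(obj)
    def get(self, oid):
        blob = self._data.get(oid)
        return pickle.loads(blob) if blob is not None else None

class User:
    def __init__(self, name, friends=None):
        self.name = name
        self.friends = friends or []

store = ObjectStore()
store.put("u1", User("alice", friends=[User("bob")]))
u = store.get("u1")
print(u.name, u.friends[0].name)  # alice bob
```

The appeal for app developers is that the store speaks the object model natively; the open question in the post is whether such a store can also deliver the performance and scale guarantees of a real database platform.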
Anyway, I feel like a bit of a broken record at the moment. I have been addressing the “is the RDBMS doomed” question a couple of times a day for some time. Time to focus on something else for a bit.
There will be plenty of detailed coverage on Exadata V2 so I won’t attempt to replicate that. However, I do have a couple of initial thoughts which I would like to share. For those who missed it, Oracle has just announced Exadata V2 (their pre-built “machine”). Exadata V1 was built using HP equipment; Exadata V2 uses Sun. The main addition in Exadata V2 seems to be an extra tier in the memory hierarchy, a flash cache. Oracle is very quick to point out this is not flash disk but flash memory, Sun’s FlashFire technology. (Flash disk, or SSDs, was always going to be a transitional technology; flash memory doesn’t have the physical constraints of a disk with moving parts, so the whole “disk” concept for flash doesn’t make much sense other than that it fits easily with current architectures.)
The new memory layer (Processor Caches -> DRAM -> Flash Cache -> Disk), coupled with Oracle’s algorithms to effectively use the Flash Cache layer, brings performance benefits to the solution (plus all the other improvements 12 months of hardware innovation brings: faster CPUs, more memory etc.).
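The idea of a new tier in the hierarchy can be sketched generically: a small, fast cache absorbing reads in front of a slower store. The toy Python version below uses plain LRU eviction purely for illustration; Oracle's actual Flash Cache algorithms are proprietary and certainly more sophisticated than this:

```python
from collections import OrderedDict

class TieredCache:
    """Toy two-tier read path: a small fast tier ("flash") in front of a
    slow backing store ("disk"), with simple LRU eviction."""
    def __init__(self, backing_store, capacity=2):
        self.store = backing_store        # the slow tier
        self.cache = OrderedDict()        # the fast tier
        self.capacity = capacity
        self.hits = self.misses = 0

    def read(self, key):
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)   # mark as recently used
            return self.cache[key]
        self.misses += 1
        value = self.store[key]           # slow path: go to disk
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return value

disk = {f"block{i}": f"data{i}" for i in range(4)}
cache = TieredCache(disk, capacity=2)
cache.read("block0"); cache.read("block1"); cache.read("block0")
print(cache.hits, cache.misses)  # 1 2
```

The performance story of any such tier rests on the hit rate: every hit is served at the fast tier's latency, and only misses pay the full cost of the slow tier underneath.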
My initial thoughts are:
I have had some questions along the lines of “isn’t this back to the one-size-fits-all approach?” Well yes it is, but Oracle never really moved away from this in terms of the core DBMS. It is my understanding that Oracle Exadata was still the general purpose Oracle DBMS & RAC, but on a hardware platform optimized for accessing large data sets (making it a data warehousing solution). Using FlashFire, the hardware can now do high levels of random I/O (I think 1 million random I/Os was quoted), which makes the hardware platform general purpose as well.
One interesting question will be whether, under Oracle, other vendors can buy the exact same hardware configuration from Sun and optimize their DBMS for flash as well. If so, it may be difficult for them to do this in a way that is price competitive. And will competing DBMS vendors really want to help fill Oracle’s pockets further?
If we expect to see more of this hardware alignment between DBMS vendors, where does that leave Microsoft? Maybe HP is already peeling the Exadata V1 logos off their racks and sticking Microsoft Madison logos in their place?
Oracle has put out a FAQ which partly answers some of the questions.