January 31, 2010

What is Big Data?

One of my favorite terms at the moment is “Big Data”. While all terms are by nature subjective, in this post I will try to explain what Big Data means to me.

So what is Big Data?

Big Data is the “modern scale” at which we are defining our data usage challenges. Big Data begins at the point where we need to start thinking seriously about the technologies used to drive our information needs.

While Big Data as a term seems to refer to volume, this isn’t the case. Many existing technologies have little problem physically handling large volumes (TB or PB) of data. Instead, the Big Data challenges result from the combination of volume and our usage demands on that data. And those usage demands are nearly always tied to timeliness.

Big Data is therefore the push to utilize “modern” volumes of data within “modern” timeframes. The exact definitions are of course relative and constantly changing; right now they sit somewhere along the path towards the end goal, which is the ability to handle an unlimited volume of data while processing all requests in real time.

So what are Big Data technologies?

More than at any point in the past, data related technologies are the focus of research & innovation. But Big Data challenges won’t be solved anytime soon by a single approach. Consider all the different platforms that Big Data is having an impact on (web, cloud, enterprise, mobile), all the Big Data domain challenges (transaction processing, analytics, data mining, visualization) and the many Big Data characteristic requirements (volume, timeliness, availability, consistency). It is easy to see why no single technology will provide a cover-all solution for such an eclectic mix of needs. Instead, a broad set of technologies, each focused on meeting a specific set of needs, is improving our ability to manage data at scale.

A few common areas of innovation that I describe as Big Data technologies include: MPP Analytics, Cloud Data Services, Hadoop & Map/Reduce (and associated technologies such as HBase, Pig & Hive), In-Memory Databases, some Distributed NoSQL databases and some Distributed Transaction Processing databases.

So what is the point of Big Data?

Someone asked me if Big Data was just tools to “try and sell them more relevant crap they don’t want”. While up-sell & targeted advertising are two major uses of Big Data technologies, I hope that my work and that of others in this field results in achievements more significant than just these.

When describing the point of Big Data I like to think about how the Internet has changed my life in general.  By having unlimited & timely access to information we are now better informed in all areas of our existence than ever before.  However, we are now facing the problem that there is fast becoming too much data for us to digest in its raw form.  To move forward in our understanding we will need to rely on technology to provide timely, summarized & relevant data across all aspects of our lives.  This is what those working in Big Data are setting out to achieve.


December 15, 2009

The Commoditization of MPP

Next year will be the start of much more difficult times for the existing MPP startups and early stage companies (including Greenplum, Vertica, Netezza, XtremeData, Kognitio, Aster Data etc). This is because Microsoft introducing an MPP solution (Madison, now known as Parallel Data Warehouse) is the start of the commoditization of the technology and market. To understand this you need to understand the sales process for MPP. It goes something like:

CIO: We need a data warehouse, what platform should we use?
DBA: We are an [Oracle | SQL Server] shop so use that.
CIO: Ok.

Some time later….

CIO: Our data warehouse is very slow and people are complaining.
DBA: The server is too small as you have loaded much more data than planned. We need a bigger box.
CIO: Ok.

Some time later….

CIO: Our data warehouse is slow again.
DBA: I know but we have the biggest box we can get and we have tuned everything and I am out of ideas.

CIO: Our data warehouse is slow.
Consultant: Yes of course it is, you need to use an MPP platform.
CIO: We are an [Oracle | SQL Server ] shop so do these vendors have a solution?
Consultant: [Yes but it will cost you | No].
CIO: What about [SQL Server | Oracle ]?
Consultant: [No | Yes but it will cost you].
CIO: What about Teradata?
Consultant: Yes but it will cost you.
CIO: Oh.  Any other options?
Consultant: Yes there are a bunch of start ups selling MPP solutions.
CIO: Which one is best?
Consultant: They are all good but all slightly different.
CIO: Ok, make a short list and we will do a proof of concept to see which platform does what we want at the price we want.

Some months later.

CIO: Congratulations [Vertica | Greenplum | Netezza | Aster Data | Kognitio ] you have won our business.

You see, the problem for the existing MPP vendors is that much of the trickle-down that is occurring now is going to be caught higher up by the sheer fact that Microsoft has MPP. This must be a big worry, and I think we will see some consolidation of MPP vendors before 2012.

Relaxed raises funds - CouchDB

I just noticed Damien Katz (the original author of CouchDB) has raised $2m in early funding for his CouchDB-related company Relaxed. I spoke with Damien a couple of months back and have a fair idea of what they have in mind (it is exciting, btw) and wish them well on this endeavor.

End is in sight for Oracle & Sun

Oracle has published their promises which have reportedly gone a long way to appeasing the EU, so the likely outcome is the takeover of Sun will be approved in January.

My own opinion has been that the anti-competitive argument really didn’t hold much water. Reading Oracle’s promises, none appear very extreme (they largely agree to maintain the status quo), which leads you to question why it has taken so long to sort out. But importantly for getting this resolved, they are a concession by Oracle and a win for the EU.

Hopefully the mop-up can begin shortly.

The full Oracle press release is here and Curt Monash’s related post is here.

1. Continued Availability of Storage Engine APIs. Oracle shall maintain and periodically enhance MySQL’s Pluggable Storage Engine Architecture to allow users the flexibility to choose from a portfolio of native and third party supplied storage engines.

MySQL’s Pluggable Storage Engine Architecture shall mean MySQL’s current practice of using, publicly-available, documented application programming interfaces to allow storage engine vendors to “plug” into the MySQL database server. Documentation shall be consistent with the documentation currently provided by Sun.

2. Non-assertion. As copyright holder, Oracle will change Sun’s current policy and shall not assert or threaten to assert against anyone that a third party vendor’s implementations of storage engines must be released under the GPL because they have implemented the application programming interfaces available as part of MySQL’s Pluggable Storage Engine Architecture.
A commercial license will not be required by Oracle from third party storage engine vendors in order to implement the application programming interfaces available as part of MySQL's Pluggable Storage Engine Architecture.
Oracle shall reproduce this commitment in contractual commitments to storage vendors who at present have a commercial license with Sun.

3. License commitment. Upon termination of their current MySQL OEM Agreement, Oracle shall offer storage vendors who at present have a commercial license with Sun an extension of their Agreement on the same terms and conditions for a term not exceeding December 10, 2014.
Oracle shall reproduce this commitment in contractual commitments to storage vendors who at present have a commercial license with Sun.

4. Commitment to enhance MySQL in the future under the GPL. Oracle shall continue to enhance MySQL and make subsequent versions of MySQL, including Version 6, available under the GPL. Oracle will not release any new, enhanced version of MySQL Enterprise Edition without contemporaneously releasing a new, also enhanced version of MySQL Community Edition licensed under the GPL. Oracle shall continue to make the source code of all versions of MySQL Community Edition publicly available at no charge.

5. Support not mandatory. Customers will not be required to purchase support services from Oracle as a condition to obtaining a commercial license to MySQL.

6. Increase spending on MySQL research and development. Oracle commits to make available appropriate funding for the MySQL continued development (GPL version and commercial version). During each of the next three years, Oracle will spend more on research and development (R&D) for the MySQL Global Business Unit than Sun spent in its most recent fiscal year (USD 24 million) preceding the closing of the transaction.

7. MySQL Customer Advisory Board. No later than six months after the anniversary of the closing, Oracle will create and fund a customer advisory board, including in particular end users and embedded customers, to provide guidance and feedback on MySQL development priorities and other issues of importance to MySQL customers.

8. MySQL Storage Engine Vendor Advisory Board. No later than six months after the anniversary of the closing, Oracle will create and fund a storage engine vendor advisory board, to provide guidance and feedback on MySQL development priorities and other issues of importance to MySQL storage engine vendors.

9. MySQL Reference Manual. Oracle will continue to maintain, update and make available for download at no charge a MySQL Reference Manual similar in quality to that currently made available by Sun.

10. Preserve Customer Choice for Support. Oracle will ensure that end-user and embedded customers paying for MySQL support subscriptions will be able to renew their subscriptions on an annual or multi-year basis, according to the customer’s preference.

It may be premature to assume this is now done and dusted, but for Oracle to publish this I presume they have had it okayed by the EU first (there have also been reports that the EU has responded positively to this agreement).

Is Cassandra winning the NoSQL race?

Cassandra is fast emerging as one of the key NoSQL databases.  While we often express that the point of NoSQL is to offer more choice than an “RDBMS” hammer for every nail, there are practical reasons why a small number of stack technologies gain dominance and others circle on the sidelines.


Cassandra has already ticked many of the boxes needed to shoot it into the stratosphere as a widely used, default database platform, especially in the web world where high scalability, high availability, open source and being proven by a bigger fish all matter. Specifically Cassandra has:
  • The ability to scale across many nodes
  • The ability to scale to many hundreds of gigabytes of data
  • High availability: losing a node doesn’t take down the cluster, and nodes can be provisioned online with automated data distribution and copying. It is also decentralized (every node is the same as every other, so there is no single point of failure).
  • Bigtable-like “Column Families” (more advanced schema control than a DHT)
  • Dynamo-like eventual consistency (not a plus, but a trade-off required for scalability), log-based recovery, and the ability to write either asynchronously or synchronously
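
To make the last point a little more concrete, here is a toy model of synchronous versus asynchronous replication. It is only a sketch: the `ToyCluster` class, its methods and the key names are hypothetical and have nothing to do with Cassandra's actual API or internals. It simply shows why an asynchronous write can leave a replica briefly stale before the cluster converges.

```python
class ToyCluster:
    """Hypothetical in-memory model of a replicated key/value store."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (replica_index, key, value) updates not yet applied

    def write(self, key, value, synchronous):
        # The first replica always receives the write immediately.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            if synchronous:
                # Synchronous: every replica is updated before we return.
                self.replicas[index][key] = value
            else:
                # Asynchronous: remaining replicas are updated later.
                self.pending.append((index, key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].get(key)

    def flush(self):
        # Apply queued updates; afterwards all replicas agree again.
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


cluster = ToyCluster()
cluster.write("user:1", "v1", synchronous=False)
stale = cluster.read("user:1", replica_index=2)  # replica 2 has not seen the write yet
cluster.flush()
fresh = cluster.read("user:1", replica_index=2)  # replicas have now converged
```

The asynchronous path returns sooner but allows a stale read in the window before `flush()`, which is the trade-off the bullet above describes.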

Cassandra, if you’re not familiar, was built originally by Facebook as an internal database system required to help them scale to their massive data demands. It was then thrown over the wall and made open source, where the community picked it up and ran with it. Cassandra is capable of supporting transaction processing workloads at large scale and has found favor at RackSpace, Twitter, Digg and others.

Interestingly, I understand Facebook forked the code and have continued to develop their own internal version independently of the open source version. The open source Cassandra is now largely developed by RackSpace, where 3 people work on it full time led by Jonathan Ellis, along with Digg and the community at large. The reasons behind this aren’t entirely clear, but one may assume that Facebook were happy to share their work with the community but don’t have the time or interest to manage the ongoing development of an open source project.

Scale is the primary reason why you would choose a platform like Cassandra. Traditional RDBMSs start to struggle when you want to go beyond one node, and big clusters are currently only really possible using expensive shared disk technology or when targeting specialized analytical workloads (MPP RDBMS). I understand Facebook is running a 150 node Cassandra cluster and others have 30+ node clusters in production also.

The major thing Cassandra is lacking right now (apart from secondary indexes, which I think they are working on) is the backing of a commercial vendor providing product support (RackSpace are not doing this). But I am sure this will be addressed in the near future with either RackSpace spinning something up or someone like Cloudera adding it to their responsibilities.

November 25, 2009

Analytics at Twitter

Last week I spent some time speaking with Kevin Weil, head of analytics at Twitter. Twitter, from a technology perspective, has had a bit of a hard time due to the stability issues of their early days. Kevin was keen to point out that he feels this was due to the incomparable growth Twitter was experiencing at the time and their constant struggle to keep up. Kevin was also keen to show that Twitter prides themselves on striving for engineering excellence, creating and contributing to new technologies, and generally helping to push the boundaries forward. Our conversation naturally centered on analytics at Twitter.

Twitter, like many Web 2.0 apps, started life as a MySQL based RDBMS application. Today, Twitter is still using MySQL for much of their online operational functionality (although this is likely to change in the near future – think distributed), but on the analytics side of things Twitter has spent the last 6 months moving away from running SQL queries against MySQL data marts. This was because meeting their need for timely data was becoming a struggle with MySQL, particularly when dealing with very large data volumes and complicated queries. For Web 2.0 companies the ability to understand, quantify and make timely predictions from user behavior is very much their life blood. When Kevin arrived at Twitter 6 months ago he was tasked with changing the way Twitter analyzed their data. Now the bulk of their analytics is executed using a Hadoop platform with Pig as the “querying language”.

Hadoop is a distributed shared-nothing cluster which locates data throughout the cluster using a virtualized file system. What has made Hadoop particularly popular for large scale deployment is the comparative ease of writing distributed functions through a process known as map/reduce. Map/reduce hides much of the complexity of running distributed functions, even when running over a very large number of nodes. This allows the developer to focus on their “application logic” rather than worrying about specifics of the execution process (Hadoop handles distribution of execution, node failures, etc). But in saying this, expressing complicated application logic directly in map/reduce functions can become quite laborious, as many pipelined map/reduce functions may be required to take raw data through to a useful processed result. Because of this complexity several higher level scripting languages have appeared to abstract this.
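
For readers who have not seen the pattern, here is a minimal single-process sketch of map/reduce in Python. It is not Hadoop's API; the function names and sample records are made up for illustration, and the "shuffle" step that Hadoop performs across nodes is simulated with a dictionary.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["big data is big", "data at scale"])
```

The developer only writes `map_phase` and `reduce_phase`; on a real cluster the framework takes care of splitting the input, moving the grouped data between nodes and retrying failed tasks.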

Pig is one such scripting language for Hadoop. Pig takes the developer’s requirements expressed in the script and produces the underlying map/reduce jobs that are executed on Hadoop. This abstraction is incredibly important, as without it expressing difficult analytical ‘queries’ directly in map/reduce would be highly time consuming & error prone. This can be thought of as being similar to the way SQL is a higher level abstraction language that hides all the query plan routines (written in C) that operate on the data in a traditional RDBMS. Of course abstraction provides increased efficiency in creating analytical routines, but comes at a performance cost. Kevin quantified his experience: he found that a Pig script is typically 5% of the code of native map/reduce, written in about 5% of the time. However, queries typically take 110-150% of the time that a native map/reduce job would have taken to execute. But of course, if there is a routine that is highly performance sensitive they still have the option to hand-code the native map/reduce functions directly.
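
To put Kevin's rule of thumb into numbers, here is a small back-of-the-envelope calculation. The 5% and 110-150% figures come from the conversation above; the function name and the example inputs (40 developer hours, a 20 minute job) are purely hypothetical.

```python
def pig_estimate(native_dev_hours, native_run_minutes,
                 dev_fraction=0.05, slowdown_range=(1.10, 1.50)):
    """Estimate Pig development time and run time from native map/reduce figures."""
    low, high = slowdown_range
    return {
        "pig_dev_hours": native_dev_hours * dev_fraction,
        "pig_run_minutes": (native_run_minutes * low, native_run_minutes * high),
    }


# Hypothetical job: 40 hours to hand-code natively, 20 minutes per run.
estimate = pig_estimate(native_dev_hours=40, native_run_minutes=20)
```

Under these assumed inputs the Pig version takes roughly 2 hours to write instead of 40, and each run takes roughly 22 to 30 minutes instead of 20, which is why hand-coding only pays off for the most performance sensitive routines.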

Ok, so why use Hadoop and Pig instead of a more traditional approach like an MPP RDBMS? Kevin explained that there were a few reasons for this. Firstly Twitter, like many Web 2.0 companies, is committed to open source and likes to use software that has a low entry cost but also allows them to contribute to the code base. Kevin mentioned that Twitter did look at some of the open source MPP RDBMS platforms but were less than convinced of their ability to scale to meet their needs at the time. And the second reason is exactly that, scale. Twitter is understandably coy on their exact numbers, but they have hundreds of Terabytes of data (but less than a Petabyte) and one could assume that to get reasonable performance they are running Hadoop on a few dozen nodes (this is a guess, Twitter didn’t say). As they grow and analytics becomes more important to their business, this may expand to hundreds (or thousands) of nodes. A “few hundred” nodes is right on the upper limit of what is possible today with the world’s most advanced MPP RDBMSs. Hadoop clusters, on the other hand, grow well into the hundreds and even the thousands of nodes (e.g. at Google, Facebook etc).

So Hadoop was the platform choice, but why Pig?  There are other “analytical” scripting languages that sit over Hadoop, notably Hive which was popularized by Facebook (Pig was popularized by Yahoo).  On discussing the merits of Pig vs Hive it became apparent that Hive was more in tune with a traditional approach (“database like”).  Hive requires data to be mapped to a given structure and the queries (using a SQL like derivative) are submitted against that schema.  Pig on the other hand is less prescriptive in terms of schema and individual queries can define the structure of the data for that execution.  In addition, Pig is more of a “procedural” language allowing the complicated data flow process to be more easily controlled and understood by the developers.
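
The schema difference can be illustrated with a few lines of Python. This is only an analogy, not Pig or Hive syntax: the sample data, field names and the `load` helper are all hypothetical. The point is that the same raw lines can be given a different structure by each query, rather than being mapped once to a fixed schema up front.

```python
# Hypothetical raw log lines: user, day, count (tab separated).
RAW = [
    "alice\t2009-11-20\t42",
    "bob\t2009-11-21\t7",
]


def load(lines, schema):
    """Parse tab-separated lines using a schema supplied by the calling query."""
    for line in lines:
        fields = line.split("\t")
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


# Query 1 declares the third field as an integer so it can be summed.
q1 = list(load(RAW, [("user", str), ("day", str), ("count", int)]))
total = sum(row["count"] for row in q1)

# Query 2 reads the same lines but treats every field as plain text.
q2 = list(load(RAW, [("user", str), ("day", str), ("count", str)]))
```

In the Hive-style approach the schema would be declared once and every query would run against it; here each query carries its own view of the data.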



So, as mentioned, Hadoop is a batch-based job processing platform. Jobs (in this case map/reduce jobs generated from the Pig queries) are submitted and results are returned sometime in the future. Exactly when in the future varies from a few minutes (e.g. they run jobs hourly which only take a few minutes to run) through to many hours for jobs that run over much larger sets of data. This leaves a gap in “near real-time” analytics between the lightweight queries they can run on the transactional system and the more intense Hadoop based analytics. Twitter has been investigating solutions to fill this space, which will be used for things like improved abuse detection, issue analysis and so on. Twitter is currently considering their data platform options here, including Cassandra and HBase, and may even decide to use a closed source MPP solution to fill this need (I can’t say what, sorry) due to the lack of suitable open source MPP alternatives.

For more technical info on Twitter’s use of Hadoop and Pig you can check out Kevin’s slide deck from the recent NoSQL East conference.


November 12, 2009

Aster Data’s breakaway move

As I have mentioned before, the MPP data warehouse space is quite full, with many new companies appearing over the last few years. The trick for the newer entrants, of course, is to differentiate themselves from the herd to overcome their lack of history and experience.

Aster Data has started to do this with the release of their v4.0 platform. They are now promoting their focus as being on “Big Data Applications” rather than the more generic Big Data Warehousing. This seems to have entailed a rethink about how they were positioning their in-database Map/Reduce functionality (which was obtuse in definition, for me at least) and they are now marketing their in-engine code execution capabilities in a much clearer way. That is, to allow the push down of application logic into the MPP environment, making Aster Data an MPP Data Application Platform rather than just an MPP Database Platform. While this may largely just be a change in marketing and semantics (and a new logo), I do think this helps to make Aster stand out and offers them a more distinctive go-to-market.

I have yet to look into the details of this, but in theory at least moving higher level application components down into the MPP environment would seem beneficial from a performance and robustness perspective.  Interestingly, Teradata has recently been working with SAS to move parts of their analytics stack down into Teradata’s stack.

November 11, 2009

Disappointed for MySQL

Like many I was disappointed, yet not surprised, that the EC formally lodged their objection to Oracle’s acquisition of Sun on account of MySQL a few days back. And we also hear today that Oracle will be stating their position in Brussels on the 25th of this month. To me this case has been odd from the outset, and as it goes on it is just getting odder. And of course this all seems to be occurring at immense cost to Sun, Oracle and MySQL themselves.

There are several reasons why this is odd.  One of the key ones is that for some time MySQL has been quite open about their non compete focus with Oracle.

For example, this is an excerpt from a Jan-2007 interview between Marten Mickos (MySQL CEO at the time) and Linux Journal (keep in mind this was prior to the Sun acquisition).

GM: Does that mean MySQL is not really up against Oracle as a competitor—that you tend to go for new companies?
MM: I would put it differently: they are not up against us when it comes to Web 2.0; we are among the pioneers there, the leaders there.

GM: What about in the traditional markets, do you find that you are starting to compete against Oracle?
MM: We do, but it's not a main area of focus for us. This is the major difference between us and the other open-source databases. Most of the others are trying to become a replacement for Oracle, so if you look at PostgreSQL, EnterpriseDB, Ingres and all those guys, they try to mimic the old-style databases so that they one day can claim that space. But my guess is that by that time, the space will be gone.

Then, following the Sun acquisition, Jonathan Schwartz (Sun CEO at the time), speaking at a SugarCRM conference in Feb-2008, made the following comments:

Asked if Sun planned to scale the MySQL database to compete with Oracle, [Jonathan] Schwartz said Sun will not compete with Oracle but "will scale MySQL to extraordinary heights."


Yet the EC remains not so sure.

CLARIFICATION: I am disappointed not because I necessarily want MySQL to be owned by Oracle, but instead because I think this drawn-out period of uncertainty is doing more to damage MySQL than any acquisition would.

November 10, 2009

Back from Blogging Hiatus - Update 3

Ingres

No specific announcements from Ingres other than I think the VectorWise stuff is progressing well.

To me Ingres is a bit of a dark horse.  They are open source and doing reasonable revenues.  And they are active in the enterprise market (something MySQL hasn’t really achieved).  But they remain largely off the radar in commentary surrounding the DBMS industry.

My personal pick is that this will start to change during the second half of next year. I think several things happening in the market (Oracle’s eventual acquisition of MySQL being a major one) and some things happening internally (VectorWise being a major one) will help to propel Ingres back into the RDBMS spotlight, especially in the enterprise.

VoltDB

It sounds like VoltDB is getting closer with some talk of being able to see an early version of the product soon.

VoltDB will be an interesting case to watch. VoltDB (Vertica’s “sister”) is a lightweight DBMS optimized for large scale transaction processing. I don’t know which bits of the architecture they are ok for people to talk about yet so I won’t go into detail on that. But regardless of the technology, VoltDB should be watched because of their transaction processing focus. Many analytics DBMS vendors have entered the market over the last few years, but few transaction processing alternatives have set up shop recently. This is for a few reasons, a major one being that the transaction processing market is such a tough nut to crack.

It sounds as if VoltDB has been bootstrapped with funding help coming from a company who is involved in the stock market. Certain areas of FSI obviously have niches that require high end distributed transaction processing, which is precisely where I am sure they will find their early traction. But what will be interesting is whether they can break out of this niche and start to engage the wider ISV community. The go-to-market will be much different and much more difficult than what they have seen with Vertica. But with luminaries like Stonebraker leading the way, who knows, they may make a dent.

The funny thing with Michael Stonebraker is that most of the companies or institutions he is involved with that I speak to say that he is spending most of his time on "their" project. I am actually starting to doubt there is one Michael Stonebraker and suspect cloning may somehow be involved…

IBM DB2

I spoke to IBM a few weeks back when they announced their DB2 PureScale technology. PureScale is actually quite exciting. But they chose to announce it around the time of Oracle OpenWorld and press attention was largely drowned out by, among other things, Larry’s persistent bagging of IBM.

IBM DB2 PureScale is a technology solution which provides shared-disk clustering for DB2 on IBM Power Systems.  New nodes can be added online (a traditional problem for shared disk clustering), and node failures will not see new requests fail as they will be transparently routed to other available nodes (although I believe in progress transactions will fail).  This is done using the hardware architecture of the Power Systems, and also done in a way that doesn’t require any application code changes.

On a different note, it seems part of IBM’s strategy for gaining customers from Oracle is to make DB2 more compatible with Oracle. They say imitation is the greatest form of flattery, so I am not sure if IBM is paying Oracle a huge compliment here? But more seriously, my concern about this strategy is that I believe Oracle is very much aware of, and in control of, their wins & losses and can put in preventative measures when they so desire to block any major hemorrhaging. IBM, I don't think you want to put too much focus on chasing Oracle's cast offs. DB2 is also good in its own right and you need to do a better job of showcasing the platform to ISVs if you want to retain your pride of place.

Although, at least this may allow ISVs to more easily support DB2 alongside Oracle.

XtremeData

XtremeData is yet another vendor to enter the MPP analytics space. XtremeData is worthy of note because their product is built upon their unique FPGA. Unlike other FPGAs I have seen, I understand that theirs plugs into a spare CPU socket in the server. The FPGA can then provide pushed down data streaming operations on data at rates available to the CPU bus (instead of the PCI bus that some other FPGA approaches use). I haven’t yet seen any benchmark data for what this translates into, though.

When I spoke to XtremeData their focus seemed to be very much on the very high end.  Large deployments of many nodes, in many racks, handling many hundreds of TB (or PB).  As I have spoken about before, the MPP space is very busy right now.  Most of the companies are naturally focusing on the mid-range MPP needs, so maybe focusing on the very large end is a smart way to differentiate.  This of course may change as they ramp up and I will be curious to see if there actually is a sustainable market at this very top end.

NoSQL

There has been a lot happening in the NoSQL technologies (Mongo, Cassandra, Voldemort etc) which I will comment on in other posts. But an annoying thing, which can sometimes happen with community open source initiatives, is that the level of infighting and bickering has been rising steadily. And this is not even over important technological decisions. As an example, a lot of the bandwidth of the NoSQL mailing list is taken up debating what to call themselves (which degraded into personal attacks and name calling at one point): NoSQL vs many other things, and even what the definition of NoSQL is. This really highlights to me the importance of the commercialized organizations surrounding this technology continuing to provide the necessary beacons to focus on and move this initiative forward.

November 06, 2009

Back from Hiatus - Summary Update 2

GoodData

GoodData has launched and they are providing a cloud based analytics platform for integration with online apps. They are starting with an initial focus on SalesForce data, but are working hard on expanding the list of ISVs who choose to provide their customers analytics via GoodData.

GoodData was started by “good guy” Czech serial entrepreneur Roman Stanek (NetBeans) and has just raised funds from Andreessen Horowitz and appointed Tim O’Reilly to the board. GoodData is interesting because it is simple, accessible and available on demand. It is still early days, but I think Roman is on to another winner here. I certainly recommend any ISV building cloud based apps look at their platform.

Mark Logic

I was keen to learn more about Mark Logic as I didn’t understand their products in any detail. David and Ron were more than obliging and I sat down with them last week for a run-through.

In short, I am impressed by the technology of Mark Logic. It is a database that uses XML as the schema data model and XQuery as the primary query language. But it is far more than an XML extension bolted on top of a traditional db engine (such as some of the XML capabilities offered by the more traditional RDBMS vendors). Internally Mark Logic has all the important DBMS components (query processor, indexing etc) but they are designed and optimized around the XML schema from the ground up. I also understand they have distributed multi-node capability, something which is still quite rare over in the general purpose RDBMS world.

Mark Logic has a history in the content publishing market, as you would expect, because much “published” data is naturally represented in XML. I did sense the team at Mark Logic is keen to break away from this niche a little (while at the same time respecting that this will likely remain their primary market). Exactly how they go about this isn’t entirely clear to me, as the world has kind of moved on from the “XML for everything” excitement that existed in the early 2000s. There will be plenty of case-by-case requirements, but it is hard to drive business development in a piecemeal market. Publishing remains a clear staple, though, and I am sure they can leverage this into a few more markets.

I did get somewhat excited when we were talking about serializing JSON in and out of Mark Logic. This is very topical in the web app market as we see a push towards client based web applications and web services dishing up JSON. But this is not necessarily a money spinner as there are “free” offerings servicing this need already (CouchDB, MongoDB etc). I understand Mark Logic is under a proprietary license so it might be hard to gain traction here.

Kognitio

I spoke briefly with Kognitio a couple of weeks back. I hear very little about Kognitio so I was keen to speak to them about their progress. Kognitio is a UK based company and provides a data warehouse appliance. While they only launched in the US last year, they have a much longer history in the UK.

Kognitio seems to be taking a different approach to achieving growth from the one many of the US vendors are using. While most of the US companies are venture backed and are pushing hard to gain market share, Kognitio is privately backed and seems to be taking a slower and more methodical approach. This has obviously served them well in the UK, but it will be interesting to see how that plays out in the highly crowded, highly competitive US data warehousing scene. It may turn out to be a true test of who really does win out of the tortoise and the hare.

Infobright

The big news at Infobright is that Miriam is no longer CEO and she has been replaced by a temporary CEO, board member Mark Burton. I spoke with Mark a couple of days ago and the reasons cited were around future direction and the next stage in the company’s lifecycle etc. They are still sorting this all out and expect to be ready to start discussing their new direction in a few weeks. In saying that, when we spoke I got the feeling their positioning will still be very tied to the MySQL customer base, something I tend to disagree with. But it would be premature to speculate, so I will instead wait until further information is available.

© Tony Bain