Despite Microsoft’s considerable investment in SQL Server 2005 if never captured the strong market momentum that SQL Server 2000 did. The problem, essentially it was a victim of its own previous success. SQL Server 2005 for all its merits was not significantly compelling enough, in a lot of cases, to motivate organizations to upgrade existing installations. New database applications installations used 2005, which has seen SQL Server 2005 steadily grow its install base. But here we are 4 years after release and most production databases still remain on SQL Server 2000 (fast approaching its 10 year aniversary). My own data shows the breakdown as follows:
Production SQL Server Installs
54% - SQL Server 2000
45% - SQL Server 2005
1% - SQL Server 2008
Those SQL Server 2000 installations are now quickly reaching end of life. Their hardware is out of date (32bit), and the database platform is now out of date. Upgrades are becoming a forced requirement. SQL Server 2005 will now be skipped with upgrades going straight to SQL Server 2008.
With millions of production SQL Server 2000 installations still in existence, I see a strong period of growth for those making tools that assist with the upgrade or consolidation of SQL Server databases.
The two primary trends in data management that have been happening for as long as I can remember are:
The expectations of the volume of data we are can produce and consume is growing rapidly
The expected delay between data production and consumption are decreasing rapidly
We have seen ‘typical’ data volumes of databases grow from MB through GB to a point currently where TB databases are common, and PB databases are the “big guys”. But at the same time we have seen the expectations around the timeliness of response from these databases also change. What used to be a monthly report became a weekly, then a daily and finally it is not uncommon to have near real-time expectations for databases in terms of data retrieval and analysis. We have been on a continual path towards the point where data is consumed at the same moment in which it is created, either in raw form or in an aggregated or otherwise processed state.
At the other end of the application stack, our ability to move more data around faster has led to new styles of applications that provide users near immediate access to data as it is created. Popular consumer web examples of such applications include Facebook, Twitter, Friend Feed etc.
But at the moment these applications aren’t real time, they are near real time. This means there is a delay of some form between data creation and consumption. These delays may be very short or several minutes depending on the particular application and its current workload. These delays may seem irrelevant for the above mentioned apps, but the difference between “near real-time” and “real-time” can have a significant impact on the application functionality. I am sure we have all been frustrated when checking in at the airport and choosing a seat, only to get the “sorry that seat is no longer available” once you click the ok button for your selection for example.
The Problem with the RDBMS
The problem with the traditional RDBMS is that it is not a real time system. It is poll based. This means a query is constructed, submitted and the results are returned to the application. This itself may happen very quickly, maybe only a few ms to execute and receive a resultset. However the problem is of course, the data is only “valid” for the exact moment when the query was executed. From that moment onwards the data becomes stale and numerous changes could be happening on the data within the RDBMS while the extracted resultset is processed.
NOTE: Yes I am aware that the disconnected approach is modern and a server side cursor approach used to be common. We moved away from server side results processing for scalability purposes, but regardless even with server side resultset processing you weren’t automatically updated with the data changed.
Using my example above, while I am deciding if I want a window or an isle or if it is better to have a middle seat at the front of the plane or an isle at the back, the underlying data set could be receiving numerous updates. When I finally make my selection the dataset could be completely invalid requiring me to start the whole process again.
While this is a very simplistic example, the issue here is the trend towards real-time in the user experience layer is not supported by the current interfacing mechanisms to a RDBMS. While we are seeing AJAX etc being used to provide an interface which can update data in real time, underneath likely that data is still being collected from polled queries running intermittently.
Real time & Efficiency
One solution to this problem may be simply to run our polling cycles at such a high rate that the difference between real-time and near real-time becomes indistinguishable. This is possible but of course, it comes at a high cost in terms of impact on scalability.
Let me use a fictitious example to highlight this. Imagine a Twitter like messaging system. This system is to provide a real time like experience to their users so they set a 2 second polling cycle for all client update queries.
For the purpose of this example, let us assume that we have 1 million users. Those 1 million users have a different usage profiles, for this example let us assume that:
50% of users get 1 message a day
20% of users get 10 messages a day
15% of users get 30 messages a day
10% of users get 200 messages a day
4% of users get 1000 messages a day
1% of users get 5000 messages a day
Ok, a couple more assumptions:
To poll and retrieve an empty poll requires 5 “resources” (CPU, DISK, NETWORK)
To poll and retrieve a message requires 50 “resources” (CPU, DISK, NETWORK)
Now let’s compare a system which polls the database every 2 seconds with an alternative system in which messages are pushed from the database on creation to the client on creation.
% User Base
Replies per day
Poll Resources
Push Resources
Push % of Poll
50
1
108025000000
25000000
0.0%
20
10
43300000000
100000000
0.2%
15
30
32625000000
225000000
0.7%
10
200
22600000000
1000000000
4.4%
4
1000
10640000000
2000000000
18.8%
1
5000
4660000000
2500000000
53.6%
100
221850000000
5850000000
2.6%
With the above distributions we would see that a 2 second poll time would have a resource requirement equal to 38x a push based database. This huge overhead is obviously going to be a major overhead and a significant limitation to the upper level of scalability possible.
So What to Do?
I will really address the resolution path for the limitations of the RDBMS when I complete this series in my summing up post. However specific to this issue, there are a couple of things happening which you should be aware of.
Firstly, traditional RDBMS vendors are trying to shoehorn some form of push based results notifications into existing database platforms. For example, SQL Server 2005 and above has query notifications and Oracle & MySQL has something similar (please post in the comments). Current implementations are rudimentary and not suitable for large scale deployment (meant more as a global cache “refresh” event than a user specific resultset update).
Also to watch, there are a couple of startups which have identified the real-time trend that is happening in Silicon Valley, and have also identified that existing RDBMS’s aren’t going to be able to fulfill this trend in current form. They are focusing on re-architecting the RDBMS to be push rather than pull based. GroovyCorp with their SQL Switch product is an organization that I have been speaking to recently. Groovy is the furthest down this particular road that I am aware of, with a real-time push based RDBMS being launched next month.
Todd Hoff has posted a link to a Los Alamos National Lab presentation on Graph Databases. In this paper they provide a revisit on the classic RDBMS vs Graph database debate.
The Relational Database hasn’t maintained its dominance out of dumb luck. Instead the RDBMS has consistently outperformed while providing the most general use capability of all the variety of platforms that have been available. Many other approaches have been tried, often these have provided better object model integration (OODBMS) or better data model representation. But when the rubber has hit the road they have failed on one or more of the key staples of a DBMS – performance, scalability, security, reliability, recoverability & ease of use.
Right now there seems to be more focus and traction than ever before to get it right. Graph databases are interesting and clearly have value in solving the hierarchal abstraction problem currently encountered when modeling such structures in the RDBMS. In other aspects they do share some similarities with the hybrid DHT’s. I think a mix of the best of several approaches will be something interesting (of course it will have to perform extremely well and have great developer support).
It’s such an interesting time to be in data management.
The last 24 months has seen the introduction of Map/Reduce functionality into the data processing arena in various forms. Map/Reduce is a framework for developing scalable data processing functionality, and was popularized by Google (see this earlier post).
Pure players like Hadoop are starting to find their own niche, helped by organizations such as Cloudera. However there has been a number of for & against arguments relating to Map/Reduce functionality inside the database.
These arguments are now really serving a moot point. Customers have recognized value in Map/Reduce prompting some (b)leading edge database vendors to introduce such functionality into their platforms (primarily database vendors providing analytics platforms). Even some of the database platform vendors who were very critical of Map/Reduce 12 months or so ago have softened their position, either embracing Map/Reduce or admitting that Map/Reduce does has benefits in some scenarios for large scale data processing and analytics. If customers see the value of having Map/Reduce in the database and are excited by it, then I don’t want to spend any more time debating if it should be there or not.
Our attention needs to move along from debating if Map/Reduce is something we should have in our database toolset or not. We now need to start thinking about how we use this new tool effectively and what new possibilities Map Reduce opens up.
Tony Bain is an expat Kiwi, Father, Entrepreneur, Angel Investor, Blogger, and occasional Writer for Read Write Web. He is a Director for RockSolid SQL and the founder of Tony Bain Group.