So it’s that time of year again when everyone puts out their predictions for the year ahead. I think predictions are a bit of a waste of time: to be interesting they have to be big, but a year really isn’t all that long, so actual changes over the course of 2009 are likely to be just small progressions. So instead I have been thinking about the top issues that we face heading into 2009, and here is my Top 10 list of issues in Data Management. In this post I avoid offering solutions to these issues; I have several ideas, but these can be the subject of subsequent posts.
10 - Limits on Scalability
While scalability is on my list, it is at number 10 because, contrary to popular belief, scalability is only an issue for a very small number of data-based applications. Almost all data-based applications in use today can be scaled without major issue by increasing the underlying hardware resources. But for those applications for which it is an issue, it is usually a major one, and the most common category with such scalability limitations is internet applications and web-based services.
The problems with scaling this type of application are, firstly, that the scalability requirements are hard to predict in advance and, secondly, that they can change instantly based on sudden popularity (aka the Slashdot effect). So currently you either over-invest in infrastructure and hope for the growth, or you under-invest and hope you have time to rapidly add capacity if/when required. When planning “corporate” applications you usually have the benefit of capacity planning projections and the ability to manage rollouts so that scalability limitations can be managed. You can try to do this in the web world too, but by its nature any projection you make today can be meaningless tomorrow.
Some cloud database platform vendors are looking to help alleviate this by providing solutions that offer on-demand scalability. These solutions come in two flavors: firstly, running existing database platforms on a virtualized infrastructure and, secondly, providing a non-relational key/value pair repository (SimpleDB, SSDS etc) where all underlying infrastructure and database operations are abstracted. While these represent positive steps forward, I think completely abstracting infrastructure scalability away from the application remains an open issue that will receive attention in 2009.
9 - Constantly changing landscape
Vendors are changing/updating platforms faster than organizations and their application vendors are willing or able to respond with migrations to the new platforms. Combined with applications that only work with a particular vendor, this leaves many organizations with multiple data management platforms and multiple versions of each platform in production use. For example, enterprise organizations that use SQL Server often have SQL Server 2000, 2005 and 2008 servers in production (as well as multiple flavors of Oracle, DB2 and so on). Different vendors and different versions within vendors create a major management overhead. This impacts areas such as consolidation, required skill sets, management processes/policies and tool sets. It also restricts the amount of value-adding work that data management teams can take on, because their limited resources are consumed by operational management.
8 - Data Recovery
Recovery is something that people have focused on since we first started collecting electronic data, so how is this still an issue? Well, as the demands on electronic data grow, the factors that control how data is recovered in the event of failure become more complex.
It is easier to recover an entire data set than just a specific part of it. This is fine if you lose an entire data set but not so fine if you need to recover only a part of one. New factors such as pressure to have 24x7 availability and an increase in purely electronic transactions with no paper “backup” mean that recovery of individual records/transactions can presently be a very difficult and time-consuming process. Due to the complexity of transaction-based recovery, most database applications no longer “delete” any data; instead they use a status flag to indicate that data is in a deleted state. Changes to the data are also logged as change history, which is itself rarely deleted. This allows for application-based recovery of that transaction but contributes significantly to the increasing data volumes being experienced. Finding ways to ensure individual pieces of data can be rolled back without keeping them in the highly used data set forever is a complex issue that will take time to resolve.
7 - Increasing Data Volumes
Data volumes are growing rapidly. This graph shows the average database size across 3000 databases for about 30 random organizations. It shows a 12% increase in the average database size in the last 6 months of 2008.
While storage is cheap, such rapidly increasing data volumes bring issues other than just pure storage costs: backup/recovery time frames, maintenance time frames, query/access performance, and increased CPU and memory requirements (for batch processing etc). This would be less of a concern if the data was adding value (see the low leverage of data topic below). Today, however, much of the data in most database applications is inactive historical data that is there solely for future reference, should it ever be required. Managing these vast volumes of data so that the balance between data availability and manageability is maintained presents an interesting issue.
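To put the 12% figure in perspective, here is a small sketch of what it would compound to if the same rate simply continued. That constant rate is an assumption made for illustration only, not something the data shows, and the function names are made up for this example.

```python
import math

def annualized_growth(rate_per_period, periods_per_year):
    # Compound a per-period growth rate over a full year.
    return (1 + rate_per_period) ** periods_per_year - 1

def years_to_double(rate_per_period, periods_per_year):
    # Number of periods needed to double, converted to years.
    periods = math.log(2) / math.log(1 + rate_per_period)
    return periods / periods_per_year

# 12% per six-month period, two periods per year.
print(f"{annualized_growth(0.12, 2):.1%} per year")
print(f"{years_to_double(0.12, 2):.1f} years to double")
```

Under that assumption, 12% every six months works out to roughly 25% a year, with the average database doubling in size in about three years.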
6 - Lack of trained people
Finding qualified and experienced people in IT has always been an issue, and in the area of data management this is especially true. Data management is a niche field, and the rapid increase in demand for data management experts has not been met by an increase in the number of skilled people in the workforce. This has led to skills shortages and to organizations being forced to take on progressively less experienced and less qualified staff.
Lack of skilled resources has a flow-on effect which impacts all aspects of data management, including performance, security, availability and recoverability. Managing an increasingly complex set of data management problems with a shrinking pool of expertise is another interesting issue heading into 2009.
5 - Inefficient use of Resources
Modern database systems are typically implemented in a manner which sees a physical server provide a platform for one or a small number of applications. This approach has had its benefits from an implementation perspective: the complete costs (hardware, software, license, management) can be easily calculated and assigned to a project/department, the risk of impacting other applications is low, sizing an environment is easier as only a single workload is taken into consideration, and so on. However, due to the continual increase in applications, this has led to large numbers of database servers being implemented. There are commonly dozens of such servers in small organizations, hundreds in medium-sized organizations and even thousands in large enterprises.
One of the several issues with this approach is that it leads to significant inefficiency in resource utilization across the entire infrastructure. If we define “resource usage” to mean the percentage of available CPU, Memory and I/O capacity in a server we can graph a typical analysis of usage in an enterprise:
This shows us that a typical picture is around 50% of all servers have a 10% or less average resource usage (i.e. half the servers are only having at most 10% of their available resources used). 30% have between 10 and 20% usage and so on down to 5% of all servers having 90% or more of their available resources utilized. This means across the entire environment there is an average resource usage of 32% meaning 68% of resources are not being put to work.
4 - Auditability
Organizations are under increased pressure to audit every action that a user performs within a database. This is due to increased focus on security, risk, accountability and avoidance of fraud and corruption. While preventative security measures (logins, firewalls, tokens etc) are important to stop unauthorized access to the data in the first place, as this survey shows, most breaches are caused by users who are authorized but are either negligent or malicious.
image from ars technica
The problem with auditing is that it itself generates a lot of data, potentially much more than even the database that it is monitoring. The act of auditing can also place a significant performance load on the database being audited. I think this is a particularly important issue facing enterprise organizations, especially heading into 2009.
3 - Data Security
Security is always an issue for data. The biggest issue facing data security, I think, is not actually securing the information that is contained within a database, but ensuring security is maintained on the data once it has left its original data source. As we expand the ways in which a piece of data may be consumed or modified, through methods such as APIs, web services and other integration means, it is currently up to each step in that integration chain to ensure appropriate security is provided.
To date this is typically controlled by users making access decisions for services based on the data they own, and in some cases by a mixture of specific copy/access protection systems for data the users don’t own (such as DVDs, music and software). However, heading into 2009, the question of how groups of data can be secured universally, and have that security survive distribution, remains open.
2 - Decentralized Data Management
When we used to talk about data management we almost always were talking about database management. Even today this is mostly the case. But this is changing. Data is becoming more distributed and the source of data to a “data consumer” may not necessarily be a database but instead may be a “data service”. How the data service is made up and if it has a database under the covers is often irrelevant to the consumer as the service rightly abstracts the underlying architecture.
Data services are commonly used today as data sources or destinations, and data is pushed and pulled from service A to location B through integration processes (e.g. old world ETL processes). Due to data volumes and the need to have timely data, pressure is mounting to instead build applications that use the data services directly as the data source and perform any necessary integration in real time. This reduces unnecessary data duplication and increases timeliness, but creates a whole set of issues to do with consistency, recoverability and availability that will require resolution.
1 - Low leverage of data assets
Organizations have a lot of data. A quick survey across some random businesses finds that on average an SMB has about 500GB of data in total in its database systems. An “enterprise” is much more difficult to average, as this will range from tens of TB well into the PBs (petabytes). And of course there are also large volumes of data outside of the database systems, in files, email etc. But at the moment, for most organizations, much of this data is used purely operationally, which means the data is being used for the application in which it was created but no additional value is being derived from it through wider analysis.
It used to be that competitive advantages were gained by moving inefficient manual processes to more efficient alternatives through the use of technology. Heading forward, I think these gains will be smaller, and the larger competitive gains will come from using the collective knowledge in this data to understand and serve customers better.
And that is my list of the top 10 issues in data management. I had a starting list of about 30 issues which I whittled down to 10. I am reasonably happy with this list, but on a different day maybe a few in here would have been swapped out for some alternatives. Anyway, I look forward to your comments.
NOTE: The reference data I use in this post is just data I have quickly pulled together or observed, it hasn’t been formally researched or validated and should not be considered fact. This is a blog post not a research project!