What is MapReduce?
There has been a lot of recent talk about MapReduce, particularly in relation to its addition to several of the specialized data warehouse platforms. In this post I will try to answer the question of what is MapReduce and how does it relate to a relational RDBMS (at a high level).
In simplicity MapReduce is a framework that allows developers to write functions that process data. There are two types of key functions in the MapReduce framework, the Map function which separates out the data to be processed and the reduce function which performs analysis on that data. MapReduce is a logical concept, it is not a technology owned by any one vendor but it was made popular by Google as it is part of their search engine core technology.
A very simple example of MapReduce that is commonly used is a process that counts the occurrence of a particular word in a document. The “Map” function in this example would produce a set of data that contained all occurrences the desired word from the source data, the “Reduce” function would then count the number of items produced by the Map function are return a numeric value indicating the total number of word occurrences.
While MapReduce seems very simplistic in its structure, it has been found that this type of 2 stage processing can be used to answer a large number of varied data analysis questions.
The real benefits of MapReduce start to occur when the framework for the execution of its functions is implemented in a large scale, shared nothing data cluster. The platform that implements a MapReduce framework can abstract the complexity of running distributed data processing functions across multiple nodes in the cluster. This allows a developer with no specific knowledge of distributed programming to create their own MapReduce functions and have the platform run those functions in parallel across multiple nodes in the cluster, and then also handle the gathering of the results from across the cluster to return a single result or set. The platform can also abstract important cluster functions such as dealing with nodes that fail during execution, or nodes that are slow to respond during execution.
Again Google is a common example of a large scale implementation of a MapReduce framework. Google is said to run clusters of 10,000 + shared nothing nodes with PetaBytes of data, yet a developer can reportedly write a query that performs analysis across data in these clusters within a relatively short space of time (e.g. 30 mins) as they are purely concerned with the data analysis details not the physical execution details.
MapReduce and RDBMS
MapReduce’s integration with traditional RDBMS’s is now occurring, with some initial implementations been undertaken by some of the specialist data warehouse platform vendors such as Greenplum.
Initially the benefits of MapReduce within a RDBMS are less clear as a distributed shared nothing RDBMS already has a mechanism for abstracting the execution of requests across nodes from the developer, through the use of SQL and the underlying query processor. There is a lot of debate occurring in both academic and commercial fields as to if the inclusion of MapReduce in a relational RDBMS is a positive measure or not.
For SQL and the query processor to be effective, the data has to be structured and have a pre-existing, well defined schema (tables, columns etc). Once this is done optimization techniques, such as indexing, can be implemented making SQL an optimal method for accessing and processing that data. However when the data is unstructured (i.e. without predefined schema) the ability to process that data using SQL becomes a lot more limited. MapReduce on the other has no assumes no predefined schema which allows the developers to create functions that make their own assumptions about the schema definition of the data they are accessing. It is common for a RDBMS to store unstructured data in the form of a BLOB (binary object), examples of this being documents, images, XML files, binary files. A MapReduce function can be written to perform data processing across that unstructured data, for example counting words in a document or finding nodes in XML files.
Much of the debate is centered on if the data itself should be unstructured or not, i.e. with or without schema. If the data does not have a pre-defined schema (unstructured), then processing on that data cannot be validated by an external mechanism, such as the RDBMS. Therefore it is up to the individual functions that make use of that data to verify its integrity. Also without predefined schema, optimization mechanisms such as indexes cannot be created which means the processing of the MapReduce function uses a “brute-force” approach and scans all relevant data across all relevant nodes during the execution of the Map function.
However when the data is structured into tradition database tables and columns, the benefit of MapReduce over traditional SQL for processing that data is less clear. Some of the discussion in this area focuses on the fact that MapReduce functions can be written in native programming languages (perl, java etc) which are more familiar to developers than the SQL language. This argument seems a little weak as due to the universal implementation of SQL across all RDBMS platforms, any experienced developer will have considerable SQL experience.
MapReduce appears to be a simplistic framework for data analysis across shared nothing clusters. The clear benefit of MapReduce is when that analysis is taking place across unstructured data. When structured data is being analyzed the use of SQL and the query processor seems preferable as this makes use of countless optimization techniques, such as indexing and join processing. And finally, much of the debate against the use of MapReduce is less focused on MapReduce itself and more focused on if the underlying data should in fact be structured into table/columns within a traditional RDBMS.