Big Data – Parallel Programming – MapReduce and Hadoop


(T) One of the many challenges of providing cloud-based consumer Web services is first to invest in the required computing infrastructure, and second to maximize the ROI of that infrastructure. Parallel computing is the way to go to distribute the tasks and loads of a cloud-based application over many computers to concurrently leverage hundreds if not thousands of CPUs, storage, I/Os. And, parallel programming is the way to design the cloud-based application that will run over large sets of distributed resources.

To simplify data processing on large clusters, two Google engineers Jeffrey Dean and Sanjay Ghemawat imagined the MapReduce concept. MapReduce enables to perform simple computations while hiding the details of parallelization, such as data distribution, load balancing, and fault tolerance. Google has implemented MapReduce for processing large amounts of raw data, such as crawled documents and Web request logs. A typical computation of MapReduce processes many terabytes of data on thousands of nodes. One of the advantages of MapReduce is that it is easy to use. MapReduce programs are as easy as parallel programming can be.

The basic principle of MapReduce is its division of the computation into two parts: a Map, and a Reduce. Map basically processes all the data, splits them into sub-parts, and sends the sub-parts to different nodes, so that all the pieces run at the same time. Reduce takes the results from the sub-parts, and combines them back together to supply a single output data.

This model comes from the Map and Reduce combinators from Lisp and many other functional languages. In Lisp, a Map takes as input a function and a sequence of values. It then applies the function to each value in the sequence. A reduce combines all the elements of a sequence using a binary operation.

As in Lisp, MapReduce takes conceptually inputs as a list of records. The records are split among the different machines by the Map. The result of the Map computation is a list of key/value pairs. Reduce takes each set of values that have the same key, and combines them into a single value. So Map takes a set of data sets, and produces key/value pairs; Reduce merges the results so that instead of a set of key/value pair sets, you get one final result.

MapReduce has been implemented into the open source software Hadoop. It has received many contributions from major Web-based consumer companies such as Yahoo, Facebook, and LinkedIn. Written in Java, Hadoop scale to petabytes of data implements a MapReduce engine, and its own Hadoop Distributed File System (HDFS). On July 2011, Facebook announced that their Hadoop cluster had grown to 30 petabytes of data! A commercial version of Hadoop is marketed by Cloudera.

Copyright © 2005-2012 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com.

Categories: Back-End, Big Data