
Big Data – Hadoop HDFS and MapReduce


The big data buzz is growing day by day, so here is a more detailed look at the two core pieces of Hadoop: HDFS and MapReduce.

HDFS, the Hadoop Distributed File System, is designed to store very large amounts of data across many servers in a cluster. How large is "large" needs no explanation when we are talking Big Data. Data written to a Hadoop cluster is broken down into blocks (64 MB by default) and distributed across the nodes of the cluster.
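
For example, once a file has been copied into HDFS you can ask for its block layout with fsck. A minimal sketch, with a file name and path chosen here purely for illustration (the exact output varies by Hadoop version):

    hadoop fs -put access.log /user/hadoop/access.log
    hadoop fsck /user/hadoop/access.log -files -blocks -locations

The second command prints each block of the file, its size, and the data nodes holding its replicas.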

Blocks are placed in the cluster using a rack-aware placement policy. The rack-aware policy decides which nodes (and racks) each block replica is written to, based on the replication factor, which is 3 by default.
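
Both the block size and the replication factor are configurable per cluster (and can be overridden per file). A typical hdfs-site.xml snippet might look like the following; the property names are the Hadoop 1.x-era ones and the values are just examples:

    <property>
      <name>dfs.replication</name>
      <value>3</value>            <!-- replicas kept for each block -->
    </property>
    <property>
      <name>dfs.block.size</name>
      <value>67108864</value>     <!-- 64 MB in bytes -->
    </property>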

The basic architecture of an HDFS cluster consists of two major types of nodes:

1. Name Node:

This is similar to the Master Node in the Greenplum database, the "master" in the master-slave concept. The name node manages the file system namespace: it maintains the file system tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image (fsimage) and the edit log.
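
On disk this usually appears as a handful of files under the name node's metadata directory. The listing below is only illustrative; the exact layout differs between Hadoop versions:

    $ ls ${dfs.name.dir}/current/
    VERSION  edits  fsimage  fstime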

Now the question arises: what happens if the single name node crashes (since there is only one primary name node)? Because the primary name node is a Single Point of Failure (SPOF), Hadoop provides a secondary name node, which periodically copies the FsImage and EditLog from the name node and merges them into a new checkpoint.
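
How often the secondary name node takes such a checkpoint is configurable. In Hadoop 1.x-style configuration this is controlled by a property like the following (the value shown, one hour, is just an example):

    <property>
      <name>fs.checkpoint.period</name>
      <value>3600</value>   <!-- seconds between checkpoints of fsimage + edits -->
    </property>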

2. Data Node:

These are the workhorses of HDFS. They store and retrieve blocks when they are told to (by clients or the name node), and they report back to the name node periodically with the lists of blocks they are storing. The data nodes are where the vast majority of the data resides.
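
You can see which data nodes the name node currently knows about, and how much data each one is storing, with the admin report command (Hadoop 1.x syntax):

    hadoop dfsadmin -report

This lists every live and dead data node along with its configured capacity and the DFS space used.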


MapReduce is the second major part of the Hadoop architecture; it is the programming logic, or the brain, as I like to say. The MapReduce model was introduced by Google for parallel processing of large data sets, and Hadoop's implementation of it is written in Java.

The MapReduce programming model works in two parts: the mapping part (done by the Mapper) and the reduction part (done by the Reducer).

The Mapper works on the blocks of data stored in the data nodes and processes them into intermediate results. You can think of a Mapper as an individual worker (in the master-slave sense), working on its own portion of the client's data.

The major task that remains is to aggregate the results produced by each Mapper. This is the Reducer's job: it iterates over all the intermediate results for a key and sends back a single output value for it.
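
As a concrete illustration, here is a minimal sketch of the classic word-count job written against the standard org.apache.hadoop.mapreduce API (class names and structure follow the usual tutorial example; treat it as a sketch rather than production code). The Mapper emits a (word, 1) pair for every word it reads, and the Reducer sums those counts per word:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Mapper: works on one block/split of input and emits (word, 1) for every word.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);   // intermediate key/value pair
                }
            }
        }

        // Reducer: receives all values for one word (after shuffle/sort) and sums them.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);     // single output value per key
            }
        }
    }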

A MapReduce job goes through several intermediate stages. Let's have a look at the following diagram:

From the diagram above we can see that the user gives something as the input – in this case a question and its subsequent answer. These files are stored in the data nodes of HDFS. The MapReduce program looks at the given data and breaks it down into an intermediate stage consisting of key/value pairs, so that the file's contents become many key-value pairs. (If you studied Compiler Design in college, this key-value stage may remind you of lexical analysis, semantic analysis, and so on.) After this stage the sorting, or shuffling, of the data takes place. It is hard to see from the diagram alone, but if you look at the second part of the picture you will understand why the sorting phase is needed: the intermediate data is spread across various servers or nodes, and MapReduce makes sure it is shuffled and sorted by key so that all values for the same key come together. Then comes the reducer phase, which accepts the data from the sort/shuffle phase and combines it into a smaller set of values. This final output is sent back to the user/client.
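
To make the flow concrete, here is a tiny word-count walk-through (the input line is made up purely for illustration):

    Input line:     to be or not to be
    Map output:     (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
    Shuffle/sort:   (be,[1,1]) (not,[1]) (or,[1]) (to,[1,1])
    Reduce output:  (be,2) (not,1) (or,1) (to,2)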

This entire process is controlled by the JobTracker, which coordinates the job run and makes sure everything goes smoothly, while the TaskTrackers run the individual tasks that the job has been split into.
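
For completeness, a job like the word count sketched earlier is packaged into a jar and submitted with a small driver; the JobTracker then schedules its map and reduce tasks on the TaskTrackers. The driver below is a sketch in the Hadoop 1.x style (newer releases use Job.getInstance), and the jar name and HDFS paths are only examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver: wires the Mapper and Reducer from the earlier sketch into a job
    // and submits it to the cluster; the JobTracker takes it from there.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /user/hadoop/input
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /user/hadoop/output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

It would then be launched with something like: hadoop jar wordcount.jar WordCountDriver /user/hadoop/input /user/hadoop/output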

So this is a brief description of HDFS and MapReduce. I didn't go very deep into the core functionality of MapReduce, as that requires full-scale knowledge of the Java programming language, but I hope this gives a short yet useful explanation of Hadoop. Thanks and take care.

