Big Data: Parallelism and Hadoop Basics


Let me start this blog by putting two scenarios in front of you:

Scenario I: You are given a bucket full of mixed fruits. There are three kinds of fruit, say apples, mangoes and bananas. How would you count the total number of apples, mangoes and bananas in the bucket?

The simplest answer is to go through the fruits one by one, keeping a tally for each kind, until the bucket is empty.

Scenario II: Now suppose that instead of a bucket you are given a truck full of mixed fruits. How would you count each kind of fruit this time?

The most feasible approach is to divide the work instead of counting the entire truck one by one. We fill baskets with the mixed fruits and hand one basket to each person [WORKER/SLAVE]. Each person counts their own basket, without needing to communicate with the others, and in the end we [MASTER] sum the per-basket results to get the totals. This approach saves time and effort [I hope you would agree].

If you are wondering why I started with these scenarios, it is because HADOOP is built on this simple principle. In technical terminology, the second scenario describes parallel processing, or distributed computing. A parallel processing system follows a Master-Worker concept: the Master divides the work, each Worker does its allotted share, and the result of each Worker is sent back to the Master.
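To make the Master-Worker idea concrete, here is a minimal Python sketch of the fruit truck. It is only an illustration of the principle, not how Hadoop itself is implemented: the function names (`split_into_baskets`, `count_basket`, `count_truck`), the basket size and the use of a thread pool to play the workers are all my own assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_baskets(truck, basket_size):
    """Master step: divide the truck into baskets of at most basket_size fruits."""
    return [truck[i:i + basket_size] for i in range(0, len(truck), basket_size)]


def count_basket(basket):
    """Worker step: count one basket without talking to any other worker."""
    return Counter(basket)


def count_truck(truck, basket_size=4, workers=3):
    """Master: hand out the baskets, then sum the per-basket counts."""
    baskets = split_into_baskets(truck, basket_size)
    # Each thread stands in for one person counting one basket at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_basket, baskets))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds counts together
    return total


# Hypothetical sample data: a very small "truck".
truck = ["apple", "mango", "banana", "apple", "banana",
         "apple", "mango", "apple", "banana", "mango"]
totals = count_truck(truck)
print(dict(totals))
```

With this sample truck the ten fruits are split into three baskets, and the summed result is 4 apples, 3 mangoes and 3 bananas, the same answer you would get by counting one by one.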

The situation with BIG DATA is similar. There is far more data available (just like the truck of fruits) than a single machine can handle, and on top of that comes the 3-V factor of BIG DATA [volume, variety and velocity]. To handle this situation Apache came up with HADOOP, a distributed storage and processing system that can store many kinds of data from many sources at very large scale and run sophisticated analysis on it.

Hadoop architecture is mainly based on the following two components:

1.       HDFS [Hadoop Distributed File System]:

It is the storage layer of Hadoop. Whenever data arrives at the cluster*, the HDFS software breaks it into pieces and distributes them to the participating servers in the cluster.

2.       MapReduce:

As the data is stored as fragments across various servers, MapReduce runs the required computation on each server's local data and later returns the results to the Master server. The computation happens locally and in parallel across all servers in the cluster [Master-Worker concept].
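The two components can be imitated in a few lines of plain Python. This is a toy sketch under my own assumptions, not the Hadoop API: the names `split_into_blocks`, `place_blocks`, `map_phase`, `shuffle`, `reduce_phase` and `run_job` are hypothetical, the round-robin placement is a simplification, and everything runs in a single process. Real HDFS uses much larger blocks and also keeps replicas of each block, which is not modelled here.

```python
from collections import defaultdict


def split_into_blocks(records, block_size):
    """Storage step: break the incoming data into fixed-size pieces."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def place_blocks(blocks, servers):
    """Storage step: spread the pieces over the servers (round-robin, an assumption)."""
    placement = {server: [] for server in servers}
    for index, block in enumerate(blocks):
        placement[servers[index % len(servers)]].append(block)
    return placement


def map_phase(block):
    """Map: runs where the block lives and emits one (fruit, 1) pair per record."""
    return [(fruit, 1) for fruit in block]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values of each key to get the final counts."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(records, servers, block_size=3):
    """Store the data as blocks, map each block on its server, then reduce."""
    placement = place_blocks(split_into_blocks(records, block_size), servers)
    emitted = []
    for server, blocks in placement.items():
        for block in blocks:  # in a real cluster these loops run in parallel
            emitted.extend(map_phase(block))
    return reduce_phase(shuffle(emitted))


# Hypothetical sample data and server names.
records = ["apple", "mango", "banana", "apple", "banana", "apple", "mango", "apple"]
result = run_job(records, servers=["server-1", "server-2"])
print(result)
```

For this sample the eight records become three blocks on two servers, and the job counts 4 apples, 2 mangoes and 2 bananas. The point of the sketch is the shape of the flow: the data is split and placed first, and the computation then goes to the pieces rather than the other way round.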

The picture above describes the Hadoop Ecosystem, which will be explained in detail in my later blogs. I hope the parallel, distributed concept is clear by now. It will be useful in understanding the architecture of Hadoop.

[A bit of history on Hadoop: Hadoop was created by Doug Cutting, who named it after his son's toy elephant. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. Hadoop is a top-level Apache project, written in the Java programming language, that is built and used by a global community of contributors. Yahoo! has been the largest contributor to the project, and uses Hadoop extensively across its businesses.]


*Cluster: A computer cluster consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system. The components of a cluster are usually connected to each other through fast local area networks, with each node (a computer used as a server) running its own instance of an operating system. Computer clusters emerged from the convergence of a number of computing trends, including the availability of low-cost microprocessors, high-speed networks, and software for high-performance distributed computing.

[Source: Wikipedia [Hadoop History] and Google]




