
Hadoop MapReduce

Introduction:

Hadoop MapReduce is the processing component of Apache Hadoop, with which we can process the big data stored in HDFS.
The basic definition of MapReduce is that it is a programming model, or software framework, which processes large amounts of data in parallel on large clusters of commodity hardware.

Why MapReduce?

The big data stored in HDFS is not stored in a traditional fashion. The data is divided into chunks which are stored on the respective data nodes; it is not kept in one centralized location. Hence a native client application cannot process the data right away, and we need a specialized framework that processes the data stored on nodes across different machines and brings back the results.

The two biggest advantages of Hadoop MapReduce:


  1. Parallel Processing:

In MapReduce we divide the data and process it on multiple machines, and the processing is done in parallel across all of them. Hence the time needed to process the whole of the big data is comparatively less than with traditional systems.

  2. Data Locality:

In MapReduce we move the processing unit, or processing logic, to the data instead of moving the data to the processing unit. Moving huge data to the processing unit consumes a lot of network bandwidth and is very costly as well.

MapReduce Algorithm:

Generally, a MapReduce program executes in three stages: the Map stage, the Shuffle and Sort stage, and the Reduce stage.

  • Map stage:  The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in HDFS. The input file is passed to the mapper line by line. The map stage processes the data and divides it into small chunks.
  • Reduce stage:  This stage is the combination of the Shuffle and Sort stage and the Reduce stage proper. It takes the output of the mapper function, processes it, and produces the output, which is stored in HDFS.

MapReduce analysis:


Generally, a MapReduce task works on key-value pairs. A map takes its input as a key-value pair (K1,V1) and gives its output as a list of key-value pairs, List(K2,V2). This list then goes through a shuffle phase, and an input of a key with a list of values, (K2,List(V2)), is given to the reducer. Finally the reducer gives its output as a list of key-value pairs, List(K3,V3).


An example to understand the key-value concept the MapReduce way:


Here we have an input file containing lines of data, which is the input to a MapReduce program or task. The input is divided into several inputs, which is known as input splitting, on the basis of the newline character. The first line split is (Deer Bear River), the second line split is (Car Car River) and the third line split is (Deer Car Bear). Here the key-value pair is the combination of the line offset and the whole line (K1,V1).
The next phase is the mapping phase. The mapping function creates the list of key-value pairs, List(K2,V2). For the first split the output will be (Deer,1) (Bear,1) (River,1); the value 1 says that the key itself has a count of 1.
The next phase is the shuffling phase. Here a list is prepared for every key, for example (Bear,(1,1)); the values for each key are gathered into a list (K2,List(V2)). (Bear,(1,1)) says that there are two occurrences of Bear.
Next is the reducing phase. Here the values are aggregated for every key, for example (Bear,2). The final result is sent back to the client as a list of (K3,V3) key and aggregated value pairs.
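
Putting the phases together for this input, the end-to-end flow looks roughly like this:

  Input splits  :  (Deer Bear River)           (Car Car River)            (Deer Car Bear)
  Map output    :  (Deer,1)(Bear,1)(River,1)   (Car,1)(Car,1)(River,1)    (Deer,1)(Car,1)(Bear,1)
  Shuffle/sort  :  (Bear,(1,1))  (Car,(1,1,1))  (Deer,(1,1))  (River,(1,1))
  Reduce output :  (Bear,2)  (Car,3)  (Deer,2)  (River,2)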

Hadoop MapReduce Wordcount program example:

Let's see a practical example which shows the key-value pair concept in MapReduce.
This program is written in Java using Eclipse. Make sure you add all the libraries required to run a Hadoop MapReduce program, either in Eclipse or from the Unix shell. Here I have installed Hadoop in VirtualBox on an Ubuntu OS; Eclipse is also available on Ubuntu.
Now, before we go into the program, let's see some of the details that we should know before writing a Wordcount program.
The input datasets that are fed to a MapReduce program, and the output datasets that come out of it, can only be of certain formats.

Input Format:


Here InputFormat is the superclass, and under it we have three types of formats: FileInputFormat, ComposableInputFormat and DBInputFormat. FileInputFormat is the commonly used input format. Under FileInputFormat there are several subtypes, of which TextInputFormat is the most commonly used; it reads the input file line by line.
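
The input format is set on the job in the driver program. A minimal sketch, assuming a Job object named job as in the Driver section below:

  // requires: import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  // Select TextInputFormat explicitly (it is also the default input format).
  // It feeds the mapper (LongWritable byte offset, Text line) pairs.
  job.setInputFormatClass(TextInputFormat.class);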

Output Format:

 
Here we have OutputFormat as the superclass. FileOutputFormat is one of its subclasses and is the one generally used. TextOutputFormat is the commonly used OutputFormat, which writes to a text file line by line.
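
Similarly, a minimal sketch for selecting the output format on the same job object:

  // requires: import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
  // Select TextOutputFormat explicitly (it is also the default output format).
  // It writes each (key, value) pair as key<TAB>value on its own line.
  job.setOutputFormatClass(TextOutputFormat.class);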

Packages to import:


Below are the packages that need to be imported before starting a Wordcount program. Some of them are Java packages, some are present in hadoop-common.jar, and some are present in hadoop-mapreduce-client-core.jar. Both jar files ship with the Hadoop distribution and need to be added to the program's build path.
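
A typical set of imports for this Wordcount program, assuming the newer org.apache.hadoop.mapreduce API is used, looks like this:

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;                    // hadoop-common.jar
  import org.apache.hadoop.fs.Path;                               // hadoop-common.jar
  import org.apache.hadoop.io.IntWritable;                        // hadoop-common.jar
  import org.apache.hadoop.io.LongWritable;                       // hadoop-common.jar
  import org.apache.hadoop.io.Text;                               // hadoop-common.jar
  import org.apache.hadoop.mapreduce.Job;                         // hadoop-mapreduce-client-core.jar
  import org.apache.hadoop.mapreduce.Mapper;                      // hadoop-mapreduce-client-core.jar
  import org.apache.hadoop.mapreduce.Reducer;                     // hadoop-mapreduce-client-core.jar
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;   // hadoop-mapreduce-client-core.jar
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; // hadoop-mapreduce-client-core.jar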
 


Mapper Class:


In order to create a map class we need to extend a superclass, the Mapper class, which is defined in the Hadoop package. This Mapper class takes 4 type arguments of particular types: the input key is LongWritable, the input value is Text, the output key is Text and the output value is IntWritable.
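
As a sketch, with WordCountMapper as a class name of my own choosing, the declaration looks like this:

  public class WordCountMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
      // type parameters: <input key, input value, output key, output value>
  }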

Reducer Class:


Here the reduce class needs to extend the superclass Reducer. The Reducer class also takes 4 type arguments of particular types: the input key is Text, the input value is IntWritable, the output key is Text and the output value is IntWritable.
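
Again as a sketch, with WordCountReducer as a class name of my own choosing:

  public class WordCountReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
      // type parameters: <input key, input value, output key, output value>
  }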

Let's get into the actual Wordcount program and see how we can execute it.

Mapper class:


Inside the Mapper class we need to write a map function. One of the classes used in the map function is Context, which is used to write the output of the map function. Here we convert the value to a string and assign it to a StringTokenizer object, which breaks the input line into string tokens. context.write actually writes the map output, which is the broken-out word together with the value 1. The map function is executed for every line of the input file.
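
Filling in the WordCountMapper class sketched earlier, the map function might look like this (assuming the imports listed above):

  public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      // Executed once for every line of the input file.
      @Override
      public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          // Convert the value (the line) to a string and break it into tokens.
          StringTokenizer tokenizer = new StringTokenizer(value.toString());
          while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              // Write the map output: the word and the value 1.
              context.write(word, one);
          }
      }
  }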

Reducer Class:


Here we need to write a reduce function. One of the parameters of the reduce function is an Iterable of values, which means the input values are passed in as a list. The for loop runs over all the values present in the values parameter, and the summation is stored in the variable sum. Context is used here to write the output, which has the key and the count of its occurrences.
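
Filling in the WordCountReducer class sketched earlier, the reduce function might look like this:

  public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

      // Executed once per key, with all of that key's values passed in as an Iterable.
      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          // The for loop runs over every value in the list and accumulates the sum.
          for (IntWritable value : values) {
              sum += value.get();
          }
          // Write the output: the key and the count of its occurrences.
          context.write(key, new IntWritable(sum));
      }
  }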

Driver Class:




The driver class is the one actually driving, or executing, the entire MapReduce program. We need to define a Configuration object, with which we define the configuration of the Wordcount example. Next, we need to define a Job object that gets executed on the Hadoop cluster; we pass the Configuration object as the 1st parameter and the name of the MapReduce program as the 2nd parameter.
We need to set the jar by class, which takes as its argument the main class of the driver. In the same way we set the Mapper and Reducer classes, and we need to set the output key class and the output value class as well.
We need to add the input path, which takes its argument from the command line arguments supplied while executing the program, and the output path as well, as in the sketch below.
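
A sketch of such a driver class, assuming the WordCountMapper and WordCountReducer classes above (the class name WordCount and the job name are my own choices):

  public class WordCount {

      public static void main(String[] args) throws Exception {
          // Configuration object that holds the configuration of the Wordcount job.
          Configuration conf = new Configuration();

          // Job object that gets executed on the Hadoop cluster:
          // 1st parameter is the Configuration, 2nd is the name of the program.
          Job job = Job.getInstance(conf, "wordcount");

          // The class whose enclosing jar should be shipped to the cluster.
          job.setJarByClass(WordCount.class);

          // Mapper and Reducer classes.
          job.setMapperClass(WordCountMapper.class);
          job.setReducerClass(WordCountReducer.class);

          // Output key and output value classes.
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          // Input and output paths taken from the command line arguments.
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));

          // Submit the job and wait for it to finish.
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }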

Now let's see how we can execute this program in the Unix shell. Before that, export this program into a jar file and give the jar file a name.
To run a Hadoop jar file, the command has the form:
hadoop jar <jar file> <driver class> <input path> <output path>
The input and output file paths should be HDFS file paths.
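
For example, if the exported jar were named wordcount.jar and the driver class WordCount (both hypothetical names), the command would look like:

  hadoop jar wordcount.jar WordCount /user/hadoop/wordcount/input /user/hadoop/wordcount/output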
 
 


Now let's see the content of the output file: it lists each word together with its aggregated count, one key-value pair per line.
 


So this is a sample MapReduce program: the Wordcount application.

