
Run your first Big Data project using Hadoop and Docker in less than 10 Minutes!

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.

In this guide, we will learn how to run our first project using Hadoop and Docker. I will leave the example code at the end of the guide. So, let’s start!

Set Up the Environment

First, we need to prepare our environment. Make sure you have Docker and Docker Compose installed on your machine.

  1. Clone this repository: git clone https://github.com/big-data-europe/docker-hadoop.
  2. Build and run the containers from inside the cloned directory: docker-compose up -d. The -d flag runs the containers in the background.
  3. To check if the containers are running: docker ps.
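If you prefer to run the whole setup in one go, here is a minimal sketch of the same steps. The container names (namenode, datanode, and so on) come from the repository's docker-compose.yml, so I am assuming the defaults have not been changed:

# Clone the repository and enter it
git clone https://github.com/big-data-europe/docker-hadoop
cd docker-hadoop

# Start all containers in the background (-d)
docker-compose up -d

# Verify that the containers are up; the namenode should be among them
docker ps --format "{{.Names}}"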

To the Container World

Now, we want to move our files into the container and configure everything there. Of course, you could configure a volume mount instead, and then there would be no need to move any files. Here, I am assuming you have no prior knowledge of docker-compose, so we will copy the files into the container.

  1. Create a directory for the input files and the code, e.g., mkdir WordCount, and place input.txt and WordCount.java (both provided in the appendix) inside it.
  2. Copy the input file into the namenode container: docker cp input.txt namenode:/tmp/.
  3. Copy the code into the namenode container: docker cp WordCount.java namenode:/tmp/.
  4. Open a shell in the namenode container: docker exec -it namenode /bin/bash. You can now run HDFS commands.
  5. Navigate to the directory where the code is located: cd /tmp/. For organization purposes, you can create a directory for the code and move the code into it: mkdir WordCount and mv WordCount.java WordCount/, then cd WordCount. You can also create an output directory for the compiled classes: mkdir classes. The whole sequence is sketched below.
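Run from the host (steps 1–3) and then inside the container (step 5), the sequence looks like this:

# On the host: copy the input file and the source code into the namenode container
docker cp input.txt namenode:/tmp/
docker cp WordCount.java namenode:/tmp/

# Open a shell inside the namenode container
docker exec -it namenode /bin/bash

# Inside the container: organize the files
cd /tmp/
mkdir WordCount && mv WordCount.java WordCount/
cd WordCount && mkdir classes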

To the Hadoop World

Next, we set up the Hadoop environment by exporting the classpath and creating the required HDFS directories for the code and the input files.

  1. Export the HADOOP_CLASSPATH: export HADOOP_CLASSPATH=$(hadoop classpath).
  2. Create the required directories in the HDFS:
    1. Create the root directory for this project: hadoop fs -mkdir /WordCount.
    2. Create the directory for the input files: hadoop fs -mkdir /WordCount/Input.
    3. Copy the input file from the container into HDFS: hadoop fs -put /tmp/input.txt /WordCount/Input.
You can open the HDFS web UI at http://localhost:9870.
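Collected together, and still inside the namenode container (the /WordCount path is just the project name used throughout this guide; any name works):

# Make the Hadoop jars visible to the Java compiler later on
export HADOOP_CLASSPATH=$(hadoop classpath)

# Create the project and input directories in HDFS
hadoop fs -mkdir /WordCount
hadoop fs -mkdir /WordCount/Input

# Upload the input file from the container's filesystem into HDFS
hadoop fs -put /tmp/input.txt /WordCount/Input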

Compile your Code

You don’t have to install Java; the Docker image already has it installed, and we will use it within the container only.

  1. Compile the code: javac -classpath $HADOOP_CLASSPATH -d ./classes/ ./WordCount.java. The -d flag specifies the directory where the compiled classes will be written, and ./ refers to the current directory.
  2. Package the compiled classes into a jar file: jar -cvf WordCount.jar -C ./classes/ . (the trailing . tells jar to include everything inside the classes directory).
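As one sequence, run from the directory that holds WordCount.java (assuming the layout created above):

# Compile against the Hadoop libraries; .class files go into ./classes/
javac -classpath $HADOOP_CLASSPATH -d ./classes/ ./WordCount.java

# Bundle everything under ./classes into WordCount.jar
jar -cvf WordCount.jar -C ./classes/ .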

Run your First Job

To run the job: hadoop jar WordCount.jar WordCount /WordCount/Input /WordCount/Output. The arguments are the jar file, the main class, the HDFS input directory, and the HDFS output directory. The output directory must not exist yet; Hadoop creates it and fails if it is already there.

You can find your HDFS files at http://localhost:9870/explorer.html#/.
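When the job finishes successfully, the output directory in HDFS will contain a _SUCCESS marker plus one part file per reducer. You can verify this from inside the container:

# List the generated files in HDFS
hadoop fs -ls /WordCount/Output
# Expect a _SUCCESS file and a part-r-00000 file holding the results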

Check the Output

  1. Check if the job was successful: hadoop job -list all. The job name is the second column of the output and the status is the third column.
  2. To check the output: hadoop fs -cat /WordCount/Output/*. This will display the output in the terminal. To save the output to a file inside the container: hadoop fs -cat /WordCount/Output/* > /tmp/output.txt.
  3. To retrieve the saved output file from the container, run on the host: docker cp namenode:/tmp/output.txt .
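The whole check, as a short sketch (with /tmp/output.txt as an assumed location for the saved copy):

# Inside the container: print the result and save a copy
hadoop fs -cat /WordCount/Output/*
hadoop fs -cat /WordCount/Output/* > /tmp/output.txt

# Back on the host: pull the file out of the container
docker cp namenode:/tmp/output.txt .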

Appendix

You can use the following files for testing.

  • Create a file called WordCount.java with the following content:
// Java imports
import java.io.IOException;
import java.util.StringTokenizer;

// Hadoop imports
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// WordCount class
public class WordCount {

    // This is the mapper class
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1); // The count emitted for each word
        private Text word = new Text(); // To store the word

        // This is the map function
        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString()); // Tokenize the input
            // For each word, emit the word and 1
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // This is the reducer class
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable(); // To store the sum for each word

        // This is the reduce function
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
            int sum = 0; // Accumulates the count for this word
            // For each occurrence, add the count
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum); // Set the result
            context.write(key, result); // Emit the word and the sum
        }
    }

    // This is the main function
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // Create a new configuration
        Job job = Job.getInstance(conf, "word count"); // Create a new job
        job.setJarByClass(WordCount.class); // Set the jar by class
        job.setMapperClass(TokenizerMapper.class); // Set the mapper class
        job.setCombinerClass(IntSumReducer.class); // Set the combiner class
        job.setReducerClass(IntSumReducer.class); // Set the reducer class
        job.setOutputKeyClass(Text.class); // Set the output key class
        job.setOutputValueClass(IntWritable.class); // Set the output value class
        FileInputFormat.addInputPath(job, new Path(args[0])); // Set the input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // Set the output path
        System.exit(job.waitForCompletion(true) ? 0 : 1); // Wait for the job to complete
    }
}
  • Create a file called input.txt with the following content:
Mostafa
Wael
Mostafa
Kamal
Wael
Mohammed
Mohammed
Mostafa
Kamal
Wael
Mostafa
Mostafa

This example counts the occurrences of each word in input.txt.
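With the input.txt above, the job should produce one tab-separated line per distinct word. Because a single reducer receives its keys in sorted order, the expected output is:

Kamal	2
Mohammed	2
Mostafa	5
Wael	3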

Don’t forget to hit the Clap and Follow buttons to help me write more articles like this.
