We now want to apply further "classical" big data technologies in the Hadoop context. MapReduce, running on top of HDFS, is the second Hadoop component that we want to get to know in more detail.
Understanding how to execute a MapReduce job in the DataProc environment.
Open the Jupyter overview site and click on "Running".
Click on "Shutdown":
We will not learn how to write MapReduce code in this course; MapReduce is no longer state-of-the-art in big data analytics anyway. However, big data analytics has its roots in MapReduce, and many technologies still rely on or extend its fundamental ideas of parallel processing.
In the course's library folder, you'll find a file "WordCount.java". It holds the source code of a MapReduce program that counts the occurrences of each word in an input text file. This is the "hello world" of big data batch processing.
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // split the input line into words (tokens delimited by whitespace)
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            // emit the key-value pair (word, 1) for every token
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
The mapper iterates through all words (delimited by whitespace) in its input line and emits a "one" for each word.
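For illustration: for the input line "to be or not to be", the mapper would emit the pairs (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1). Hadoop then groups these pairs by key, so the reducer below receives, for example, the key "to" together with the list of values [1, 1].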
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // sum up all the "ones" emitted for this key (word)
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
The reducer sums up all the "ones" for each key (i.e. each distinct word in the text file) and hence counts its number of occurrences.
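For completeness: the mapper and reducer are wired together by a driver, i.e. a main method that configures the job and takes the input and output paths from the job arguments. WordCount.java contains such a driver; a minimal sketch, following the standard Hadoop WordCount example (the exact code in the provided file may differ slightly):

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // create a job and point it to the mapper/reducer classes above
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregates counts on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // input file and output folder are passed as job arguments
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

This is also where the arguments you will specify when submitting the job end up: the first argument is the input file, the second one the output folder.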
I have already compiled WordCount.java into the provided wordcount.jar. This jar file needs to be accessible to DataProc (jar files are always shipped with the job submission, "moving the code to the data"). The simplest way is to put the file into the topmost folder of the bucket "pk-gcs". Perform this step.
Your bucket should look like this afterwards:
Please go to the "Jobs" section in DataProc:
Hit "Submit job":
Set the following parameters (leave the default for all others):
This is the job submission screen (in case yours looks different, please scroll down).
Leave the rest at the defaults and submit the job. Job execution should take around 30 seconds; when it has finished, you should see a green arrow:
In the GCP cloud console, navigate to the DataProc cluster's "Web interfaces" tab:
Select "HDFS NameNode". This opens the web UI of the NameNode (which runs on the master node of our cluster). Under "Utilities" you'll find a (rudimentary) browsing tool for the HDFS file system.
You should see a new folder "output_wordcount" in the /tmp folder:
Within this folder, there are multiple result ("part") files, one per reducer, typically named part-r-00000, part-r-00001, and so on (depending on the cluster configuration, the number of output files can differ). You can download and inspect one part file (the download, however, might not work):
Congratulations, you have now gotten in touch with a "classical" big data processing system: MapReduce.