We now want to apply further "classical" big data technologies in the Hadoop context. MapReduce, running on top of HDFS, is the second Hadoop component that we want to get to know in more detail.
Understanding how to execute a MapReduce job in the DataProc environment.
Open the Jupyter overview site and click on "Running".
Click on "Shutdown":
We will not learn how to write MapReduce code in this course; MapReduce is no longer state-of-the-art in big data analytics anyway. However, big data analytics has its roots in MapReduce, and many technologies still rely on or extend its fundamental ideas of parallel processing.
In the course's library folder, you'll find a file "WordCount.java". It holds the source code of a MapReduce program that counts the occurrences of each word in an input text file. This is the "hello world" of big data batch processing.
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // split the input line into words (tokens delimited by whitespace)
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            // emit the key-value pair (word, 1) for every token
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
The mapper iterates through all words (delimited by whitespace) in its input line and emits a "one" for each word.
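For illustration: for the input line "to be or not to be", the mapper would emit the pairs (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1). Hadoop then groups these pairs by key, so the reducer below receives, for example, the key "to" together with the list of values [1, 1].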
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // sum up all the "ones" emitted for this key (word)
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
The reducer sums up all the "ones" for each key (i.e. each distinct word in the text file) and hence counts its number of occurrences.
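For completeness: the mapper and reducer are wired together by a driver, i.e. a main method that configures the job and takes the input and output paths from the job arguments. WordCount.java contains such a driver; a minimal sketch, following the standard Hadoop WordCount example (the exact code in the provided file may differ slightly):

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // create a job and point it to the mapper/reducer classes above
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregates counts on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // input file and output folder are passed as job arguments
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

This is also where the arguments you will specify when submitting the job end up: the first argument is the input file, the second one the output folder.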
I have already compiled WordCount.java into the provided wordcount.jar. This jar file needs to be accessible to DataProc (jar files are always shipped with the job submission, "moving the code to the data"). The simplest way is to put the file into the topmost folder of the bucket "pk-gcs". Perform this step.
Your bucket should look like this afterwards:
Please go to the "Jobs" section in DataProc:
Hit "Submit job":
Set the following parameters (leave the default for all others):
This is the job submission screen (in case yours looks different, please scroll down).
Leave the rest at the defaults and submit the job. Job execution should take around 30 seconds; when it has finished, you should see a green arrow:
In the GCP cloud console, navigate to the DataProc cluster's "Web interfaces" tab:
Select "HDFS NameNode". This opens the web UI of the NameNode (which runs on the master node of our cluster). Under "Utilities" you'll find a (rudimentary) browsing tool for the HDFS file system.
You should see a new folder "output_wordcount" in the /tmp folder:
Within this folder, there are multiple result ("part") files, one per reducer, typically named part-r-00000, part-r-00001, and so on (depending on the cluster configuration, the number of output files can differ). You can download and inspect one part file (the download, however, might not work):
Congratulations, you have now gotten in touch with a "classical" big data processing system: MapReduce.