We now want to get acquainted with the "classical" big data technologies in the Hadoop context. HDFS, Hadoop's storage layer for big data, is the first component we want to get to know.

Goal

Understanding HDFS in the DataProc environment and interacting with the filesystem in a basic manner.

What you'll implement

Cluster Creation

Please create the cluster in the same way as described in the lab "Fundamentals: Batch Processing".
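If you prefer the command line over the cloud console, the cluster can also be created with the gcloud CLI. The following is only a minimal sketch: the cluster name, region, and machine types are placeholders and should match whatever you used in the "Fundamentals: Batch Processing" lab.

gcloud dataproc clusters create my-hdfs-cluster \
    --region=europe-west1 \
    --num-workers=2 \
    --master-machine-type=n1-standard-2 \
    --worker-machine-type=n1-standard-2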

Results

  1. You configured and started a Hadoop cluster which is ready for using HDFS.

Connecting to the cluster's master node

HDFS is running in our cluster. Unlike cloud storage, this service is tightly bound to the DataProc cluster and is, in particular, gone once we delete the cluster. Thus, we'll only use HDFS for one-time demonstration purposes. Our "long-term" data lake, for example, will be based on cloud storage, since it can be accessed by all services in GCP.
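As a side note: DataProc clusters ship with the Cloud Storage connector, so the same "hadoop fs" commands can usually also address a bucket directly via the gs:// scheme. The bucket name below is a placeholder for one of your own buckets.

hadoop fs -ls gs://your-bucket-name/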

When you click on "VM instances", you can see that the three virtual machines are running.

Click on "SSH" next to the master node. A window should pop up, and after a few seconds you should be connected to the shell of the Hadoop master node machine.
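Alternatively, you can open an SSH session from your local terminal with the gcloud CLI. This assumes the default DataProc naming scheme, in which the master node is called <cluster-name>-m; adjust the cluster name and zone to your setup.

gcloud compute ssh my-hdfs-cluster-m --zone=europe-west1-b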

Listing the top-most directory in our cluster HDFS

All commands to interact with HDFS start with "hadoop fs". You can list the top-most directory ("/") of our HDFS with the following command, which should show three directories (hadoop, tmp, and user):

hadoop fs -ls /

This should be the output:
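If you want to explore a bit further, "hadoop fs" offers many more subcommands. A few that are safe to try because they only read data and metadata:

hadoop fs -help
hadoop fs -ls -R /user
hadoop fs -df -h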

Uploading a text file

Our goal is to upload the text file "lorem_ipsum.txt" located in the datasets folder (in the course materials) to HDFS.

Upload a text file to the master virtual machine

Let's start with our workaround and first upload the file to the master node. Go to the pop-up window containing the shell, click on the small gear icon, and select "Upload file".

Select the lorem_ipsum.txt file that you downloaded from the course materials folder to your local machine. When the upload has finished, you can enter "ls" in the shell and check that the file shows up:
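If the upload via the gear icon does not work for you, copying the file from your local machine with the gcloud CLI is a possible alternative; the cluster name and zone are placeholders, and the command assumes the file lies in your current local directory.

gcloud compute scp lorem_ipsum.txt my-hdfs-cluster-m:~ --zone=europe-west1-b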

Load the text file from the master into HDFS

Next, let's upload the file to the HDFS temporary directory "/tmp":

hadoop fs -put lorem_ipsum.txt /tmp
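A short remark on this command: "hadoop fs -put" copies a file from the node's local filesystem into HDFS ("hadoop fs -copyFromLocal" behaves the same way), while "hadoop fs -get" copies in the opposite direction. For example, to copy the file back out of HDFS into the local working directory under a new, freely chosen name:

hadoop fs -get /tmp/lorem_ipsum.txt lorem_ipsum_copy.txt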

Interact further with HDFS

Check if the upload worked with this command:

hadoop fs -ls /tmp

You can take a look at the last lines of the file with this command (it is a normal text file with some made-up text):

hadoop fs -tail /tmp/lorem_ipsum.txt

Your shell should look like this:
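Before closing the window, you could optionally also look at the file's HDFS metadata, for example its replication factor, block size, size in bytes, and name (the format string below uses the standard "hadoop fs -stat" placeholders):

hadoop fs -stat "%r %o %b %n" /tmp/lorem_ipsum.txt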

You can close the SSH window now.

Results

  1. You connected via SSH to the DataProc (Hadoop) cluster's master node.
  2. You uploaded a file from your system to this master node's (local) filesystem as a workaround.
  3. You uploaded a file from the master node's local filesystem to HDFS via the hadoop shell command "hadoop fs -put".
  4. You interacted with HDFS via the hadoop shell commands (ls and tail).

Connecting to the NameNode's WebUI

In the GCP cloud console, navigate to the DataProc cluster's tab "Web interfaces":

Select "HDFS NameNode". The WebUI of the NameNode (which is running on the master node of our cluster) opens:

There you can see, for example, how many active DataNodes participate in our HDFS cluster:
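If you still have an SSH session to the master node open, roughly the same information is also available on the command line via the HDFS admin report. Depending on the cluster setup, this command may require HDFS superuser rights; in that case, prefixing it with "sudo -u hdfs" should help.

hdfs dfsadmin -report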

Under Utilities you'll find a (rudimentary) browsing tool for the HDFS filesystem:

You should also see the uploaded file "lorem_ipsum.txt" there:

Results

  1. You learned how to use the NameNode's WebUI which provides some information about the HDFS cluster.
  2. You interacted with HDFS from the NameNode's WebUI, especially showing folders and files.

Important: If you want to continue with the next lab right away, you can skip this step and delete the cluster later.

Reasoning

Within the cloud environment you pay per use. Since the DataProc cluster costs roughly $3 per day, we now want to tear it down. Unfortunately, DataProc does not allow "stopping" the cluster (unlike Cloud SQL). Thus, we need to delete the cluster to avoid unnecessary costs.

Shutting down the DataProc Cluster

Go to the cluster details page and hit "Delete":
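Deleting the cluster can also be done from the command line; the cluster name and region below are placeholders and must match your own cluster.

gcloud dataproc clusters delete my-hdfs-cluster --region=europe-west1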

Congratulations, you have now gotten hands-on experience with a "classical" big data filesystem: HDFS.