We now want to get to know the "classical" big data technologies in the Hadoop ecosystem. The first Hadoop component we will look at is HDFS, the distributed filesystem for big data storage.
Understanding HDFS in the DataProc environment and interacting with the filesystem in a basic manner.
Please create the cluster as described in the lab "Fundamentals: Batch Processing".
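If you prefer the command line over the console wizard, a cluster can also be created with the gcloud CLI. This is just a sketch; cluster name, region, and worker count are placeholders and should mirror the settings from that lab:
gcloud dataproc clusters create <cluster-name> --region=<region> --num-workers=2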
HDFS is running in our cluster. Unlike Cloud Storage, this service is tightly bound to the DataProc cluster and is, in particular, gone when we delete the cluster. We will therefore use HDFS only for one-time demonstration purposes. Our "long-term" data lake, for example, will be based on Cloud Storage, since it can be accessed by all services in GCP.
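As a side note: DataProc clusters come with a Cloud Storage connector preinstalled, so the same "hadoop fs" commands also work on "gs://" paths. A minimal sketch, assuming a bucket you have created yourself (the bucket name is a placeholder):
hadoop fs -ls gs://<your-bucket>/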
When you click on "VM instances", you can see that the cluster's three virtual machines are running.
Click on "SSH" next to the master node. A window should pop up and after a few seconds, you should be connected to the Hadoop master node machine's shell.
All commands for interacting with HDFS start with "hadoop fs". You can list the top-most directory ("/") of our HDFS with the following command, which shows three directories (hadoop, tmp, and user):
hadoop fs -ls /
This should be the output:
Our goal is to upload the text file "lorem_ipsum.txt" located in the datasets folder (in the course materials) to HDFS.
Since we cannot upload the file from our local machine to HDFS directly, we will use a workaround and first upload it to the master node. Go to the pop-up window containing the shell, click on the small gear icon, and select "Upload file".
Select the file lorem_ipsum.txt, which you downloaded from the course materials folder, on your local machine. When the upload has finished, you can enter "ls" in the shell and check whether the file shows up:
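As a side note, instead of the browser upload you could also copy the file to the master node with the gcloud CLI, e.g. from the Cloud Shell or a local terminal (a sketch; instance name and zone are placeholders):
gcloud compute scp lorem_ipsum.txt <master-node-name>:~ --zone=<zone>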
Next, let's copy the file from the master node's local disk into the HDFS directory "/tmp":
hadoop fs -put lorem_ipsum.txt /tmp
Check if the upload worked with this command:
hadoop fs -ls /tmp
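Optionally, as a small variation (just a sketch; "my_demo" is an example name), you could also create your own subdirectory in HDFS and put the file there:
hadoop fs -mkdir -p /tmp/my_demo
hadoop fs -put lorem_ipsum.txt /tmp/my_demo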
You can print the contents of the file with this command (it is a normal text file with some made-up text):
hadoop fs -cat /tmp/lorem_ipsum.txt
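"hadoop fs -cat" prints the entire file. Optionally, if you only want to see the end of a (potentially much larger) file, "hadoop fs -tail" prints its last kilobyte:
hadoop fs -tail /tmp/lorem_ipsum.txt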
Your shell should look like this:
You can close the SSH window now.
In the GCP Cloud Console, navigate to the DataProc cluster's tab "Web interfaces":
Select "HDFS NameNode". The web UI of the NameNode (which runs on the master node of our cluster) opens:
There you can see, for example, how many active DataNodes participate in our HDFS cluster:
Under Utilities you'll find a (rudimentary) browsing tool for the HDFS filesystem:
You should also see the uploaded file "lorem_ipsum.txt" there:
Important: if you want to continue directly with the next lab, you can skip this step and delete the cluster later.
In the cloud environment you pay per use. Since the DataProc cluster costs roughly $3 per day, we now want to tear it down. Unfortunately, DataProc does not offer a way to "stop" the cluster (as Cloud SQL did), so we need to delete the cluster to avoid further costs.
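If you prefer the command line, the cluster could also be deleted with the gcloud CLI, e.g. from the Cloud Shell (a sketch; cluster name and region are placeholders):
gcloud dataproc clusters delete <cluster-name> --region=<region>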
Go to the cluster details page and hit "Delete":
Congratulations, you have now gotten hands-on experience with a "classical" big data filesystem: HDFS.