This is again our use case: a simple webshop for which we want to allow data scientists to access the data lake.
Loading example source data into cloud storage and accessing it from a Jupyter Notebook.
Please go to the storage browser:
Create a bucket:
Call it, for example, "your initials-gcs":
Leave all other settings at their defaults. However, please click "Continue" and examine the options (the "default storage class" in particular has a huge effect on storage costs).
In the bucket details you can create a new folder called "webshop_datalake":
Upload the file "webshop_history.csv" from the folder "Datasets" in the course materials.
Your bucket's details should now look like this:
Please make sure that the following settings are in place:
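By the way, instead of clicking through the console, the bucket creation and the upload could also be scripted. The following is a minimal sketch using the google-cloud-storage Python client; the bucket name, region, and local file path are placeholders you would need to adapt, and it assumes you are authenticated (e.g. via "gcloud auth application-default login"):

from google.cloud import storage

client = storage.Client()

# Bucket names are globally unique, so replace "pk-gcs" with your own name.
bucket = client.bucket("pk-gcs")
bucket.storage_class = "STANDARD"  # the default storage class drives storage costs
client.create_bucket(bucket, location="europe-west1")  # placeholder region

# "Folders" in GCS are just prefixes of object names.
blob = bucket.blob("webshop_datalake/webshop_history.csv")
blob.upload_from_filename("Datasets/webshop_history.csv")  # local path in the course materials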
Please navigate to DataProc, which is GCP's Hadoop product:
Hit "create cluster":
Select "Cluster on Compute Engine":
You may want to use the following parameters:
Scroll further down and enable the checkbox "Component gateway":
Then, under "Optional components", select "Jupyter Notebook":
Leave the rest as default and click on "Create".
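For reference, the same cluster could also be created programmatically. The following is a rough sketch using the google-cloud-dataproc client library; the project ID, region, cluster name, and machine types are placeholders, and field names may differ slightly between library versions:

from google.cloud import dataproc_v1

project_id = "your-project-id"  # placeholder
region = "europe-west1"         # placeholder

# the client has to talk to the regional DataProc endpoint
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "pk-dataproc",  # placeholder
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-2"},
        "software_config": {"optional_components": ["JUPYTER"]},
        "endpoint_config": {"enable_http_port_access": True},  # the "Component gateway"
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is up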
GCP now spins up three virtual machines which together form a so-called Hadoop cluster. One machine is the master (a kind of coordinating unit) and two machines are workers. If you need more computational power, you would just add further worker nodes. We will learn how to use this kind of parallel big data processing later. For now, we just want to get to the Jupyter Notebook and access the data lake - later we'll use the notebook to run big data computations with Hadoop (and Spark).
Click onto the link to your cluster:
When clicking on "VM instances" one can see that the three virtual machines are running. We will later use this to connect to the master node.
For now, we only want to use the web interface of the Jupyter notebook. Click on "Web Interfaces" and "Juypter":
You should see a Jupyter Notebook environment:
Let's create a notebook in Cloud Storage, i.e. navigate to GCS and click "New" → "PySpark" (although we will not use the "big data tool" Spark yet, let's check that this works):
Afterwards, you should be able to access the data lake in another cell (change my bucket "pk-gcs" to your bucket's name). The "gs://" prefix tells pandas to access a bucket in GCP cloud storage (you'll see this access pattern often throughout this course!).
import pandas as pd

# read the CSV file directly from the Cloud Storage bucket (replace "pk-gcs" with your bucket name)
df = pd.read_csv("gs://pk-gcs/webshop_datalake/webshop_history.csv")
df.head()
You may also want to rename the notebook now (click on "Untitled" and change the name to, e.g., "Datalake Access"). After executing the cells, your notebook should look like this:
A data scientist might now be interested in the mean sales value per product. This would be the classical way to compute it in pandas:
df.groupby("product_name").sales_value.mean()
The output should look like this:
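Since you created the notebook with the PySpark kernel, a SparkSession (the spark object) should already be available. Just as a small preview of the later labs, the same aggregation could also be expressed with Spark; the bucket name is again a placeholder:

# the same computation with Spark instead of pandas
sdf = spark.read.csv(
    "gs://pk-gcs/webshop_datalake/webshop_history.csv",
    header=True,       # the first line contains the column names
    inferSchema=True,  # let Spark guess the column types
)
sdf.groupBy("product_name").avg("sales_value").show()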
Within the cloud environment, you pay by use. Since the DataProc cluster costs roughly $3 per day, we now want to tear it down. Unfortunately, DataProc does not allow "stopping" a cluster (like Cloud SQL did). Thus, we need to delete the cluster to avoid further costs.
Save your Jupyter Notebook:
Close all Jupyter Notebook browser tabs.
Go to the cluster details page and hit "Delete":
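The teardown could also be scripted with the same DataProc client library; again a minimal sketch with placeholder project, region, and cluster name:

from google.cloud import dataproc_v1

project_id = "your-project-id"  # placeholder
region = "europe-west1"         # placeholder

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

operation = cluster_client.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": "pk-dataproc"}
)
operation.result()  # block until the cluster is deleted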
You may want to make sure that you are not accumulating high running costs in GCP throughout this course. You could check in the Billing product (tomorrow, once the usage data has been processed) whether there really are no running costs.
Click on "Billing" in the menu and hit "Go to linked billing account".
Select "Reports" and set "Group by" to "SKU" (on the right side). De-select "Promotions and others". You should see (few) dollars spent on DataProc depending on how long the instance has been running. DataProc pricing consists of a DataProc fee and the fees for virtual machines ("Compute Engine") and hard-drives (disks) attached to these machines as well as potential costs for network traffic.
Congratulations, you completed the second lab in our big data journey. This lab was still not really about big data analytics, but you used technologies that are capable of holding (cloud storage) and processing (DataProc) very large amounts of data at a reasonable price, i.e. efficiently.