This is again our use case: a simple webshop for which we want to allow data scientists to access the data lake.
Loading example source data into cloud storage and accessing it from a Jupyter Notebook.
Please go to the storage browser:
Create a bucket:
Call it, for example, "your initials-gcs":
Leave all other settings at their defaults. However, please click "Continue" and examine the options (the "default storage class" in particular has a huge effect on storage costs).
In the bucket details you can create a new folder called "webshop_datalake":
Upload the file "webshop_history.csv" from the folder "Datasets" in the course materials.
Your bucket's details should now look like this:
Please make sure that the following settings are in place:
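By the way, instead of clicking through the console, the bucket creation and the upload could also be scripted. The following is a minimal sketch using the google-cloud-storage Python client; the bucket name, region, and local file path are placeholders you would need to adapt, and it assumes you are authenticated (e.g. via "gcloud auth application-default login"):

from google.cloud import storage

client = storage.Client()

# Bucket names are globally unique, so replace "pk-gcs" with your own name.
bucket = client.bucket("pk-gcs")
bucket.storage_class = "STANDARD"  # the default storage class drives storage costs
client.create_bucket(bucket, location="europe-west1")  # placeholder region

# "Folders" in GCS are just prefixes of object names.
blob = bucket.blob("webshop_datalake/webshop_history.csv")
blob.upload_from_filename("Datasets/webshop_history.csv")  # local path in the course materials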
Please navigate to DataProc, which is GCP's Hadoop product:
Hit "create cluster":
Select "Cluster on Compute Engine":
You may want to use the following parameters:
Scroll further down and enable the checkbox "Component gateway":
Then, under "Optional components", select "Jupyter Notebook":
Leave the rest as default and click on "Create".
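For reference, the same cluster could also be created programmatically. The following is a rough sketch using the google-cloud-dataproc client library; the project ID, region, cluster name, and machine types are placeholders, and field names may differ slightly between library versions:

from google.cloud import dataproc_v1

project_id = "your-project-id"  # placeholder
region = "europe-west1"         # placeholder

# the client has to talk to the regional DataProc endpoint
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "pk-dataproc",  # placeholder
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-2"},
        "software_config": {"optional_components": ["JUPYTER"]},
        "endpoint_config": {"enable_http_port_access": True},  # the "Component gateway"
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is up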
GCP now spins up three virtual machines which together form a so-called Hadoop cluster. One machine is the master (a kind of coordinating unit) and two machines are workers. If you need more computational power, you would just add further worker nodes. We will learn how to use this kind of parallel big data processing later. For now, we just want to get to the Jupyter Notebook and access the data lake - later we'll use the notebook to run big data computations with Hadoop (and Spark).
Click onto the link to your cluster:
When clicking on "VM instances" one can see that the three virtual machines are running. We will later use this to connect to the master node.
For now, we only want to use the web interface of the Jupyter notebook. Click on "Web Interfaces" and "Juypter":
You should see a Jupyter Notebook environment:
Let's create a notebook in Cloud Storage, i.e. navigate to GCS and click "New" → "PySpark" (although we will not use the "big data tool" Spark yet, let's check that this works):
Afterwards, you should be able to access the data lake in another cell (change my bucket "pk-gcs" to your bucket's name). The "gs://" prefix tells pandas to access a bucket in GCP cloud storage (you'll see this access pattern often throughout this course!).
import pandas as pd

# read the CSV file directly from the Cloud Storage bucket (replace "pk-gcs" with your bucket name)
df = pd.read_csv("gs://pk-gcs/webshop_datalake/webshop_history.csv")
df.head()
You may also want to rename the notebook now (click on "Untitled" and change the name to, e.g., "Datalake Access"). After executing the cells, your notebook should look like this:
A data scientist might now be interested in the mean sales value per product. This would be the classical way to compute it in pandas:
df.groupby("product_name").sales_value.mean()
The output should look like this:
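Since you created the notebook with the PySpark kernel, a SparkSession (the spark object) should already be available. Just as a small preview of the later labs, the same aggregation could also be expressed with Spark; the bucket name is again a placeholder:

# the same computation with Spark instead of pandas
sdf = spark.read.csv(
    "gs://pk-gcs/webshop_datalake/webshop_history.csv",
    header=True,       # the first line contains the column names
    inferSchema=True,  # let Spark guess the column types
)
sdf.groupBy("product_name").avg("sales_value").show()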
Within the cloud environment, you pay by use. Since the DataProc cluster costs roughly $3 per day, we now want to tear it down. Unfortunately, DataProc does not allow "stopping" a cluster (like Cloud SQL did). Thus, we need to delete the cluster to avoid further costs.
Save your Jupyter Notebook:
Close all Jupyter Notebook browser tabs.
Go to the cluster details page and hit "Delete":
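The teardown could also be scripted with the same DataProc client library; again a minimal sketch with placeholder project, region, and cluster name:

from google.cloud import dataproc_v1

project_id = "your-project-id"  # placeholder
region = "europe-west1"         # placeholder

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

operation = cluster_client.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": "pk-dataproc"}
)
operation.result()  # block until the cluster is deleted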
You may want to make sure that you are not accumulating high running costs in GCP throughout this course. You could check in the Billing product (tomorrow, once the usage data has been processed) whether there really are no running costs.
Click on "Billing" in the menu and hit "Go to linked billing account".
Select "Reports" and set "Group by" to "SKU" (on the right side). De-select "Promotions and others". You should see (few) dollars spent on DataProc depending on how long the instance has been running. DataProc pricing consists of a DataProc fee and the fees for virtual machines ("Compute Engine") and hard-drives (disks) attached to these machines as well as potential costs for network traffic.
Congratulations, you completed the second lab in our big data journey. This lab was still not really about big data analytics, but you used technologies that are capable of holding (cloud storage) and processing (DataProc) very large amounts of data at a reasonable price, i.e. efficiently.