This is our use case: a simple webshop for which we want to provide management with a sales report per region.

Goal

Loading example source data into a relational database (in the cloud) and aggregating it visually in a dashboard.

What you'll implement

Navigate to Compute Engine (GCP's product for managing virtual machines). You should be able to see the running "jump host" that allows us to connect to a MySQL database running on another virtual machine (in the bwcloud).

Please click on SSH to open a shell on this machine.

Connecting to the database server

First, we need to connect to the database. A port forwarding tunnel to the remote machine is already set up. Please enter the following command and, when prompted, the password provided in Moodle:

mysql -h localhost -P 3306 -u hdm -p

You should see this prompt:

Switching to the webshop database

The following commands require some basic SQL knowledge, which is a prerequisite for this course.

Switch to the webshop database by entering the following command into the shell:

use webshop;

Reading the sales table

Select data from the sales table as follows:

select * from sales;

Results

  1. You took a look at one very simple data source for our analytics architecture.

Creating a Jupyter Notebook

Navigate to "Vertex AI" - "Colab Enterprise":

Create a notebook (name it after your last name):

Connect your notebook to a runtime environment:

You should be able to select an existing runtime.

Install pymysql using this command in the first cell:

!pip install pymysql


This should be the output after running the cell:

You should be able to load the sales table's content into a DataFrame:

import pandas as pd
from sqlalchemy import create_engine

DB_USER = 'see Moodle'
DB_PASSWORD = 'see Moodle'
DB_HOST = 'see Moodle'   # internal IP of the MySQL VM
DB_PORT = 3306
DB_NAME = 'see Moodle'

# SQLAlchemy connection URL
connection_url = f"mysql+pymysql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
engine = create_engine(connection_url)

# Read the full sales table into a DataFrame
df = pd.read_sql("SELECT * FROM sales", con=engine)
df

You should see this result (only the first 10 rows are shown):
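
Optionally, you can run a quick sanity check on the loaded DataFrame in a new cell. This is just a sketch; the exact row count depends on the sample data:

# quick sanity check: number of rows/columns and the column data types
print(df.shape)
print(df.dtypes)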

Results

  1. You have now set up the whole (very simple) analytics architecture: one source system which is queried directly from a Jupyter notebook.
  2. Now you are set to implement descriptive analytics requirements, which we'll do in the next section.

Requirements

Your task is to generate a simple overview showing the total sales value per region and product.
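
Before visualizing anything, you can compute the required aggregation directly in pandas. This is only a minimal sketch, assuming the column names region, product_name and sales_value from the sales table:

# aggregate the total sales value per region and product
sales_per_region_product = (
    df.groupby(['region', 'product_name'], as_index=False)['sales_value']
      .sum()
)
sales_per_region_product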

Sample Solution using plotly express

Please install Plotly Express by executing the following command in a new cell:

!pip install plotly-express

We can create a very simple visualization by using a treemap, which sizes each region/product rectangle by its summed sales_value:

import plotly.express as px
px.treemap(df, path=['region', 'product_name'], values='sales_value')

Output:
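
If you prefer a classic chart over a treemap, the same aggregation can be shown as a grouped bar chart. This is only a sketch using the column names from the sales table:

import plotly.express as px

# total sales value per region and product
agg = df.groupby(['region', 'product_name'], as_index=False)['sales_value'].sum()

# one bar per product, grouped by region
px.bar(agg, x='region', y='sales_value', color='product_name', barmode='group')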

Results

  1. Our first very simple data architecture is set up.
  2. This architecture only allows descriptive analytics, provides no historization, and cannot handle real "big data".
  3. For Business Intelligence and Big Data analytics, we'll need more sophisticated architectures.
  4. We will use this database as a data source for our big data analytics. Although the volume is very low, we'll assume that the data sources will grow a lot in volume and/or velocity, i.e. they will require "real" big data architectures.

Congratulations, you succeeded in the first lab of our big data journey. This lab did not really touch big data systems; it was intended to deepen your cloud knowledge, take away any fear of opening and using a command shell (which is very simple in GCP), and get you connected to sample source data.