This is again our use case: a simple webshop for which we now want to build a very simple BI architecture, i.e. a data warehouse in the "traditional" sense.

Goal

Load example source data from Cloud Storage into a DWH staging area, transform it into a star schema, and provide the results in a DWH-like manner.

What you'll implement

Please navigate to "Cloud Storage" in the console and create a bucket (name it "hdm-yourlastname", change to your last name):

Click on "Create":

Upload the file webshop_history.csv (see Moodle) to your bucket:
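
For reference, the same steps could also be performed programmatically, e.g. from a notebook. This is a minimal sketch using the google-cloud-storage client library; the region and the local file path are assumptions you would adapt:

from google.cloud import storage

# Placeholders: adapt the bucket name (your last name) and the region
PROJECT_ID = "hdm-bi-bigdata"
BUCKET_NAME = "hdm-yourlastname"

client = storage.Client(project=PROJECT_ID)

# Create the bucket (equivalent to "Create" in the console)
bucket = client.create_bucket(BUCKET_NAME, location="europe-west3")

# Upload the local CSV file into the bucket
bucket.blob("webshop_history.csv").upload_from_filename("webshop_history.csv")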

Data upload and dataset creation

Please go to BigQuery in the console:

If not already selected, please select our project in the "Explorer":

We now want to "Add Data":

Call the dataset wh_yourlastname. Next, select "Create table" using the menu next to your dataset's name:
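
For reference, the dataset could also be created programmatically. A minimal sketch using the google-cloud-bigquery client library; the location "EU" is an assumption and should match your bucket's region:

from google.cloud import bigquery

PROJECT_ID = "hdm-bi-bigdata"
client = bigquery.Client(project=PROJECT_ID)

# Create the dataset (equivalent to "Add Data" -> create dataset in the console)
dataset = bigquery.Dataset(f"{PROJECT_ID}.wh_yourlastname")
dataset.location = "EU"  # assumption: choose the location matching your bucket
client.create_dataset(dataset, exists_ok=True)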

Select "Google Cloud Storage" as data source and set the path to your webshop_history.csv in your bucket:

You can use "Browse" to be able to navigate to your bucket:

Navigate to the webshop_datalake bucket and select the CSV file:

Please do not forget to enable schema auto-detection.
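
For reference, the same table creation can be expressed as a load job. A minimal sketch with the BigQuery client library; the GCS URI assumes the bucket name from the first step:

from google.cloud import bigquery

PROJECT_ID = "hdm-bi-bigdata"
client = bigquery.Client(project=PROJECT_ID)

# CSV load job with schema auto-detection enabled
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

# Assumption: the CSV sits in the bucket created earlier
uri = "gs://hdm-yourlastname/webshop_history.csv"

load_job = client.load_table_from_uri(
    uri, f"{PROJECT_ID}.wh_yourlastname.sales", job_config=job_config
)
load_job.result()  # wait for the load job to finish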

First data inspection

You should be able to click on the table sales in the navigation menu:

You can execute a sample SQL query by clicking on the Query button:

Results

  1. You created a dataset for our sample data warehouse.
  2. You created a table from a CSV file in the data lake. This step could be automated so that new data is, for example, appended to the table (see the sketch after this list).
  3. You took a look at the data via the "preview" function and inspected the table's schema for correctness.
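
The append scenario from item 2 could, for example, be a nightly job that loads a new CSV export into the existing staging table. A minimal sketch; the daily file name is hypothetical:

from google.cloud import bigquery

client = bigquery.Client(project="hdm-bi-bigdata")

# Append new rows to the existing staging table instead of replacing it
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

client.load_table_from_uri(
    "gs://hdm-yourlastname/webshop_2024-01-01.csv",  # hypothetical daily export
    "hdm-bi-bigdata.wh_yourlastname.sales",
    job_config=job_config,
).result()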

Querying the staging table for products

Use this SQL code to derive the dimension table "product" and run the query.

-- derive the dimension table product from the source data
SELECT DISTINCT product, product_name FROM wh_yourlastname.sales

You should see the results in the lower part of the window:

Saving the results as a dimension table "product"

Select "Save results" and save the query result as a new BigQuery table calles "dwh_product":

Store this new table in the same dataset:
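
Programmatically, "Save results" corresponds to running the query with a destination table. A minimal sketch; the dataset name follows the convention used above:

from google.cloud import bigquery

client = bigquery.Client(project="hdm-bi-bigdata")

# Write the query result into a new table in the same dataset (overwrite if it exists)
job_config = bigquery.QueryJobConfig(
    destination="hdm-bi-bigdata.wh_yourlastname.dwh_product",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

query = """
    SELECT DISTINCT product, product_name
    FROM `hdm-bi-bigdata.wh_yourlastname.sales`
"""
client.query(query, job_config=job_config).result()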

Scheduling the query

In order to set up a productive architecture, we would need to schedule the query to be executed, e.g. every night after new data has been uploaded to the sales staging table (which we will not implement).

Select "Schedule":

We can set the repeats (e.g. daily), set the destination table (e.g. our "dwh_product") and define whether we want to append or overwrite data (in our case, overwrite would be necessary for this query).

We won't set up the schedule now.
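
Purely for reference, such a schedule could also be created via the BigQuery Data Transfer service. A rough sketch under the assumptions above (daily repeat, overwrite, destination table dwh_product); the display name is made up:

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Scheduled query: rebuild dwh_product every 24 hours and overwrite the table
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="wh_yourlastname",
    display_name="nightly dwh_product rebuild",  # made-up name
    data_source_id="scheduled_query",
    params={
        "query": "SELECT DISTINCT product, product_name FROM `hdm-bi-bigdata.wh_yourlastname.sales`",
        "destination_table_name_template": "dwh_product",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("hdm-bi-bigdata"),
    transfer_config=transfer_config,
)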

Results

  1. You created a query to derive a dimension table from a flat (de-normalized) table.
  2. You saved this table and thus built the first part of our exemplary simplified star schema.
  3. You learned how to schedule queries for operative use.

Querying the staging table for sales transactions (facts)

Similar to the dimension table generation, your task is now to derive the sales facts from the staging table. Generate a new table dwh_sales which contains the date, product, and sales_value columns by querying the staging table.

Results

  1. After having completed this task you added another part to our simple star schema.

Query

Your query should look like this:

-- derive the sales facts from the source data
SELECT date, product, sales_value FROM `hdm-bi-bigdata.wh_kueppers.sales`;

After saving the query result as the table "dwh_sales" in your dataset ("wh_kueppers" in this example), you should see three tables:

Results

  1. You now completed the simplified star schema.

BigQuery connection

Please open your sample notebook from the last lab. Since the notebook is running in the same project as our BigQuery dataset, we can immediately connect to the dataset:

import pandas_gbq

# Parameters
PROJECT_ID = 'hdm-bi-bigdata'
QUERY = """
    SELECT *
    FROM `hdm-bi-bigdata.wh_kueppers.sales`
    LIMIT 10
"""

# Run the query and load the result into a pandas DataFrame
df = pandas_gbq.read_gbq(QUERY, project_id=PROJECT_ID)

df
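
With the star schema in place, a typical analytical query joins the fact table with its dimension, e.g. total sales value per product name. A small sketch reusing the pandas_gbq connection and the tables created above:

JOIN_QUERY = """
    SELECT p.product_name, SUM(f.sales_value) AS total_sales
    FROM `hdm-bi-bigdata.wh_kueppers.dwh_sales` AS f
    JOIN `hdm-bi-bigdata.wh_kueppers.dwh_product` AS p
      ON f.product = p.product
    GROUP BY p.product_name
    ORDER BY total_sales DESC
"""

# Aggregate sales per product across the joined star schema
df_sales_per_product = pandas_gbq.read_gbq(JOIN_QUERY, project_id=PROJECT_ID)

df_sales_per_product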

Notes on BigQuery

You will probably not notice any costs for BigQuery throughout this course, since you will stay within the free query quota. We are also storing such small data volumes in our use cases that the storage costs will be negligible.

Furthermore, you might have noticed that we are not connecting BigQuery to our MySQL database. BigQuery is a serverless data warehouse: it does not actively pull data from other systems but instead loads data from sources like Google Cloud Storage (GCS), Cloud SQL (via federated queries), other BigQuery datasets, or external sources through External Tables, BigLake Tables, or Connections. There is no native direct connection for MySQL running on a VM. To use that data in BigQuery, you need an intermediate step such as exporting to GCS or a data pipeline.
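
As an illustration of such an intermediate step, a small pipeline could read from the MySQL instance and write the result into a BigQuery staging table, e.g. with pandas. The connection string and table names below are hypothetical:

import pandas as pd
import pandas_gbq
from sqlalchemy import create_engine

# Hypothetical MySQL connection (VM host, user, password, database)
engine = create_engine("mysql+pymysql://user:password@10.0.0.5/webshop")

# Extract from MySQL ...
df = pd.read_sql("SELECT * FROM sales", con=engine)

# ... and load the result into a BigQuery staging table
pandas_gbq.to_gbq(
    df,
    "wh_yourlastname.sales_from_mysql",
    project_id="hdm-bi-bigdata",
    if_exists="replace",
)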

Congratulations, you completed the third lab in our big data fundamentals course. This lab focused on setting up components of a more "traditional" BI architecture. This is nevertheless very important, since the big data architecture will augment the BI architecture, not replace it completely. Knowing how to set up "normal" ETL processes is essential for all analytics-related projects.