This is again our use case: a simple webshop for which we now want to build a very simple BI architecture, i.e. a data warehouse in the "traditional" sense.
We will load example source data from Cloud Storage into a DWH staging area, transform the data into a star schema, and make the results available for reporting.
Please go to BigQuery in the console:
If not already selected, please select your project in the "Explorer":
We now want to "Add Data":
Select "Google Cloud Storage" as data source:
Select "Browse" to be able to navigate to your bucket:
Navigate to the webshop_datalake bucket and select the CSV file:
Now we need to create a new dataset in our project before we can load the data.
Please call the dataset "example_dwh" and make it a multi-regional dataset in the US. Leave all other settings as default and hit "Create dataset":
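As an aside, the same dataset could also be created with a SQL DDL statement in the query editor instead of the UI; a minimal sketch (using the project ID pk-bigdata from the later examples, replace it with your own):
-- create the dataset (sketch; equivalent to the UI dialog above)
CREATE SCHEMA `pk-bigdata.example_dwh` OPTIONS (location = 'US');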
You should now be able to select your dataset, enter a table name (sales_staging), and enable schema auto-detection:
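If you prefer SQL over the UI wizard, the LOAD DATA statement can perform the same load; a sketch, assuming the bucket is called webshop_datalake and the file webshop_history.csv (both names are assumptions, adjust them to your actual bucket and file):
-- load the CSV from Cloud Storage into the staging table (sketch; bucket and file names are assumptions)
LOAD DATA OVERWRITE `pk-bigdata.example_dwh.sales_staging`
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,  -- skip the header row
  uris = ['gs://webshop_datalake/webshop_history.csv']
);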
Please reload the "Explorer" page so that the dataset becomes visible in your project when you click the arrow next to your project ID:
You can take a look at the schema (check if the data types are as expected), preview data, or query the table by clicking on the table name:
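For a quick check of the loaded data you can also run a small query, for example:
-- peek at a few staging rows to verify the load and the detected types
SELECT * FROM `pk-bigdata.example_dwh.sales_staging` LIMIT 10;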
If not done in the previous step, please select the table "sales_staging" in your project's dataset "example_dwh" and hit "Query → In new tab":
A sample query is created:
Use this SQL code to derive the dimension table "product" and run the query.
-- derive the dimension table product from the source data
SELECT DISTINCT product, product_name FROM `pk-bigdata.example_dwh.sales_staging`
You should see the results in the lower part of the window:
Select "Save results" and save the query result as a new BigQuery table calles "dwh_product":
Store this new table in the same dataset:
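As an alternative to "Save results", the table could also be created directly in SQL; a sketch using a CREATE OR REPLACE TABLE statement:
-- materialize the product dimension in one statement (sketch; equivalent to saving the query results)
CREATE OR REPLACE TABLE `pk-bigdata.example_dwh.dwh_product` AS
SELECT DISTINCT product, product_name
FROM `pk-bigdata.example_dwh.sales_staging`;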
To set up a production-grade architecture, we would need to schedule the query to run regularly, e.g. every night after new data has been uploaded to the sales_staging table (we will not implement this here).
Select "Schedule":
Enable the required API (the BigQuery Data Transfer Service API used for scheduled queries):
Afterwards, take a look at the options you get when "creating a new scheduled query":
We can set the repeat frequency (e.g. daily), set the destination table (e.g. our "dwh_product"), and define whether to append or overwrite data (in our case, overwrite would be necessary for this query).
We won't set up the schedule now.
Similar to the dimension table generation, your task is now to derive the sales facts from the staging table. Generate a new table dwh_sales which contains the date, product, and sales_value columns by querying the staging table.
Your query should look like this:
-- derive the sales facts from the source data
SELECT date, product, sales_value FROM `pk-bigdata.example_dwh.sales_staging`;
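As with the dimension, you can save the result as a table called "dwh_sales" via "Save results", or materialize it directly in SQL; a sketch:
-- materialize the sales fact table (sketch; equivalent to saving the query results)
CREATE OR REPLACE TABLE `pk-bigdata.example_dwh.dwh_sales` AS
SELECT date, product, sales_value
FROM `pk-bigdata.example_dwh.sales_staging`;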
Your dataset "example_dwh" should now contain three tables.
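You can verify this in the Explorer or with a small metadata query, for example:
-- list all tables in the dataset (should return sales_staging, dwh_product, and dwh_sales)
SELECT table_name
FROM `pk-bigdata.example_dwh`.INFORMATION_SCHEMA.TABLES;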
Go to https://datastudio.google.com and create a new report. In the "Add data" panel, select "BigQuery".
You first need to authorize "Looker Studio" (formerly Data Studio):
Navigate via your project and dataset to the table "dwh_sales". Add this to the report.
Allow adding the data to the report:
Rename the report to "BigData Course - Fundamentals - Sample DWH". You will notice that Looker Studio has already added a visualization (a table) based on the input table:
First, we want to add the previously generated dimension "product" to our Looker Studio data sources. Please add this BigQuery table by clicking on "Add data" and navigating to the product dimension table. Afterwards, you should see both tables on the right side of the screen:
Select "Resource" and "manage linked data":
Add a join:
First, add a further table to the modeler and define the fields to be used within the join as shown below:
Next, click on "configure join" and define the following join condition:
Add this join to our data model by saving it (lower right part of the screen). You should see another data source:
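Conceptually, the blend corresponds to a simple join between the fact and the dimension table. If you wanted to check it in BigQuery instead of Looker Studio, an equivalent query would look roughly like this (a sketch, assuming the product column is the join key, not required for the lab):
-- join facts and dimension in SQL (sketch; mirrors the Looker Studio blend)
SELECT
  s.date,
  s.product,
  p.product_name,
  s.sales_value
FROM `pk-bigdata.example_dwh.dwh_sales` AS s
JOIN `pk-bigdata.example_dwh.dwh_product` AS p
  ON s.product = p.product;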
We can now use fields from both tables (dimension and fact) to, for example, create a line chart of sales over time:
Configure this visualization as follows:
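The line chart essentially aggregates sales per date. If you want to check the numbers directly in BigQuery, the equivalent aggregation would be roughly:
-- total sales per day (sketch; this is what the line chart visualizes)
SELECT date, SUM(sales_value) AS total_sales
FROM `pk-bigdata.example_dwh.dwh_sales`
GROUP BY date
ORDER BY date;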
You will probably not notice any BigQuery costs throughout this course, since you will stay below the free query quota. Also, the data volumes we store in our use cases are currently so small that the storage costs will be negligible.
Congratulations, you have completed the third lab in our big data fundamentals course. This lab focused on setting up components of a more "traditional" BI architecture. This is nonetheless very important, since a big data architecture augments the BI architecture rather than replacing it completely. Knowing how to set up "normal" ETL processes is essential for all analytics-related projects.