In this use case, we perform an exemplary ad-hoc query that accesses the data lake using BigQuery.
Setting up a "federated" connection between BigQuery and the data lake and using this connection for ad-hoc queries.
Please open "BigQuery" in the Cloud Console. From previous labs, you should have the "example_dwh" dataset available:
Now you can create a new table in that dataset:
In order to have the schema immediately available, we want to access our sales data in the data lake, which has been stored as Parquet files (see lab "RDBMS to GCS (Avro/Parquet) with sqoop"). Please browse to the folder "sales_from_sqoop_parquet" and select the Parquet file within that folder. The file format should automatically switch to "Parquet":
Set the table type to "External table". This avoids BigQuery copying the data from the Parquet file into BigQuery's internal storage format. Instead, there is a "live" connection: when queries are executed in the BigQuery UI, they read directly from the original environment (i.e. Cloud Storage and Parquet).
The table name could be "sales_in_datalake".
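As an alternative to the UI, the same external table can be created with a DDL statement in the BigQuery query editor. This is a minimal sketch; the Cloud Storage URI (bucket name and wildcard) is an assumption and must be adapted to your environment:

CREATE EXTERNAL TABLE `pk-bigdata.example_dwh.sales_in_datalake`
OPTIONS (
  format = 'PARQUET',
  -- assumption: replace "your-bucket" with the bucket holding the sqoop export
  uris = ['gs://your-bucket/sales_from_sqoop_parquet/*.parquet']
);

Since Parquet files carry their own schema, no explicit schema definition is needed in the statement.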
Select the newly created table and click on "Query table":
Try to formulate the query "mean sales_value per region" on your own (solution on the next page).
This should be your query:
SELECT region, AVG(sales_value) AS avg_sales_value
FROM `pk-bigdata.example_dwh.sales_in_datalake`
GROUP BY region;
And this should be the result:
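If the raw averages are hard to read, a slightly extended variant of the query (same table as above; the alias name is only illustrative) rounds the values and sorts the regions:

SELECT region, ROUND(AVG(sales_value), 2) AS avg_sales_value
FROM `pk-bigdata.example_dwh.sales_in_datalake`
GROUP BY region
ORDER BY avg_sales_value DESC;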
In case of recurring queries, you can save them in BigQuery:
Saved queries can furthermore be shared with other data scientists within the project. You'll find your saved queries under "Saved queries".
If you want to have the statistics ("average sales per region") available for later reuse or for inclusion in BI tools (Data Studio, PowerBI, ...) without having to formulate the query again, you can save your query as a view. Note that a view stores the query logic, not the computed results, so it is re-executed on each access. Click on "Save view" and name the view, e.g., sales_statistics_region:
The view should be shown in your dataset.
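The same view can also be created with a DDL statement; a minimal sketch using the names from above:

CREATE VIEW `pk-bigdata.example_dwh.sales_statistics_region` AS
SELECT region, AVG(sales_value) AS avg_sales_value
FROM `pk-bigdata.example_dwh.sales_in_datalake`
GROUP BY region;

Afterwards, BI tools or other queries can simply SELECT from the view as if it were a table.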
Congratulations, you now know a further mechanism for accessing big data stored in a data lake.