In this use case, we perform an exemplary ad-hoc query that accesses the data lake using BigQuery.
Setting up a "federated" connection between BigQuery and the data lake and using this connection for ad-hoc queries.
Please open "BigQuery" in the Cloud Console. From previous labs, you should have the "example_dwh" dataset available:
Now you can create a new table in that dataset:
In order to have the schema immediately available, we want to access our sales data in the data lake, which has been stored as Parquet files (see lab "RDBMS to GCS (Avro/Parquet) with sqoop"). Please browse to the folder "sales_from_sqoop_parquet" and select the Parquet file within that folder. The file format should automatically switch to "Parquet":
Set the table type to "External table". This avoids BigQuery copying the data from the Parquet file into BigQuery's internal storage format. Instead, there is a "live" connection: when queries are executed in the BigQuery UI, they read directly from the original environment (i.e. Cloud Storage and Parquet).
The table name could be "sales_in_datalake".
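As an alternative to the UI, the same external table can be created with a DDL statement in the BigQuery query editor. This is a minimal sketch; the Cloud Storage URI (bucket name and wildcard) is an assumption and must be adapted to your environment:

CREATE EXTERNAL TABLE `pk-bigdata.example_dwh.sales_in_datalake`
OPTIONS (
  format = 'PARQUET',
  -- assumption: replace "your-bucket" with the bucket holding the sqoop export
  uris = ['gs://your-bucket/sales_from_sqoop_parquet/*.parquet']
);

Since Parquet files carry their own schema, no explicit schema definition is needed in the statement.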
Select the newly created table and click on "Query table":
Try to formulate the query "mean sales_value per region" on your own (solution on the next page).
This should be your query:
SELECT region, AVG(sales_value) AS avg_sales_value
FROM `pk-bigdata.example_dwh.sales_in_datalake`
GROUP BY region;
And this should be the result:
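If the raw averages are hard to read, a slightly extended variant of the query (same table as above; the alias name is only illustrative) rounds the values and sorts the regions:

SELECT region, ROUND(AVG(sales_value), 2) AS avg_sales_value
FROM `pk-bigdata.example_dwh.sales_in_datalake`
GROUP BY region
ORDER BY avg_sales_value DESC;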
In case of recurring queries, you can save them in BigQuery:
Saved queries can furthermore be shared with other data scientists within the project. You'll find your saved queries under "Saved queries".
If you want to have the statistics ("average sales per region") available for later reuse or for inclusion in BI tools (Data Studio, PowerBI, ...) without having to formulate the query again, you can save your query as a view. Note that a view stores the query logic, not the computed results, so it is re-executed on each access. Click on "Save view" and name the view, e.g., sales_statistics_region:
The view should be shown in your dataset.
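The same view can also be created with a DDL statement; a minimal sketch using the names from above:

CREATE VIEW `pk-bigdata.example_dwh.sales_statistics_region` AS
SELECT region, AVG(sales_value) AS avg_sales_value
FROM `pk-bigdata.example_dwh.sales_in_datalake`
GROUP BY region;

Afterwards, BI tools or other queries can simply SELECT from the view as if it were a table.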
Congratulations, you now know a further mechanism for accessing big data stored in a data lake.