In this use case we want to query our data lake in an ad-hoc manner and learn how to create views of our batch processing results.

Goal

Reading from the (structured) data lake in an SQL-like manner and storing the results in a new table.

What you'll implement

Create the Dataproc cluster by executing the following command in the cloud shell:

gcloud dataproc clusters create pk-hadoop --region us-central1 --initialization-actions gs://goog-dataproc-initialization-actions-us-central1/cloud-sql-proxy/cloud-sql-proxy.sh,gs://goog-dataproc-initialization-actions-us-central1/sqoop/sqoop.sh --metadata "hive-metastore-instance=pk-bigdata:us-central1:pk-sql" --scopes sql-admin --properties=hive:hive.metastore.warehouse.dir=gs://pk-gcs/hive-warehouse --enable-component-gateway --optional-components ANACONDA,JUPYTER --project pk-bigdata

Cluster creation will take 2-3 minutes.

Results

  1. You created a Hadoop cluster (Dataproc) which provides functionality for accessing Cloud SQL (initialization action "cloud-sql-proxy") as well as the Jupyter notebook component necessary for PySpark.

Opening the Jupyter Notebook

Please open a Jupyter Notebook and save it under "BigData - Query and Extraction - PySpark SQL access to the Data Lake".
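After opening the notebook with the PySpark kernel, you can verify that the Spark session is available. The following minimal check assumes the kernel provides the pre-defined variable "spark":

# quick check: the PySpark kernel should provide a ready-to-use Spark session
spark.version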

Ad-Hoc Querying

As you already know, the variable "spark" gives us access to the Spark session and thus to the Spark SQL API. We can therefore execute our query by simply entering:

# read from data lake (via Hive)
df = spark.sql("SELECT region, avg(sales_value) as average_sales FROM sales GROUP BY region")
df.toPandas()
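As a side note, the same aggregation can also be expressed with the DataFrame API instead of a SQL string. The following is a minimal sketch; it assumes the sales table is registered in the Hive metastore (which it is, since we queried it above):

# the same aggregation expressed with the DataFrame API instead of a SQL string
from pyspark.sql.functions import avg

df_api = spark.table("sales").groupBy("region").agg(avg("sales_value").alias("average_sales"))
df_api.toPandas()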

Writing the results back to the data lake

The following command stores the results in a newly created table in the data lake:

# write to data lake and create a new table
df.write.mode("overwrite").saveAsTable("sales_statistics_region")

Creating a view in PySpark

Comparable to the BigQuery lab, you can also use Spark with Hive to create views. A view always delivers up-to-date statistics for our regions; however, querying the view is slower than querying a persisted table (like the sales_statistics_region table we created above).

# create a view
spark.sql("CREATE VIEW sales_statistics_region_view AS SELECT region, avg(sales_value) as average_sales FROM sales GROUP BY region")

Reading the view:

# read from the view
df_stat = spark.sql("SELECT * FROM sales_statistics_region_view")
df_stat.toPandas()
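If you only need the aggregated results within the current Spark session, a temporary view is an alternative to the permanent Hive view created above; the view name below is just an example:

# session-scoped alternative: a temporary view that disappears when the Spark session ends
df.createOrReplaceTempView("sales_statistics_region_tmp")
spark.sql("SELECT * FROM sales_statistics_region_tmp").toPandas()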

This is how your notebook should look now:

Results

  1. You accessed the data lake via SQL and created (1) a new table and (2) a Spark view.

Please do not delete your cluster and SQL instance before completing the next lab!

You now know how to interact with the data lake from PySpark (we did this before!) and are familiar with creating new tables and views in PySpark.