In this use case we want to query our data lake in an ad-hoc manner and learn how to create views of our batch processing results.
We will read from the (structured) data lake in an SQL-like manner and store the results in a new table.
gcloud dataproc clusters create pk-hadoop --region us-central1 --initialization-actions gs://goog-dataproc-initialization-actions-us-central1/cloud-sql-proxy/cloud-sql-proxy.sh,gs://goog-dataproc-initialization-actions-us-central1/sqoop/sqoop.sh --metadata "hive-metastore-instance=pk-bigdata:us-central1:pk-sql" --scopes sql-admin --properties=hive:hive.metastore.warehouse.dir=gs://pk-gcs/hive-warehouse --enable-component-gateway --optional-components ANACONDA,JUPYTER --project pk-bigdata
Cluster creation will take 2-3 minutes.
Please open a Jupyter Notebook and save it under "BigData - Query and Extraction - PySpark SQL access to the Data Lake".
As you already know, the variable "spark" (the SparkSession) is our entry point to the cluster and exposes the Spark SQL API. Thus, we can execute our query by simply entering:
# read from data lake (via Hive)
df = spark.sql("SELECT region, avg(sales_value) as average_sales FROM sales GROUP BY region")
df.toPandas()
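Note that toPandas() collects all result rows onto the driver. For our small aggregate this is fine; if you only want a quick preview of a larger result, you can also display it directly in Spark (a minimal sketch using the same DataFrame):
# preview the first rows directly in Spark (no transfer to pandas)
df.show(10)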
The following command allows us to store the results in a new table:
# write to data lake and create a new table
df.write.mode("overwrite").saveAsTable("sales_statistics_region")
As in the BigQuery lab, you can also use Spark with Hive to create views. A view always delivers up-to-date statistics for our regions; however, querying it is slower than querying a persisted table (like the sales_statistics_region table created above), because the aggregation is re-executed on every query.
# create a view
spark.sql("CREATE VIEW sales_statistics_region_view AS SELECT region, avg(sales_value) as average_sales FROM sales GROUP BY region")
Reading the view:
# read from the view
df_stat = spark.sql("SELECT * FROM sales_statistics_region_view")
df_stat.toPandas()
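To see the difference described above, you can compare the view with the persisted table: after new sales records arrive in the data lake, the view reflects them immediately, while sales_statistics_region keeps the values from the last saveAsTable run. A quick comparison (a sketch using the table created above):
# the persisted table returns the snapshot from the last saveAsTable run
df_table = spark.sql("SELECT * FROM sales_statistics_region")
df_table.toPandas()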
This is what your notebook should look like now:
Please do not delete your cluster and SQL instance before completing the next lab!
You now know how to interact with the data lake from PySpark (we did this before!) and how to create new tables and views in PySpark.