In this use case we want to extract data from our data lake:

Goal

Writing data from the data lake back into operational systems, e.g. an RDBMS.

DUE TO TECHNICAL ISSUES, THIS LAB DOES NOT WORK COMPLETELY!

What you'll implement

gcloud dataproc clusters create pk-hadoop --region us-central1 --initialization-actions gs://goog-dataproc-initialization-actions-us-central1/cloud-sql-proxy/cloud-sql-proxy.sh,gs://goog-dataproc-initialization-actions-us-central1/sqoop/sqoop.sh --metadata "hive-metastore-instance=pk-bigdata:us-central1:pk-sql" --scopes sql-admin --properties=hive:hive.metastore.warehouse.dir=gs://pk-gcs/hive-warehouse --enable-component-gateway --optional-components ANACONDA,JUPYTER --project pk-bigdata

Cluster creation will take 2-3 minutes.
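If you want to verify that the cluster is up before continuing, a quick check from the Cloud Shell looks like this (a minimal sketch, using the cluster name pk-hadoop and region us-central1 from the command above):

# show the provisioning state of the cluster (should eventually report RUNNING)
gcloud dataproc clusters describe pk-hadoop --region us-central1 --format="value(status.state)"

Once the state switches to RUNNING, the cluster is ready to use.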

Results

  1. You created a Hadoop cluster (Dataproc) which provides functionality for accessing Cloud SQL (initialization action "cloud-sql-proxy") and Sqoop (initialization action "sqoop"), as well as the Jupyter notebook components necessary for PySpark.

Creating the sales_statistics_region table in the RDBMS

First, we need to create the table.

Please open the Cloud SQL page in the Cloud Console and establish a connection to your relational database via the Cloud Shell:
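A minimal sketch of this connection step, assuming your Cloud SQL instance is called pk-sql (the name referenced in the hive-metastore-instance metadata above):

# open a MySQL session on the Cloud SQL instance as root (instance name assumed to be pk-sql)
gcloud sql connect pk-sql --user=root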

Once connected, we need to switch to the webshop database:

use webshop;

Afterwards, please create the table as follows so that Sqoop can push data into it:

CREATE TABLE sales_statistics_region (
region VARCHAR(20),
average_sales FLOAT
);
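If you want to make sure the (still empty) table exists as expected, you can check it directly in the same MySQL session, for example:

-- optional sanity check of the newly created target table
SHOW TABLES;
DESCRIBE sales_statistics_region;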

Opening the master node's shell

Please navigate to the cluster's master node and open an SSH connection:
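If you prefer the command line over the SSH button in the Cloud Console, something like the following should also work; note that the master node name (pk-hadoop-m, following Dataproc's default naming) and the zone are assumptions and may differ in your project:

# SSH into the Dataproc master node (name and zone are assumptions - adjust to your setup)
gcloud compute ssh pk-hadoop-m --zone us-central1-a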

This will again be our "entry point" to the Sqoop tool (remember: we added it via the initialization action). With the following command, we can write the results back to the Cloud SQL instance:

sqoop export --connect jdbc:mysql://localhost/webshop --username root --table sales_statistics_region --hcatalog-table sales_statistics_region

The parameter "hcatalog-table" determines which table from the data lake should be read. The parameter "table" directs towards the RDBMS's target table.

Results

  1. You learned how to extract data from the data lake via Sqoop and push it back into an RDBMS.

Please do not delete your cluster and SQL instances!

You now know how to extract data from the data lake and integrate a "traditional" RDBMS into our big data architecture.