In this use case we want to get data out of our data lake: writing data from the data lake back into operational systems, e.g. an RDBMS.
DUE TO TECHNICAL ISSUES, THIS LAB DOES NOT WORK COMPLETELY!
gcloud dataproc clusters create pk-hadoop --region us-central1 --initialization-actions gs://goog-dataproc-initialization-actions-us-central1/cloud-sql-proxy/cloud-sql-proxy.sh,gs://goog-dataproc-initialization-actions-us-central1/sqoop/sqoop.sh --metadata "hive-metastore-instance=pk-bigdata:us-central1:pk-sql" --scopes sql-admin --properties=hive:hive.metastore.warehouse.dir=gs://pk-gcs/hive-warehouse --enable-component-gateway --optional-components ANACONDA,JUPYTER --project pk-bigdata
Cluster creation will take 2-3 minutes.
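If you want to check the progress, the following command (a sketch using the same cluster name, region, and project as above) shows the cluster's status in the Cloud Shell:

gcloud dataproc clusters describe pk-hadoop --region us-central1 --project pk-bigdata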
First, we need to create the target table in the relational database.
Please open the Cloud SQL page in the Cloud Console and establish a connection to your relational database via the Cloud Shell:
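One way to do this (assuming the instance is named pk-sql, as in the hive-metastore-instance metadata above) is the gcloud sql connect command; it will prompt you for the root password:

gcloud sql connect pk-sql --user=root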
Once connected, we need to switch to the webshop database:
use webshop;
Afterwards, please create the table as follows so that Sqoop can push data into it:
CREATE TABLE sales_statistics_region (
region VARCHAR(20),
average_sales FLOAT
);
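To double-check that the table was created as expected, you can (optionally) inspect it before running the export:

SHOW TABLES;
DESCRIBE sales_statistics_region;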
Please navigate to the cluster's master node and open an SSH connection:
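You can use the SSH button in the Cloud Console, or connect from the Cloud Shell; the following is a sketch assuming the default master-node name pk-hadoop-m and the zone us-central1-a (please adjust the zone to the one your cluster actually runs in):

gcloud compute ssh pk-hadoop-m --zone us-central1-a --project pk-bigdata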
This will again be our "entry point" to the Sqoop tool (remember: we added it via the initialization action). With the following command, we can write the results back to the Cloud SQL database:
sqoop export --connect jdbc:mysql://localhost/webshop --username root --table sales_statistics_region --hcatalog-table sales_statistics_region
The parameter "hcatalog-table" determines which table in the data lake (i.e. in the Hive warehouse) should be read, while the parameter "table" specifies the target table in the RDBMS.
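If the export completes (see the note at the top regarding technical issues), you can verify the result by reconnecting to the Cloud SQL instance as above and querying the target table:

use webshop;
SELECT * FROM sales_statistics_region;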
Please do not delete your cluster and SQL instances!
You now know how to extract data from the data lake and how to integrate a "traditional" RDBMS into our big data architecture.