In this use case, our data lake holds the contents of the webshop table (from the nightly full extract). We now want to aggregate sales statistics from this data and write them back into our relational database:

Goal

Writing data from the data lake back into operational systems, e.g. an RDBMS.

What you'll implement

Data Fusion setup

Please set up the Data Fusion instance as described in the lab "Data Fusion" (https://pkuep.github.io/pk-bigdata/batch_ingestion_datafusion).

Creating a new "data wrangling" pipeline

Please name the pipeline "export_sales_statistics".

Adding GCS (CSV file) as a source

Please navigate to the folder "webshop_datalake" and select the CSV file "webshop_history":

We now want to parse this CSV file. First, we need to do the actual parsing:

Second, we need to specify that the sales_value column is of numeric type. Please open the properties of the "Wrangler" stage and set the type of the column "sales_value" to "double":
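
Conceptually, the Wrangler stage simply parses each record and casts the column type. As an orientation, here is a minimal local sketch of that logic using pandas (an assumption of this example, not part of the pipeline itself; reading gs:// paths additionally requires the gcsfs package, and the bucket path is a placeholder):

```python
# Local sketch of the Wrangler logic (not the Data Fusion stage itself).
# The bucket path is a placeholder for your data lake bucket.
import pandas as pd

df = pd.read_csv("gs://<your-bucket>/webshop_datalake/webshop_history.csv")

# Equivalent of setting the "sales_value" column type to "double" in the Wrangler properties.
df["sales_value"] = df["sales_value"].astype("float64")
```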

Creating the pipeline

Please create a pipeline afterwards:

Adding a "GROUP BY" transformation

Hit "Get Schema" once such that your screen looks like this:

Add the grouping logic as shown in the screenshot.
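
The concrete grouping columns come from the screenshot; purely as an illustration, the following continuation of the pandas sketch above assumes a hypothetical grouping column "product_name" and sums up "sales_value":

```python
# Sketch of the "Group By" stage, continuing the pandas example above
# ("df" is the parsed CSV). "product_name" is a hypothetical grouping column.
sales_statistics = (
    df.groupby("product_name", as_index=False)["sales_value"]
    .sum()
    .rename(columns={"sales_value": "total_sales_value"})
)
print(sales_statistics.head())
```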

Adding the RDBMS sink

Next, we want to provide the connection to our Cloud SQL server, which serves as an example of an external RDBMS. Add a "Database" sink:

The properties should be set as follows:

This is how the screen should look:
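
For orientation: the Database sink essentially needs a JDBC connection string (e.g. jdbc:mysql://<cloud-sql-ip>:3306/<database>), a table name, and credentials. The following sketch mimics the sink locally with SQLAlchemy; host, database, table name, and credentials are placeholders that have to match your Cloud SQL setup:

```python
# Local stand-in for the "Database" sink, assuming SQLAlchemy with the PyMySQL driver.
# All connection details and the table name are placeholders.
from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://<user>:<password>@<cloud-sql-public-ip>:3306/<database>"
)

# "sales_statistics" is the aggregated DataFrame from the sketch above.
sales_statistics.to_sql("sales_statistics", engine, if_exists="replace", index=False)
```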

Saving and exporting the pipeline

Please save and then export your pipeline for later use. The pipeline should look similar to this:

Results

  1. You created a pipeline that is capable of integrating our data lake with a "traditional" RDBMS.
  2. You furthermore learned about Data Fusion's capabilities for wrangling and aggregating data. Remember: all of this is pushed down to a Dataproc cluster and executed there as Spark jobs → this is big data ready!

Pipeline Deployment

Please deploy your pipeline:

Afterwards, you should hit "Run" to execute it. Since Data Fusion provisions an ephemeral Dataproc cluster for the execution, this will take approximately 5 minutes.

When the pipeline run is finished, you should see your successful run:

Checking the results

When we switch back to the SQL connection to our RDBMS, we should see that the batch processing results have been calculated correctly and written back to the database:
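
If you prefer checking from a script or notebook instead of the SQL client, a small query like the following does the job (a sketch; the table name "sales_statistics" and the connection details are placeholders and must match your Database sink configuration):

```python
# Quick sanity check of the exported statistics, assuming SQLAlchemy and PyMySQL.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://<user>:<password>@<cloud-sql-public-ip>:3306/<database>"
)
print(pd.read_sql("SELECT * FROM sales_statistics", engine))
```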

Results

  1. You checked if the pipeline run was successful and validated the results in the RDBMS.

Please delete your Data Fusion instance!

You now know how to extract data from the data lake and how to integrate "traditional" RDBMSs into our big data architecture. Furthermore, you are able to perform batch processing in another tool (Data Fusion).