In this use case, our data lake holds the contents of the webshop table (from the nightly full extract). We now want to aggregate sales statistics from this data and write them back into our relational database:

Goal

Writing data from the data lake back into operational systems, e.g. an RDBMS.

What you'll implement

Data Fusion setup

Please set up the Data Fusion instance as described in the lab "Data Fusion" (https://pkuep.github.io/pk-bigdata/batch_ingestion_datafusion).

Creating a new "data wrangling" pipeline

Please name the pipeline "export_sales_statistics".

Adding GCS (CSV file) as a source

Please navigate to the folder "webshop_datalake" and select the CSV file "webshop_history":

We now want to parse this CSV file. First, we need to do the actual parsing:

Second, we need to specify that the sales_value column is of numeric type. Please open the properties of the "Wrangler" stage and set the type of the column "sales_value" to "double":
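
Conceptually, the Wrangler stage simply parses each record and casts the column type. As an orientation, here is a minimal local sketch of that logic using pandas (an assumption of this example, not part of the pipeline itself; reading gs:// paths additionally requires the gcsfs package, and the bucket path is a placeholder):

```python
# Local sketch of the Wrangler logic (not the Data Fusion stage itself).
# The bucket path is a placeholder for your data lake bucket.
import pandas as pd

df = pd.read_csv("gs://<your-bucket>/webshop_datalake/webshop_history.csv")

# Equivalent of setting the "sales_value" column type to "double" in the Wrangler properties.
df["sales_value"] = df["sales_value"].astype("float64")
```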

Creating the pipeline

Please create a pipeline afterwards:

Adding a "GROUP BY" transformation

Hit "Get Schema" once such that your screen looks like this:

Add the grouping logic as shown in the screenshot.
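
The concrete grouping columns come from the screenshot; purely as an illustration, the following continuation of the pandas sketch above assumes a hypothetical grouping column "product_name" and sums up "sales_value":

```python
# Sketch of the "Group By" stage, continuing the pandas example above
# ("df" is the parsed CSV). "product_name" is a hypothetical grouping column.
sales_statistics = (
    df.groupby("product_name", as_index=False)["sales_value"]
    .sum()
    .rename(columns={"sales_value": "total_sales_value"})
)
print(sales_statistics.head())
```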

Adding the RDBMS sink

Next, we want to provide the connection to our Cloud SQL server, which serves as an example of an external RDBMS. Add a "Database" sink:

The properties should be set as follows:

This is how the screen should look:
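
For orientation: the Database sink essentially needs a JDBC connection string (e.g. jdbc:mysql://<cloud-sql-ip>:3306/<database>), a table name, and credentials. The following sketch mimics the sink locally with SQLAlchemy; host, database, table name, and credentials are placeholders that have to match your Cloud SQL setup:

```python
# Local stand-in for the "Database" sink, assuming SQLAlchemy with the PyMySQL driver.
# All connection details and the table name are placeholders.
from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://<user>:<password>@<cloud-sql-public-ip>:3306/<database>"
)

# "sales_statistics" is the aggregated DataFrame from the sketch above.
sales_statistics.to_sql("sales_statistics", engine, if_exists="replace", index=False)
```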

Saving and exporting the pipeline

Please save and then export your pipeline for later use. The pipeline should look similar to this:

Results

  1. You created a pipeline that is capable of integrating our data lake with a "traditional" RDBMS.
  2. You furthermore learned about Data Fusion's capabilities for wrangling and aggregating data. Remember: all of this is pushed down to a Dataproc cluster and executed there as Spark jobs → this is big data ready!

Pipeline Deployment

Please deploy your pipeline:

Afterwards, you should hit "Run" to execute it. Since Data Fusion provisions an ephemeral Dataproc cluster for the execution, this will take approximately 5 minutes.

When the pipeline run is finished, you should see your successful run:

Checking the results

When we switch back to the SQL connection to our RDBMS, we should see that the batch processing results have been calculated correctly and written back to the database:
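
If you prefer checking from a script or notebook instead of the SQL client, a small query like the following does the job (a sketch; the table name "sales_statistics" and the connection details are placeholders and must match your Database sink configuration):

```python
# Quick sanity check of the exported statistics, assuming SQLAlchemy and PyMySQL.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://<user>:<password>@<cloud-sql-public-ip>:3306/<database>"
)
print(pd.read_sql("SELECT * FROM sales_statistics", engine))
```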

Results

  1. You checked if the pipeline run was successful and validated the results in the RDBMS.

Please delete your Data Fusion instance!

You now know how to extract data from the data lake and how to integrate "traditional" RDBMSs into our big data architecture. Furthermore, you are able to perform batch processing in another tool (Data Fusion).