In this use case we want to store the contents of our webshop table in the data lake (nightly full extract) using Cloud Data Fusion.

Goal

Ingest structured relational data into the data lake for batch processing using Cloud Data Fusion, which runs on a horizontally scalable execution engine underneath.

What you'll implement

Importing the pipeline

If necessary, please import the pipeline you created in the lab "Data Fusion".

Adding the sink

Please add a BigQuery sink to our pipeline:

Connect the database source to the new stage so that BigQuery becomes a second sink.

Select "Properties" of the BigQuery stage.

The reference name could be

webshop_sales_ingest_bq

If you completed the lab "Data Warehouse Example", you already created a BigQuery dataset:

We will ingest our data into this dataset, "example_dwh". The table name could be

sales_from_datafusion_bq

All other parameters can keep their default values.
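
Before running the pipeline, you can verify (or create) the target dataset with the BigQuery Python client. This is only a minimal sketch: the project ID and location are placeholders, while the dataset name "example_dwh" is the one used above.

```python
from google.cloud import bigquery

PROJECT_ID = "your-gcp-project"           # placeholder: your GCP project ID
DATASET_ID = f"{PROJECT_ID}.example_dwh"  # dataset used in this lab

client = bigquery.Client(project=PROJECT_ID)

# Create the dataset if it does not exist yet; a no-op if it already does.
dataset = bigquery.Dataset(DATASET_ID)
dataset.location = "EU"  # assumption: pick the location matching your setup
client.create_dataset(dataset, exists_ok=True)
print(f"Dataset {DATASET_ID} is ready.")
```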

Finally, rename your pipeline, e.g. by appending "_bq" to the name, then save and export it:

Results

  1. You added a sink (a table in a BigQuery dataset) to the pipeline, configured it, and connected it with the source.

Export and save pipeline

To be able to reuse the pipeline we just specified, please export it to your local filesystem:
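
If you want to double-check the exported file, a small Python snippet can list the stages it contains. This is a sketch only: the file name is a placeholder, and the JSON layout (config.stages) reflects typical Data Fusion/CDAP exports and may vary between versions.

```python
import json

# Placeholder file name; use the name you chose when exporting the pipeline.
with open("webshop_sales_ingest_bq.json") as f:
    pipeline = json.load(f)

# Exported Data Fusion (CDAP) pipelines typically keep their stages under
# config.stages; adjust if your export looks different.
for stage in pipeline.get("config", {}).get("stages", []):
    plugin = stage.get("plugin", {})
    print(f"{stage.get('name')}: {plugin.get('name')} ({plugin.get('type')})")
```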

Results

  1. You saved your pipeline as a JSON file on your local machine.

Save and Deploy the Pipeline

Please click "Save" and afterwards "Deploy" in the pipeline builder. The output should look like this (showing the pipeline in the lower part):

Execute the Pipeline

Please click "Run". The pipeline is now executed using a DataProc Hadoop cluster (i.e. it is horizontally scalable and big data-ready!). Our ingestion logic is transformed to a Spark job. One can see that a cluster is being generated by opening the DataProc overview page in the cloud console:

The cluster is provisioned (i.e. set up); in Data Fusion you can see the current status:
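
If you prefer to check this programmatically instead of in the console, the Dataproc Python client can list the clusters in your region. A minimal sketch, assuming the google-cloud-dataproc package is installed; project ID and region are placeholders.

```python
from google.cloud import dataproc_v1

PROJECT_ID = "your-gcp-project"  # placeholder
REGION = "europe-west1"          # placeholder: region of your Data Fusion instance

# A regional endpoint is required when listing clusters in that region.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

# The ephemeral cluster provisioned by Data Fusion should show up here
# while the pipeline run is in progress.
for cluster in client.list_clusters(project_id=PROJECT_ID, region=REGION):
    print(cluster.cluster_name, cluster.status.state.name)
```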

If everything went fine, you should see the following output. In case of errors, check the "Logs":

Checking BigQuery

When you open BigQuery, you should see the newly created table filled with the raw data:
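
You can also verify the load programmatically with the BigQuery Python client, e.g. by checking the row count and peeking at a few records. The project ID below is a placeholder; dataset and table names are the ones configured earlier.

```python
from google.cloud import bigquery

PROJECT_ID = "your-gcp-project"  # placeholder
TABLE_ID = f"{PROJECT_ID}.example_dwh.sales_from_datafusion_bq"

client = bigquery.Client(project=PROJECT_ID)

# Confirm the table exists and see how many rows the pipeline loaded.
table = client.get_table(TABLE_ID)
print(f"{TABLE_ID} contains {table.num_rows} rows")

# Peek at a few of the raw records.
for row in client.query(f"SELECT * FROM `{TABLE_ID}` LIMIT 5").result():
    print(dict(row))
```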

Results

  1. You deployed a pipeline with an additional sink; it is now production-ready and could be scheduled to run regularly.
  2. This pipeline is big data-ready; however, you need to keep an eye on the costs of (1) pipeline development (the running Data Fusion instance) and (2) pipeline execution (the Dataproc clusters).

Please make sure to delete your Data Fusion instance in order to avoid high costs (in the case of the live lecture, please leave it running):
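
If you prefer to script the clean-up, the Data Fusion Python client offers a delete_instance call. This is a sketch under assumptions: it requires the google-cloud-data-fusion package, and project, region and instance name are placeholders; deleting the instance in the console works just as well.

```python
from google.cloud import data_fusion_v1

# Placeholders: adjust to your project, region and instance name.
INSTANCE_NAME = "projects/your-gcp-project/locations/europe-west1/instances/your-instance"

client = data_fusion_v1.DataFusionClient()

# delete_instance returns a long-running operation; wait for it to finish.
operation = client.delete_instance(name=INSTANCE_NAME)
operation.result()
print(f"Deleted {INSTANCE_NAME}")
```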

Results

  1. You finished the lab and performed all necessary clean-up tasks.

Congratulations, you set up a modern, cloud-based, horizontally scaling ingestion pipeline using GCP's Cloud Data Fusion with an automatically provisioned Hadoop cluster underneath (using Spark for the extraction/ingestion logic). Data Fusion is very powerful and would also allow for "data wrangling"; we skipped this part here since we are still concerned with data ingestion.