In this use case we want to store the contents of our webshop table in the data lake (nightly full extract) using Cloud Data Fusion.
The goal is to ingest structured relational data into the data lake for batch processing, using the horizontally scalable infrastructure underlying Cloud Data Fusion.
If necessary, please import the pipeline you created in the lab "Data Fusion".
Please add a BigQuery sink to our pipeline:
Connect the database source such that BigQuery becomes a second sink.
Select "Properties" of the BigQuery stage.
The reference name could be
webshop_sales_ingest_bq
If you executed the lab "Data Warehouse Example", you already created a BigQuery dataset:
We will ingest our data into this dataset "example_dwh". The table name could be
sales_from_datafusion_bq
All other parameters can keep their default values.
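To see where these settings end up, the following sketch assembles the BigQuery sink stage roughly as it appears in the exported pipeline JSON. The field names follow the general CDAP pipeline format, but the exact layout and plugin name depend on your Data Fusion version, so treat this as illustrative rather than authoritative.

```python
import json

# Illustrative sketch of the BigQuery sink stage in the exported pipeline
# JSON. Plugin and field names are assumptions based on the CDAP pipeline
# format; check your own export for the exact structure.
bigquery_sink_stage = {
    "name": "BigQuery",
    "plugin": {
        "name": "BigQueryTable",
        "type": "batchsink",
        "properties": {
            "referenceName": "webshop_sales_ingest_bq",  # reference name from above
            "dataset": "example_dwh",                    # dataset from the DWH lab
            "table": "sales_from_datafusion_bq",         # target table name
        },
    },
}

print(json.dumps(bigquery_sink_stage, indent=2))
```

All three values set in the UI (reference name, dataset, table) are stored as plain plugin properties, which is why an exported pipeline can be edited and re-imported.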
Finally, rename your pipeline, e.g. by appending "_bq" to its name, then save and export it:
In order to be able to reuse our specified pipeline, please export it to your local filesystem:
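The renaming step can also be done on the exported file itself. A minimal sketch, assuming the export carries a top-level "name" field (as CDAP pipeline exports do); the helper name is hypothetical:

```python
import json

# Minimal sketch: append "_bq" to the pipeline's top-level "name" field so
# the exported spec can be re-imported under the new name. The rest of the
# exported JSON (stages, connections) is left untouched.
def rename_pipeline(spec: dict, suffix: str = "_bq") -> dict:
    renamed = dict(spec)
    if not renamed.get("name", "").endswith(suffix):
        renamed["name"] = renamed.get("name", "pipeline") + suffix
    return renamed

# Example with a stripped-down spec (a real export also holds the full
# stage and connection configuration):
spec = json.loads('{"name": "webshop_sales_ingest", "artifact": {"name": "cdap-data-pipeline"}}')
renamed = rename_pipeline(spec)
print(renamed["name"])  # webshop_sales_ingest_bq
```

In practice you would read the exported file with `json.load`, apply the helper, and write the result back before importing it again.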
Please click "Save" and afterwards "Deploy" in the pipeline builder. The output should look like this (showing the pipeline in the lower part):
Please click "Run". The pipeline is now executed on a Dataproc Hadoop cluster (i.e. it is horizontally scalable and big-data-ready). Our ingestion logic is translated into a Spark job. You can watch the cluster being created on the Dataproc overview page in the Cloud Console:
The cluster is provisioned (i.e. set up); in Data Fusion you can follow the current status:
If everything went well, you should see the following output. In case of errors, check the "Logs":
When you open BigQuery, you should see the newly created table filled with the raw data:
Please make sure to delete your Data Fusion instance in order to avoid unnecessary costs (during the live lecture, please leave it running):
Congratulations, you have set up a modern, cloud-based, horizontally scaling ingestion pipeline using GCP's Cloud Data Fusion, backed by an automatically provisioned Hadoop cluster (with Spark executing the extraction/ingestion logic). Data Fusion is very powerful and would also allow for "data wrangling"; we skipped that part here since we are still concerned with data ingestion.