In this use case we want to store the contents of our webshop table in the data lake (as a nightly full extract). We will ingest structured relational data into the data lake for batch processing via a (horizontally) scalable Sqoop (MapReduce) job.
Please configure a cluster (enable Jupyter Notebook, select compute family "N2" with standard persistent disks limited to 50 GB for the master and worker nodes). Do not hit "Create" yet, since we need one further step!
We'll need to activate the Sqoop initialization action, which is not selectable via the web UI. Thus, please click "Equivalent Command Line" and copy/paste the CLI call for cluster creation into a text editor:
We'll need to add the following CLI argument in order to install Sqoop on the master and worker nodes:
This is the initialization action that needs to be added:
--initialization-actions gs://goog-dataproc-initialization-actions-us-central1/sqoop/sqoop.sh
Full command:
gcloud dataproc clusters create cluster-b1cb \
  --enable-component-gateway \
  --region us-central1 \
  --no-address \
  --master-machine-type n2-standard-2 \
  --master-boot-disk-size 50 \
  --num-workers 2 \
  --worker-machine-type n2-standard-2 \
  --worker-boot-disk-size 50 \
  --image-version 2.2-debian12 \
  --optional-components JUPYTER \
  --project hdm-bi-bigdata \
  --initialization-actions gs://goog-dataproc-initialization-actions-us-central1/sqoop/sqoop.sh
Please open the GCP Cloud Shell (top right in the web console) and paste the cluster creation command there:
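Optionally, you can also set the Cloud Shell defaults so that follow-up commands do not need explicit --project/--region flags. This is not required here, since the creation command passes both explicitly; the project ID hdm-bi-bigdata is taken from that command, so adjust it if yours differs:

gcloud config set project hdm-bi-bigdata
gcloud config set dataproc/region us-central1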
Cluster creation will take a few minutes.
When cluster creation is finished, you can select the cluster in the web UI:
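If you prefer the command line, you can also check the cluster state from the Cloud Shell. This is a small sketch that assumes the cluster name cluster-b1cb and region us-central1 from the creation command above:

gcloud dataproc clusters list --region us-central1
gcloud dataproc clusters describe cluster-b1cb --region us-central1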
Open a connection to the master node, which is where Sqoop jobs are started:
Please open an SSH connection to the master node.
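Alternatively, you should be able to open the SSH session from the Cloud Shell. This is only a sketch: it assumes the default Dataproc naming convention (the master VM is named after the cluster with an "-m" suffix) and that you fill in the zone your cluster runs in. Since the cluster was created with --no-address, gcloud will typically fall back to IAP tunneling, which must be allowed in your project:

gcloud compute ssh cluster-b1cb-m --zone <your-cluster-zone>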
The following command starts the Sqoop job (the "--m 1" option forces a single mapper, which is necessary since there is no primary key defined on our table that Sqoop could use to split the work):
sqoop import --connect jdbc:mysql://<db-IP see Moodle>/<dbname, see Moodle> --username <see Moodle> --password <see Moodle> --table sales --m 1
You should now see a lot of output, since a MapReduce job (horizontally scaling!) is running and retrieving the data from the relational database.
When finished, the output should look like this:
We ingested 30 rows into our data lake.
The Sqoop job writes its data into HDFS by default (it is a "traditional" Hadoop tool). You can check this by executing the following command on the master node:
hadoop fs -ls
This will show one folder "sales". We can take a look into this folder with this command:
hadoop fs -ls sales/
With this command you can take a look at the data that was extracted from the RDBMS:
hadoop fs -cat sales/part-m-00000
This should be the output:
In order to import the relational data into our GCS-based data lake, we need to set the target directory explicitly so that Sqoop does not write to HDFS. The following command ingests the data from our database into a GCS bucket and a folder "sales_from_sqoop" therein (the command uses the bucket hdm-kueppers; make sure to use your own bucket name and change the initials if applicable):
sqoop import --connect jdbc:mysql://<db-IP see Moodle>/<dbname, see Moodle> --username <see Moodle> --password <see Moodle> --table sales --m 1 --target-dir gs://hdm-kueppers/sales_from_sqoop
This should yield the same shell output; however, your GCS bucket should now contain a folder "sales_from_sqoop":
The folder contains the same files as the HDFS folder Sqoop created in our first job run.
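You can also verify this from the command line instead of the web console. This sketch assumes the bucket name from the command above; the second variant works on the master node because Dataproc ships with the GCS connector, so hadoop fs can read gs:// paths directly:

gcloud storage ls gs://hdm-kueppers/sales_from_sqoop/
hadoop fs -ls gs://hdm-kueppers/sales_from_sqoop/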
We could set an autoscaling policy to let our Dataproc cluster scale up if necessary:
This is, however, not part of our lecture.
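For the curious, attaching such a policy would roughly look like the sketch below; my-autoscaling-policy and policy.yaml are placeholder names, and the policy file itself is not shown here:

gcloud dataproc autoscaling-policies import my-autoscaling-policy --source policy.yaml --region us-central1
gcloud dataproc clusters update cluster-b1cb --autoscaling-policy my-autoscaling-policy --region us-central1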
If you won't continue directly with the next lab (continuing is recommended), please make sure to delete your cluster:
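Instead of using the web console, the cluster can also be deleted from the Cloud Shell (assuming the cluster name and region used above):

gcloud dataproc clusters delete cluster-b1cb --region us-central1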
You may also want to clean up your GCS bucket; however, please leave the files in place during the live lecture.
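If you do want to remove the ingested files later, a recursive delete of the folder should do; this assumes the bucket and folder names used above:

gcloud storage rm --recursive gs://hdm-kueppers/sales_from_sqoop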
Congratulations, you set up a "traditional" big data ingestion pipeline that scales horizontally, using a Hadoop cluster with Sqoop. You also learned about initialization actions, which come in handy when working with Dataproc.