In this use case we want to store the contents of our webshop table in the data lake (nightly full extract).

Goal

Ingesting structured relational data into the data lake for batch processing via a (horizontally) scalable sqoop (MapReduce) job.

What you'll implement

Creating the DataProc cluster

Please configure a cluster (enable Jupyter Notebook, select the compute family "N2", and use standard persistent disks limited to 50 GB for both master and worker nodes). Do not hit "Create" yet, since we'll need a further step!

We'll need to activate the sqoop initialization action, which is not selectable via the web UI. Therefore, please click "Equivalent Command Line" and copy the CLI call for cluster creation into a text editor:

We'll need to add the following CLI argument in order to install sqoop on the master and worker nodes:

This is the initialization action that needs to be added:

 --initialization-actions gs://goog-dataproc-initialization-actions-us-central1/sqoop/sqoop.sh

Full command:

gcloud dataproc clusters create cluster-b1cb \
  --enable-component-gateway \
  --region us-central1 \
  --no-address \
  --master-machine-type n2-standard-2 \
  --master-boot-disk-size 50 \
  --num-workers 2 \
  --worker-machine-type n2-standard-2 \
  --worker-boot-disk-size 50 \
  --image-version 2.2-debian12 \
  --optional-components JUPYTER \
  --project hdm-bi-bigdata \
  --initialization-actions gs://goog-dataproc-initialization-actions-us-central1/sqoop/sqoop.sh

Please open the GCP Cloud Shell (top right in the web console) and paste the cluster creation command there:

Cluster creation will take some minutes.
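While you wait, you can check the provisioning status from the Cloud Shell; a quick sketch assuming the cluster name and region from the command above:

# The cluster status switches to RUNNING once provisioning has finished
gcloud dataproc clusters describe cluster-b1cb --region us-central1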

When cluster creation is finished, you can select the cluster in the web UI:

Open an SSH connection to the master node, from which sqoop jobs can be started:

Results

  1. You created a Hadoop cluster (DataProc) and installed the sqoop component for batch ingestion of relational data into our data lake.
  2. You learned how to set up a DataProc cluster via the cloud shell.
  3. Next, we'll use the connection to the master node and execute our sqoop data ingestion job.

Executing the job (with HDFS import)

Please open the SSH connection to the master node.
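If you prefer the command line over the web UI's SSH button, the connection can also be opened from the Cloud Shell. A sketch assuming the default master node name (cluster name plus "-m") and a zone in us-central1 (check the cluster's VM instances tab for the actual zone); since the cluster was created with --no-address, the session is tunneled through IAP:

# SSH into the Dataproc master node via an IAP tunnel (the VMs have no external IP)
gcloud compute ssh cluster-b1cb-m --zone us-central1-a --tunnel-through-iap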

The following command starts the sqoop job (the "--m 1" option is necessary since there is no primary key defined on our table).

sqoop import --connect jdbc:mysql://<db-IP see Moodle>/<dbname, see Moodle> --username <see Moodle> --password <see Moodle> --table sales --m 1

You should now see a lot of output, since a MapReduce (horizontally scaling!) job is running and retrieving the data from the relational database.
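If you are curious, you can watch the job from a second SSH session on the master node; YARN's command line client lists the running applications (no lab-specific assumptions here):

# Show MapReduce applications currently tracked by YARN
yarn application -list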

When finished, the output should look like this:

We ingested 30 rows into our data lake.
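As an aside: the sales table has no primary key, which is why we force a single mapper. If the table had a primary key or another evenly distributed numeric column, sqoop could split the work across several mappers. A hedged sketch, assuming a hypothetical id column:

# Hypothetical parallel import: 4 mappers, rows partitioned by the (assumed) id column
sqoop import --connect jdbc:mysql://<db-IP see Moodle>/<dbname, see Moodle> --username <see Moodle> --password <see Moodle> --table sales --split-by id --m 4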

Checking the results

The sqoop job puts its data into HDFS by default (it is a "traditional" tool). You can check this by executing the following command on the master node:

hadoop fs -ls

This will show one folder, "sales". We can take a look inside this folder with the following command:

hadoop fs -ls sales/

With this command you can take a look at the data that was extracted from the RDBMS:

hadoop fs -cat sales/part-m-00000

This should be the output:
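To double-check the 30 rows reported by sqoop, you can count the lines of the exported file (sqoop's text output contains one line per table row):

# Should print 30, matching the sqoop job summary
hadoop fs -cat sales/part-m-00000 | wc -l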

Executing the job (with GCS import)

In order to import the relational data into our GCS-based data lake, we need to set the target directory explicitly so that HDFS is not used. The following command ingests the data from our database into a GCS bucket (pk-gcs) and, within it, a folder "sales_from_sqoop" (make sure to adjust the bucket name to your initials if applicable):

sqoop import --connect jdbc:mysql://<db-IP see Moodle>/<dbname, see Moodle> --username <see Moodle> --password <see Moodle> --table sales --m 1 --target-dir gs://hdm-kueppers/sales_from_sqoop

This should yield the same shell output; however, your GCS bucket should now contain a folder "sales_from_sqoop":

The folder contains the same files as the HDFS folder sqoop created in our first job run.
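You can verify this from the Cloud Shell (or the master node) as well; a quick sketch assuming the bucket and target directory from the command above:

# List the files sqoop wrote to the GCS target directory
gsutil ls gs://hdm-kueppers/sales_from_sqoop/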

We could set an autoscaling policy to let our DataProc cluster scale up if necessary:

This is, however, not part of our lecture.
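For the curious, this is roughly how it would look on the command line; a minimal sketch, assuming a policy definition in a (hypothetical) file autoscaling.yaml and the cluster/region from above:

# Register the autoscaling policy (autoscaling.yaml would define min/max worker counts)
gcloud dataproc autoscaling-policies import sqoop-lab-policy --source autoscaling.yaml --region us-central1

# Attach the policy to the running cluster
gcloud dataproc clusters update cluster-b1cb --autoscaling-policy sqoop-lab-policy --region us-central1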

Results

  1. You executed a sqoop (MapReduce) job with different configurations (HDFS output first and GCS output second).
  2. This setup is ready for heavy big data workloads, since MapReduce is capable of scaling out (horizontally): if necessary, you can add worker nodes to your cluster, as shown below (however, the SQL database might become a bottleneck).
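Manually scaling the cluster out is a single command; a sketch assuming the cluster name and region from this lab:

# Increase the number of primary workers from 2 to 4
gcloud dataproc clusters update cluster-b1cb --region us-central1 --num-workers 4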

If you are not going to continue with the next lab right away (continuing is recommended), please make sure to delete your cluster:
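Alternatively, the cluster can be deleted from the Cloud Shell; a sketch assuming the cluster name and region from this lab:

# Tear down the Dataproc cluster (data already written to GCS is not affected)
gcloud dataproc clusters delete cluster-b1cb --region us-central1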

You may also want to clean up your GCS bucket. However, especially in the live lecture please leave the files in place.
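If you do clean up, the sqoop output folder can be removed as follows (a sketch assuming the bucket and folder from the GCS import above):

# Recursively remove the sqoop output folder from the bucket
gsutil rm -r gs://hdm-kueppers/sales_from_sqoop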

Results

  1. You finished the lab and performed all necessary clean-up tasks.

Congratulations, you set up a "traditional" big data ingestion pipeline that scales horizontally, using a Hadoop cluster with sqoop. You also learned about initialization actions, which come in handy when working with DataProc.