In this use case we want to extract data from our data lake:

Goal

Writing data from the data lake back into operational systems, e.g. an RDBMS.

DUE TO TECHNICAL ISSUES, THIS LAB DOES NOT WORK COMPLETELY!

What you'll implement

gcloud dataproc clusters create pk-hadoop --region us-central1 --initialization-actions gs://goog-dataproc-initialization-actions-us-central1/cloud-sql-proxy/cloud-sql-proxy.sh,gs://goog-dataproc-initialization-actions-us-central1/sqoop/sqoop.sh --metadata "hive-metastore-instance=pk-bigdata:us-central1:pk-sql" --scopes sql-admin --properties=hive:hive.metastore.warehouse.dir=gs://pk-gcs/hive-warehouse --enable-component-gateway --optional-components ANACONDA,JUPYTER --project pk-bigdata

Cluster creation will take 2-3 minutes.
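If you want to verify that the cluster is up before continuing, a quick check from the Cloud Shell looks like this (a minimal sketch, using the cluster name pk-hadoop and region us-central1 from the command above):

# show the provisioning state of the cluster (should eventually report RUNNING)
gcloud dataproc clusters describe pk-hadoop --region us-central1 --format="value(status.state)"

Once the state switches to RUNNING, the cluster is ready to use.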

Results

  1. You created a Hadoop cluster (Dataproc) which provides functionality for accessing Cloud SQL (initialization action "cloud-sql-proxy") and Sqoop (initialization action "sqoop"), as well as the Jupyter notebook components necessary for PySpark.

Creating the sales_statistics_region table in the RDBMS

First, we need to create the table.

Please open the Cloud SQL page in the Cloud Console and establish a connection to your relational database via the Cloud Shell:
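A minimal sketch of this connection step, assuming your Cloud SQL instance is called pk-sql (the name referenced in the hive-metastore-instance metadata above):

# open a MySQL session on the Cloud SQL instance as root (instance name assumed to be pk-sql)
gcloud sql connect pk-sql --user=root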

Once connected, we need to switch to the webshop database:

use webshop;

Afterwards, please create the table as follows so that Sqoop can push data into it:

CREATE TABLE sales_statistics_region (
region VARCHAR(20),
average_sales FLOAT
);
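If you want to make sure the (still empty) table exists as expected, you can check it directly in the same MySQL session, for example:

-- optional sanity check of the newly created target table
SHOW TABLES;
DESCRIBE sales_statistics_region;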

Opening the master node's shell

Please navigate to the cluster's master node and open an SSH connection:
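If you prefer the command line over the SSH button in the Cloud Console, something like the following should also work; note that the master node name (pk-hadoop-m, following Dataproc's default naming) and the zone are assumptions and may differ in your project:

# SSH into the Dataproc master node (name and zone are assumptions - adjust to your setup)
gcloud compute ssh pk-hadoop-m --zone us-central1-a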

This will again be our "entry point" to the Sqoop tool (remember: we added it via the initialization action). With the following command, we can write the results back to the Cloud SQL instance:

sqoop export --connect jdbc:mysql://localhost/webshop --username root --table sales_statistics_region --hcatalog-table sales_statistics_region

The parameter "hcatalog-table" determines which table from the data lake should be read. The parameter "table" directs towards the RDBMS's target table.

Results

  1. You learned how to extract data from the data lake via Sqoop and push it back into an RDBMS.

Please do not delete your cluster and SQL instances!

You now know how to extract data from the data lake and integrate a "traditional" RDBMS into our big data architecture.