In this lab, we want to make use of the PubSub service and learn how to create a history of messages in our data lake using a DataFlow template. This is the flow of data:

What you'll implement

Relying on the example topic "pk-topic"

We will be using the same topic as in the last lab (Using PubSub from Python and the Service Account Concept). However, this time we'll need to use a fixed schema (i.e. message structure) so that DataFlow can convert the messages into analyzable data.

Message format
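
A minimal sketch of a possible message structure, assuming only the two fields "product" and "sales" that we vary below (your lab setup may define additional fields):

```json
{
  "product": "apple",
  "sales": 3
}
```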

Publishing sample messages

You may already want to push some messages to the topic, varying the "product" and "sales" values (they will then be waiting in the pipeline when our job starts).
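
A minimal sketch of how this could be done from Python, assuming the google-cloud-pubsub client library, a placeholder project ID, and the JSON message structure sketched above (adjust the names to your setup):

```python
import json
from google.cloud import pubsub_v1

# Assumptions: replace the project ID with your own; the topic name follows the lab ("pk-topic").
PROJECT_ID = "your-project-id"
TOPIC_ID = "pk-topic"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

# A few sample messages with varying "product" and "sales" values.
samples = [
    {"product": "apple", "sales": 3},
    {"product": "banana", "sales": 7},
    {"product": "cherry", "sales": 1},
]

for message in samples:
    data = json.dumps(message).encode("utf-8")  # Pub/Sub payloads are bytes
    future = publisher.publish(topic_path, data)
    print(f"Published message {future.result()}")  # blocks until the message ID is returned
```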

Results

  1. You identified the topic you want to rely on and published some sample messages to it.

Service account rights

Please open "IAM & Admin" → "IAM" in the cloud console. Find the user xxx_dataflow_xxx:

Edit the user's privileges and add the role "Project" → "Editor":

This should be the result:

Job creation

Please open "DataFlow" → "Jobs" in the cloud console:

We will start with a job from a template. Please create a new job:

The following settings are recommended:

All other parameters can stay default. This should be the input form:

Please note that the history will be created in 5-minute windows.

Click on "Run Job".

Results

  1. You granted the DataFlow service account the privileges it needs on the project.
  2. You created and started a streaming job in DataFlow based on a template.

Job Setup

Setting up the job takes a few minutes.

When the job is running, you should see a similar screen:

Pushing messages into the topic

Please send some messages to the topic, either from the UI or via a Jupyter Notebook (e.g. using the snippet from above).

Checking the Job Graph and Job Metrics

You should see that your job processes messages in the "Job Graph":

Under "Job Metrics" you can check that one worker is required for this job and it should be running:

In case of higher workloads, DataFlow will automatically scale up (if you allow this in the job configuration).

Checking cloud storage

When the first 5-minute window is finished, you should see a folder in your GCS bucket:

It should contain a file with a timestamp:

The file should contain the message data (one line per message):
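
If you want to inspect the output without the console UI, a small sketch like the following could be used, assuming the google-cloud-storage client library and placeholder bucket/prefix names that you need to adapt to your job's output settings:

```python
from google.cloud import storage

# Assumptions: replace with your bucket and the output prefix you configured for the job.
BUCKET_NAME = "your-bucket"
PREFIX = "pubsub-history/"

client = storage.Client()

# List the files written by the finished 5-minute windows ...
blobs = list(client.list_blobs(BUCKET_NAME, prefix=PREFIX))
for blob in blobs:
    print(blob.name)

# ... and print the content of one of them (one JSON message per line).
if blobs:
    print(blobs[0].download_as_text())
```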

Results

  1. You checked the execution of a realtime job in DataFlow.
  2. You inspected the results of the job.

Stopping the job

In order to avoid unnecessary costs, you should stop DataFlow jobs when they are not in use:

You can select "Cancel" In the popup window to make sure that the job is immediately stopped:

Results

  1. You finished this lab by shutting down the running job in DataFlow.

You are now able to make use of a big-data-ready realtime processing framework in the cloud. DataFlow can be used for various scenarios (we'll take a look at specifying jobs in an SQL-based manner later). Natively programming DataFlow (i.e. writing Apache Beam pipelines) is considerably harder, but it is also very powerful when it comes to creating auto-scaling pipelines.
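
To give you a first impression of what native Beam code looks like, here is a minimal, simplified sketch in Python that only reads from the topic, applies 5-minute fixed windows, and prints the messages; it is not a replacement for the template job and assumes the apache-beam[gcp] package plus a placeholder project ID:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Assumption: replace with your own project ID; the topic name follows the lab.
TOPIC = "projects/your-project-id/topics/pk-topic"

# streaming=True is required because Pub/Sub is an unbounded source.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=TOPIC)
        | "Decode" >> beam.Map(lambda data: data.decode("utf-8"))
        | "FixedWindows" >> beam.WindowInto(FixedWindows(5 * 60))  # 5-minute windows
        | "Print" >> beam.Map(print)
    )
```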