In this lab, we want to make use of the PubSub service and learn how to create a history of messages in our data lake using a DataFlow template. This is the flow of data:
We will be using the same topic as in the last lab (Using PubSub from Python and the Service Account Concept). However, this time we need a fixed schema (i.e. message structure) so that DataFlow can convert the messages into analyzable data.
You may want to push some messages to the topic now, varying the "product" and "sales" values (they will already be in the pipeline when our job starts):
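If you prefer to publish from a notebook instead of the UI, the following minimal sketch uses the Pub/Sub client library. The project ID and topic name are placeholders (use the values from the previous lab), and the JSON structure simply mirrors the "product" and "sales" fields mentioned above:

```python
import json
from google.cloud import pubsub_v1

# Placeholders -- replace with your own project ID and the topic from the last lab.
PROJECT_ID = "your-project-id"
TOPIC_ID = "your-topic"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

# Publish a few messages that follow the fixed schema (product, sales).
for product, sales in [("apple", 3), ("banana", 5), ("cherry", 2)]:
    payload = json.dumps({"product": product, "sales": sales}).encode("utf-8")
    future = publisher.publish(topic_path, data=payload)
    print(f"Published message with ID {future.result()}")
```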
Please open "IAM & Admin" → "IAM" in the cloud console. Find the user xxx_dataflow_xxx:
Edit the user's privileges and add the role "Project" → "Editor":
This should be the result:
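As an alternative to clicking through the console, the same role binding can be added programmatically. This is only a sketch using the Cloud Resource Manager API; the project ID and the member string are placeholders and must match the exact account you see in the IAM list:

```python
from googleapiclient.discovery import build

# Placeholders -- use your project ID and the exact account shown under IAM.
PROJECT_ID = "your-project-id"
MEMBER = "serviceAccount:your-dataflow-account@your-project-id.iam.gserviceaccount.com"

crm = build("cloudresourcemanager", "v1")

# Read the current IAM policy, append the Editor role binding, and write it back.
policy = crm.projects().getIamPolicy(resource=PROJECT_ID, body={}).execute()
policy["bindings"].append({"role": "roles/editor", "members": [MEMBER]})
crm.projects().setIamPolicy(resource=PROJECT_ID, body={"policy": policy}).execute()
```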
Please open "DataFlow" → "Jobs" in the cloud console:
We will start with a job from a template. Please create a new job:
The following settings are recommended:
All other parameters can stay default. This should be the input form:
Please note that the history will be created in 5-minute windows.
Click on "Run Job".
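If you would rather start the same job from code than from the console, a classic Google-provided template ("Pub/Sub to Text Files on Cloud Storage") can be launched via the Dataflow API. Treat the sketch below as an assumption to verify against the form shown above: the template path, parameter names, job name, region, topic, and bucket are all placeholders:

```python
from googleapiclient.discovery import build

# Placeholders / assumptions -- adapt all values to your project and the form above.
PROJECT_ID = "your-project-id"
REGION = "europe-west1"
BUCKET = "your-bucket"
TOPIC = f"projects/{PROJECT_ID}/topics/your-topic"

dataflow = build("dataflow", "v1b3")
response = dataflow.projects().locations().templates().launch(
    projectId=PROJECT_ID,
    location=REGION,
    # Classic template "Pub/Sub to Text Files on Cloud Storage".
    gcsPath="gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text",
    body={
        "jobName": "pubsub-to-gcs-history",
        "parameters": {
            "inputTopic": TOPIC,
            "outputDirectory": f"gs://{BUCKET}/history/",
            "outputFilenamePrefix": "output-",
        },
    },
).execute()
print(response["job"]["id"])
```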
Setting up the job takes a few minutes.
When the job is running, you should see a similar screen:
Please send some messages to the topic, either from the UI or via a Jupyter notebook (for example with the publishing snippet shown earlier).
You should see that your job processes messages in the "Job Graph":
Under "Job Metrics" you can verify that this job requires one worker and that it is running:
In case of higher workloads, DataFlow will automatically scale up (if you allow this in the job configuration).
When the first 5-minute window is finished, you should see a folder in your GCS bucket:
It should contain a file with a timestamp:
The file should contain the message data (one line per message):
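You can also inspect the output without the console by listing and reading the files with the Cloud Storage client. The bucket name and prefix below are placeholders; use the output directory you configured for the job:

```python
from google.cloud import storage

# Placeholders -- use the bucket and output directory from your job configuration.
BUCKET = "your-bucket"
PREFIX = "history/"

client = storage.Client()
for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    print(f"--- {blob.name} ---")
    # Each line in the file corresponds to one Pub/Sub message.
    print(blob.download_as_text())
```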
To avoid unnecessary costs, you should stop DataFlow jobs when they are not in use:
You can select "Cancel" in the popup window to make sure that the job is stopped immediately:
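If you want to stop a job from code instead, the Dataflow API lets you request the cancelled state. This is a sketch; the project ID, region, and job ID are placeholders (the job ID is shown in the console and in the launch response):

```python
from googleapiclient.discovery import build

# Placeholders -- take the job ID from the Dataflow console or the launch response.
PROJECT_ID = "your-project-id"
REGION = "europe-west1"
JOB_ID = "your-job-id"

dataflow = build("dataflow", "v1b3")
dataflow.projects().locations().jobs().update(
    projectId=PROJECT_ID,
    location=REGION,
    jobId=JOB_ID,
    # Requesting the cancelled state stops the job, like "Cancel" in the UI.
    body={"requestedState": "JOB_STATE_CANCELLED"},
).execute()
```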
You are now able to make use of a big-data-ready, real-time processing framework in the cloud. DataFlow can be used for various scenarios (we'll take a look at specifying jobs in an SQL-based manner later). Programming DataFlow natively (i.e. Apache Beam) is fairly demanding, but it is also very powerful when it comes to creating auto-scaling pipelines.