In this lab, we want to make use of the PubSub service and learn how to create a history of messages in our data lake using a DataFlow template. This is the flow of data:
We will be using the same topic as in the last lab (Using PubSub from Python and the Service Account Concept). However, this time we need a fixed schema (i.e. message structure) so that DataFlow can convert the messages into analyzable data.
You may want to push some messages to the topic now, varying the "product" and "sales" values (they will already be in the pipeline when our job starts):
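If you prefer to publish from a notebook instead of the UI, the following minimal sketch uses the Pub/Sub client library. The project ID and topic name are placeholders (use the values from the previous lab), and the JSON structure simply mirrors the "product" and "sales" fields mentioned above:

```python
import json
from google.cloud import pubsub_v1

# Placeholders -- replace with your own project ID and the topic from the last lab.
PROJECT_ID = "your-project-id"
TOPIC_ID = "your-topic"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

# Publish a few messages that follow the fixed schema (product, sales).
for product, sales in [("apple", 3), ("banana", 5), ("cherry", 2)]:
    payload = json.dumps({"product": product, "sales": sales}).encode("utf-8")
    future = publisher.publish(topic_path, data=payload)
    print(f"Published message with ID {future.result()}")
```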
Please open "IAM & Admin" → "IAM" in the cloud console. Find the user xxx_dataflow_xxx:
Edit the user's privileges and add the role "Project" → "Editor":
This should be the result:
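As an alternative to clicking through the console, the same role binding can be added programmatically. This is only a sketch using the Cloud Resource Manager API; the project ID and the member string are placeholders and must match the exact account you see in the IAM list:

```python
from googleapiclient.discovery import build

# Placeholders -- use your project ID and the exact account shown under IAM.
PROJECT_ID = "your-project-id"
MEMBER = "serviceAccount:your-dataflow-account@your-project-id.iam.gserviceaccount.com"

crm = build("cloudresourcemanager", "v1")

# Read the current IAM policy, append the Editor role binding, and write it back.
policy = crm.projects().getIamPolicy(resource=PROJECT_ID, body={}).execute()
policy["bindings"].append({"role": "roles/editor", "members": [MEMBER]})
crm.projects().setIamPolicy(resource=PROJECT_ID, body={"policy": policy}).execute()
```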
Please open "DataFlow" → "Jobs" in the cloud console:
We will start with a job from a template. Please create a new job:
The following settings are recommended:
All other parameters can stay default. This should be the input form:
Please note that the history will be created in 5-minute windows.
Click on "Run Job".
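If you would rather start the same job from code than from the console, a classic Google-provided template ("Pub/Sub to Text Files on Cloud Storage") can be launched via the Dataflow API. Treat the sketch below as an assumption to verify against the form shown above: the template path, parameter names, job name, region, topic, and bucket are all placeholders:

```python
from googleapiclient.discovery import build

# Placeholders / assumptions -- adapt all values to your project and the form above.
PROJECT_ID = "your-project-id"
REGION = "europe-west1"
BUCKET = "your-bucket"
TOPIC = f"projects/{PROJECT_ID}/topics/your-topic"

dataflow = build("dataflow", "v1b3")
response = dataflow.projects().locations().templates().launch(
    projectId=PROJECT_ID,
    location=REGION,
    # Classic template "Pub/Sub to Text Files on Cloud Storage".
    gcsPath="gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text",
    body={
        "jobName": "pubsub-to-gcs-history",
        "parameters": {
            "inputTopic": TOPIC,
            "outputDirectory": f"gs://{BUCKET}/history/",
            "outputFilenamePrefix": "output-",
        },
    },
).execute()
print(response["job"]["id"])
```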
Setting up the job takes a few minutes.
When the job is running, you should see a similar screen:
Please send some messages to the topic, either from the UI or via a Jupyter notebook (for example with the publishing snippet shown earlier).
You should see that your job processes messages in the "Job Graph":
Under "Job Metrics" you can verify that this job requires one worker and that it is running:
In case of higher workloads, DataFlow will automatically scale up (if you allow this in the job configuration).
When the first 5-minute window is finished, you should see a folder in your GCS bucket:
It should contain a file with a timestamp:
The file should contain the message data (one line per message):
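You can also inspect the output without the console by listing and reading the files with the Cloud Storage client. The bucket name and prefix below are placeholders; use the output directory you configured for the job:

```python
from google.cloud import storage

# Placeholders -- use the bucket and output directory from your job configuration.
BUCKET = "your-bucket"
PREFIX = "history/"

client = storage.Client()
for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    print(f"--- {blob.name} ---")
    # Each line in the file corresponds to one Pub/Sub message.
    print(blob.download_as_text())
```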
To avoid unnecessary costs, you should stop DataFlow jobs when they are not in use:
You can select "Cancel" in the popup window to make sure that the job is stopped immediately:
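If you want to stop a job from code instead, the Dataflow API lets you request the cancelled state. This is a sketch; the project ID, region, and job ID are placeholders (the job ID is shown in the console and in the launch response):

```python
from googleapiclient.discovery import build

# Placeholders -- take the job ID from the Dataflow console or the launch response.
PROJECT_ID = "your-project-id"
REGION = "europe-west1"
JOB_ID = "your-job-id"

dataflow = build("dataflow", "v1b3")
dataflow.projects().locations().jobs().update(
    projectId=PROJECT_ID,
    location=REGION,
    jobId=JOB_ID,
    # Requesting the cancelled state stops the job, like "Cancel" in the UI.
    body={"requestedState": "JOB_STATE_CANCELLED"},
).execute()
```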
You are now able to make use of a big-data-ready, real-time processing framework in the cloud. DataFlow can be used for various scenarios (we'll take a look at specifying jobs in an SQL-based manner later). Programming DataFlow natively (i.e. Apache Beam) is fairly demanding, but it is also very powerful when it comes to creating auto-scaling pipelines.