In this use case we want to store the regularly updated image of a webcam in our data lake (based on GCS) in order to, e.g., let a machine learning algorithm identify whether it's cloudy or not. We want to achieve this task in a scalable manner, i.e. be able to add potentially thousands of webcams and still guarantee performance. Take a look at the webcam to get an impression of the data we will be ingesting.
Ingesting unstructured image data into the data lake for batch processing via a (horizontally) scalable cloud function.
If you don't have Anaconda and Jupyter Notebook installed locally on your computer, please create a notebook in GCP's AI Platform using Vertex AI / Colab Enterprise:
Create a Python 3 notebook and insert the following imports into the first cell:
import requests # will be used to retrieve the image via http / a URL
from IPython.display import Image # will be used to show one exemplary image in the notebook
from datetime import datetime # will be used to format the filename
Show the current image of this webcam using the following code in another cell:
url = 'https://www.kite-connection.at/weatherstation/webcam/rohrspitz.jpg' # image url
Image(url=url, width=300) # show image in notebook
Now, let's download the current image to the local disk (of our JupyterLab machine):
filename = f"webcam_{datetime.now().strftime('%Y%m%d_%H%M%S')}.png"
response = requests.get(url)
file = open(filename, "wb")
file.write(response.content)
file.close()
When updating the file browser you should see the image (which you can open in JupyterLab):
In case you are working in the cloud JupyterLab, you can access the cloud storage easily with the following code:
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket("hdm-kueppers") # replace the bucket name with yours
blob = bucket.blob(filename)
blob.upload_from_string(response.content, content_type='image/jpeg')
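If you are running the notebook locally instead of in the cloud JupyterLab, the storage client needs explicit credentials. A minimal sketch, assuming you have downloaded a service account key with access to the bucket (the path is a placeholder):
from google.cloud import storage
# point the client at a service account key file (placeholder path)
storage_client = storage.Client.from_service_account_json("path/to/keyfile.json")
# the upload code from the cell above stays the same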
You should now be able to see this image in the cloud console (under Cloud Storage).
The notebook could now be executed, e.g. hourly, in order to create a history of webcam images. However, the notebook is only used for demo purposes; next, we want to see how such ingestion logic can be deployed to a cloud function.
Please go to cloud functions in the console:
Create a function:
We'll use the inline editor:
Set the function (=service) name to "image-ingestion-yourinitials". Choose us-central1 as region and a modern Python runtime. Set "Allow public access" under "Authentication". The other parameters can be left at their defaults. Hit create:
You are working in the file "main.py". This file holds the logic we want to deploy to our cloud function. Let's transfer the logic of the Jupyter Notebook to our method "ingest_image":
import functions_framework
import requests
from datetime import datetime
from google.cloud import storage
@functions_framework.http
def ingest_image(request):
    """Pull an image from the webcam and store it in GCS."""
    # pull the image
    url = 'https://www.kite-connection.at/weatherstation/webcam/rohrspitz.jpg'
    filename = f"webcam_fromfunction_{datetime.now().strftime('%Y%m%d_%H%M%S')}.jpg" # add a timestamp
    response = requests.get(url)
    # store it in GCS
    storage_client = storage.Client()
    bucket = storage_client.get_bucket("hdm-kueppers") # replace the bucket name with yours
    blob = bucket.blob(filename)
    blob.upload_from_string(response.content, content_type='image/jpeg')
    return 'Success'
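As a side note on the scalability goal from the introduction: the function above hardcodes a single webcam. A hypothetical variation (not required for this use case) could read the webcam URL and an ID from the request parameters, so that one deployed function can serve many cameras; the parameter names cam_id and url are assumptions for illustration:
import functions_framework
import requests
from datetime import datetime
from google.cloud import storage

@functions_framework.http
def ingest_image(request):
    """Pull an image from a webcam given via request parameters and store it in GCS."""
    # hypothetical call: <function-url>?cam_id=rohrspitz&url=https://...
    cam_id = request.args.get("cam_id", "rohrspitz")
    url = request.args.get("url", "https://www.kite-connection.at/weatherstation/webcam/rohrspitz.jpg")
    filename = f"webcam_{cam_id}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.jpg"
    response = requests.get(url)
    storage_client = storage.Client()
    bucket = storage_client.get_bucket("hdm-kueppers")  # replace the bucket name with yours
    bucket.blob(filename).upload_from_string(response.content, content_type='image/jpeg')
    return 'Success'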
Next, we'll need to specify the Python packages our cloud function requires. Click on "requirements.txt" and add these lines at the end of the file (datetime is part of Python's standard library, so it does not need to be listed):
functions-framework==3.*
google-cloud-storage
requests
Your UI should now look like this:
Please rename the function entry point to "ingest_image", then save and redeploy:
Wait 1-2 minutes until the function is fully deployed and then click on the URL:
The output should be "Success" and you should now see the output file in your bucket:
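You can also trigger the function from a notebook cell instead of the browser; a small sketch, where the URL is a placeholder you have to replace with your function's trigger URL:
import requests
function_url = "https://REGION-PROJECT.cloudfunctions.net/image-ingestion-yourinitials" # placeholder, use your function's URL
print(requests.get(function_url).text) # should print 'Success'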
Please navigate to the cloud scheduler (under "Tools") and create a job:
The job can be called "scheduled_webcam_ingest" and let's set the frequency to once per minute (* * * * *). Under "Timezone" you can search for "Central European".
Hit continue and select HTTP as target. Paste the URL of your cloud function there:
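As an aside, the same job could also be created programmatically. A minimal sketch using the google-cloud-scheduler client library, where the project ID, region, time zone and function URL are placeholders you would have to adjust:
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = client.common_location_path("your-project-id", "us-central1")  # placeholders

job = scheduler_v1.Job(
    name=f"{parent}/jobs/scheduled_webcam_ingest",
    schedule="* * * * *",  # once per minute
    time_zone="Europe/Berlin",  # Central European Time
    http_target=scheduler_v1.HttpTarget(
        uri="https://your-cloud-function-url",  # placeholder: your function's trigger URL
        http_method=scheduler_v1.HttpMethod.GET,
    ),
)

client.create_job(parent=parent, job=job)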
You should see the result in your GCS bucket. Now, every minute a new image should be added to the bucket:
Please make sure to delete your scheduled job in the cloud scheduler:
Next, you may want to delete the files in the bucket:
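If many images have accumulated, you can also clear them from the notebook. A small sketch, assuming the bucket name and the filename prefixes used above:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket("hdm-kueppers") # replace the bucket name with yours
# delete all webcam images created in this use case (both prefixes start with "webcam")
for blob in bucket.list_blobs(prefix="webcam"):
    blob.delete()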
Next, you should delete the cloud function by selecting it and hitting delete.
Congratulations, you have set up a state-of-the-art cloud ingestion pipeline using cloud functions and scheduled its execution.