We have set up our Snowplow pipeline on GCP. We were able to deploy the collector on Cloud Run, but we are still running the Repeater and Mutator on a Compute Engine VM. Could someone please suggest how we can deploy the Repeater and Mutator on Cloud Run, or any other GCP managed service, rather than on a Compute Engine VM?
Could I ask why you want to run these on Cloud Run (vs a VM)?
Neither has really been built with Cloud Run in mind, so I think you would need to write quite a bit of additional code, including:
- wrappers around both services in order to move from PubSub pull delivery to push delivery.
- possible state serialisation of mutator state and additional throttling
- accommodating idempotent operations in Cloud Run
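To give a rough idea of the first and last points, here is a minimal sketch of what a push wrapper might look like. It assumes the standard Pub/Sub push envelope (a JSON body with a base64-encoded `message.data` and a `message.messageId`). `PushHandler` and the `process` callback are hypothetical names, not part of the mutator or repeater, and the in-memory duplicate check is only illustrative, since Cloud Run instances can be recycled at any time.

```python
import base64
import json


def decode_push_envelope(body: bytes) -> tuple[str, bytes]:
    """Return (message_id, payload) from a Pub/Sub push request body."""
    envelope = json.loads(body)
    message = envelope["message"]
    payload = base64.b64decode(message.get("data", ""))
    return message["messageId"], payload


class PushHandler:
    """Hypothetical wrapper that turns push deliveries into calls on `process`."""

    def __init__(self, process):
        self._process = process
        # In-memory only: this set is lost whenever the instance is recycled,
        # so a real deployment would need a durable store for idempotency.
        self._seen = set()

    def handle(self, body: bytes) -> int:
        """Return the HTTP status the push endpoint should respond with."""
        try:
            message_id, payload = decode_push_envelope(body)
        except (KeyError, TypeError, ValueError):
            # A non-success status makes Pub/Sub retry the delivery.
            return 400
        if message_id in self._seen:
            return 204  # duplicate delivery: acknowledge without reprocessing
        self._process(payload)
        self._seen.add(message_id)
        return 204
```

Each request carries a single message, so whatever `process` does runs once per message rather than once per pulled batch.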
Hi Mike, the reason we want to move these to Cloud Run is that it is a serverless, GCP-managed platform, which we think will take care of scalability so that we don't have to manage it explicitly.
That makes sense.
Both of these components (for the most part) shouldn’t really need to scale at all, and they benefit from having PubSub pull semantics rather than having to revert to push. The mutator itself only performs occasional DDL operations (how often depends on how frequently you update schemas), so it can run on a very small VM instance. The repeater will be inserting data, but typically only data that couldn’t be sunk to BigQuery in the first instance (mostly due to mutation lag).
Although you could move this to Cloud Run, it’d be a significant refactor, and you would take a performance hit as Cloud Run would need to process each message individually (which could be a problem for the mutator from a state point of view, and for the repeater from a latency point of view). Due to the concurrency limits around Cloud Run, you would likely end up with a solution that needs more configuration than an equivalent always-on VM.
Thanks for the response, Mike. We are also looking for an option where the service running the Repeater or Mutator is shut down when not in use. We also want to avoid maintaining any infrastructure, such as a VM, ourselves. Would you suggest using App Engine instead of Cloud Run?
The mutator, at least, is stateful, so if you plan on restarting it on a regular basis you’d need to store and reload that state somehow.
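As a rough illustration of what storing and reloading state could involve, here is a minimal sketch. It assumes, purely for the example, that the state is a set of already-known column names and that `path` points at storage that survives restarts. Neither assumption comes from the mutator itself, and `save_state` / `load_state` are hypothetical helpers.

```python
import json
import os
import tempfile


def save_state(path, known_columns):
    """Write the (hypothetical) set of known columns to `path` atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(sorted(known_columns), handle)
    # Replace in one step so a crash never leaves a half-written state file.
    os.replace(tmp_path, path)


def load_state(path):
    """Return the saved set of columns, or an empty set on first start."""
    try:
        with open(path) as handle:
            return set(json.load(handle))
    except FileNotFoundError:
        return set()
```

On a managed platform the local disk is typically not durable, so the same idea would have to target something like a bucket or a database instead of a local file.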
Cloud Run only supports push delivery, whereas both the mutator and repeater rely on pull semantics. If you really want to avoid running a VM, you could likely run the Docker containers in App Engine flex with manual scaling, to ensure that there’s always at least 1 instance of each service running.
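For reference, a manual-scaling setup in App Engine flex might look roughly like the `app.yaml` below. This is an untested sketch: the `service` name is hypothetical, you would need one such file per component, and it says nothing about how the container image itself is supplied.

```yaml
# app.yaml: hypothetical sketch for one of the two components
runtime: custom        # run a Docker container rather than a built-in runtime
env: flex
service: bq-mutator    # hypothetical service name
manual_scaling:
  instances: 1         # keep exactly one instance running at all times
```

It would be worth checking the App Engine flex documentation for custom runtime requirements before relying on this, as I haven't tried it with these images.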