Running Repeater and Mutator on Serverless Platform

Abhishek_Singh · September 28, 2021, 4:16am

Hi Team,

We have set up our Snowplow pipeline on GCP. We are able to deploy the collector on Cloud Run, but we are still running the Repeater and Mutator on a Compute Engine VM. Could someone please suggest how we can deploy the Repeater and Mutator on Cloud Run, or any other GCP managed service, rather than on a Compute Engine VM?

Thanks,
Abhishek

mike · September 28, 2021, 4:59am

Could I ask why you want to run these on Cloud Run (vs a VM)?

Both haven’t really been built with Cloud Run in mind so I think you would need to write additional code to do quite a bit including:

- Wrappers around both services in order to move from PubSub pull delivery to push delivery.
- Possibly serialising the mutator's state, plus additional throttling.
- Accommodating idempotent operations in Cloud Run.
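
To give a rough idea of what the push wrapper and the idempotency handling might involve, here is a minimal Python sketch. It is not part of the repeater or mutator. It only decodes the JSON envelope that PubSub sends to a push endpoint and skips message IDs it has already seen. The names `decode_push_envelope` and `IdempotentHandler` are hypothetical, and the `process` callback stands in for whatever the wrapped service would do with the payload.

```python
import base64
import json


def decode_push_envelope(body: bytes) -> dict:
    """Extract message ID, attributes, and raw payload from a PubSub push request body."""
    envelope = json.loads(body)
    message = envelope["message"]
    return {
        "message_id": message["messageId"],
        "attributes": message.get("attributes", {}),
        # The payload arrives base64-encoded in the "data" field.
        "data": base64.b64decode(message.get("data", "")),
    }


class IdempotentHandler:
    """Hypothetical wrapper: runs `process` at most once per message ID.

    The seen-ID set lives in memory only, which is an assumption made to keep
    the sketch short; it is lost on restart and not shared between instances.
    """

    def __init__(self, process):
        self._process = process
        self._seen = set()

    def handle(self, body: bytes) -> bool:
        msg = decode_push_envelope(body)
        if msg["message_id"] in self._seen:
            return False  # duplicate delivery, nothing to do
        self._process(msg["data"], msg["attributes"])
        # Record the ID only after processing succeeded, so a failure is retried.
        self._seen.add(msg["message_id"])
        return True
```

On Cloud Run the seen-ID set would have to live in some shared store, since each container instance has its own memory and can be shut down at any time.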

Abhishek_Singh · September 29, 2021, 11:45am

Hi Mike, the reason we want to move these to Cloud Run is that it is a serverless, GCP-managed platform, which we think will take care of scalability rather than us having to handle it explicitly.

mike · September 29, 2021, 10:39pm

That makes sense.

Both of these components (for the most part) shouldn’t really need to scale at all, and they benefit from having PubSub pull semantics rather than having to revert to push. The mutator itself only performs occasional DDL operations (how often depends on how frequently you update schemas), so it can run on a very small VM instance. The repeater will be inserting data, but typically only data that couldn’t be sunk to BigQuery in the first instance (mostly due to mutation lag).

Although you could move this to Cloud Run, it’d be a significant refactor and you would take a performance hit, as CR would need to process each message individually (this could be a problem for the mutator from a state point of view, and for the repeater from a latency point of view). Due to the concurrency limits around Cloud Run, you would end up with a solution that would likely need more configuring than an equivalent always-on VM.

Abhishek_Singh · September 30, 2021, 9:26am

Thanks for the response, Mike. We are also looking for an option where the service used for the Repeater or Mutator is shut down when not in use. Also, we need to avoid maintaining any infra like a VM ourselves. Would you suggest using App Engine instead of CR?

mike · September 30, 2021, 11:33am

The mutator at least is stateful - so if you plan on restarting it on a regular basis you’d need to be storing and reloading that state somehow.
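
As a rough illustration of what "storing and reloading" could look like, here is a minimal Python sketch. It assumes, purely for illustration, that the state is a set of column names the mutator has already created; the real mutator's internal state may look different. `save_state` and `load_state` are hypothetical helpers, and the local file stands in for whatever durable store you would actually use.

```python
import json
import os
import tempfile


def save_state(path: str, known_columns: set) -> None:
    """Persist the (assumed) set of known column names as a JSON list."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file first, then rename, so a crash mid-write
    # never leaves a half-written state file behind.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(sorted(known_columns), tmp)
    os.replace(tmp_path, path)


def load_state(path: str) -> set:
    """Reload the state on startup; start empty if nothing was saved yet."""
    try:
        with open(path) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()
```

On a managed platform the local disk is usually not durable across restarts, so in practice the file would need to be replaced by something like a bucket or a database.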

Cloud Run only supports push delivery and doesn’t support the pull semantics that both the mutator and repeater rely on. If you really want to avoid running a VM, you could likely use the Docker containers in App Engine flex, combined with manual scaling to ensure that there’s always at least 1 instance of both services running.
