We are planning to upgrade our Snowplow components from Dataflow jobs to App Engine in GCP. The upgrade includes moving to the latest versions as below:
Collector: 2.3.0 to 2.4.1
Enricher: Beam Enrich 2.0.1 (Dataflow job) to Enrich Pub/Sub 2.0.3 (App Engine)
BigQuery Loader: 0.6.4 (Dataflow) to 1.0.1 (App Engine)
GCS Loader: 0.3.1 to 0.3.2 (remains on Dataflow)
Repeater/Mutator: to 1.0.1 (from a VM to App Engine)
So I just want to check if you have any guides, steps, or best practices that we can use for the upgrade?
I recommend going all the way to version 2.4.5, which fixes a few bugs and security vulnerabilities compared to 2.4.1. In most cases the upgrade from 2.3.0 is very easy; there is nothing you need to change in your configuration. But if you terminate SSL at the collector, it is a little more complicated, because the SSL configuration changed, as described in the docs.
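For illustration, the collector's SSL section in recent 2.4.x versions takes roughly this shape (a sketch only; the exact keys, defaults, and how the certificate itself is supplied, e.g. via JVM keystore properties, should be confirmed against the collector configuration reference for your target version):

```
# config.hocon (fragment) - hypothetical values for illustration
collector {
  ssl {
    enable = true     # terminate TLS at the collector
    redirect = true   # redirect plain-HTTP traffic to the HTTPS port
    port = 443        # HTTPS listener port
  }
}
```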
The latest version is 2.0.5. It's good that you are moving to enrich-pubsub, because the Dataflow version will soon be deprecated. Our docs site has plenty of information on how to run enrich-pubsub. Compared to the Dataflow version, it has a different command line and config file.
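As a sketch of the new command line (file names here are placeholders; check the enrich-pubsub docs for the full set of flags):

```sh
# Run enrich-pubsub from the published Docker image.
# config.hocon, resolver.json and enrichments/ are illustrative paths.
docker run \
  -v $PWD/config:/snowplow/config \
  snowplow/snowplow-enrich-pubsub:2.0.5 \
  --config /snowplow/config/config.hocon \
  --iglu-config /snowplow/config/resolver.json \
  --enrichments /snowplow/config/enrichments
```

Unlike the Beam version, there are no Dataflow pipeline options; everything (Pub/Sub subscriptions, topics, etc.) lives in the HOCON config file.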
Thank you for your reply.
We have our Snowplow pipeline on GCP. Currently we use Dataflow for the enricher and BQ loader, and we run the mutator/repeater as a jar on a Compute Engine machine.
For the upgrade we are moving from Dataflow to App Engine, so there will be an App Engine service for each of the enricher, BQ loader, and mutator/repeater.
So do you have any guidelines on the steps to follow to migrate all components from Dataflow to App Engine?
I don't believe App Engine is officially supported infrastructure, but if you were to head down this path I'd opt for the App Engine Flex runtime with a Dockerfile for each of these components, which would not be wildly dissimilar to containerising it on Kubernetes or individual virtual machines.
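As a minimal sketch of what that could look like for one component (using enrich-pubsub as the example; image tag, file names, and resource sizes are assumptions to adapt, and App Engine Flex specifics should be checked against the custom-runtime docs):

```
# Dockerfile - wrap the published image with your config baked in
FROM snowplow/snowplow-enrich-pubsub:2.0.5
COPY config.hocon /snowplow/config.hocon
COPY resolver.json /snowplow/resolver.json
CMD [ "--config", "/snowplow/config.hocon", \
      "--iglu-config", "/snowplow/resolver.json" ]
```

```yaml
# app.yaml - App Engine flexible environment with a custom runtime
runtime: custom
env: flex
service: enrich          # hypothetical service name
manual_scaling:
  instances: 1           # these are long-running consumers, not request handlers
resources:
  cpu: 2
  memory_gb: 4
```

One caveat worth weighing: these components are long-running Pub/Sub consumers rather than HTTP request handlers, which is somewhat against the grain of App Engine's request-driven model, so manual scaling (as above) and health-check configuration deserve attention.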