Hello,
I have a legacy Snowplow pipeline on version r75-long-legged-buzzard which runs in AWS, and it has not been upgraded since it was set up. The old version of EmrEtlRunner depends on some old Ruby gems which use TLSv1, and AWS will soon be moving to TLS 1.2.
Can somebody help guide me on how I should go about this upgrade?
I fear that since r75 is a very old version - just updating to latest could break the production pipeline.
On the side, we already have a Snowplow stream pipeline being set up, but it's not in production yet, and we want to stabilize this production pipeline against the AWS TLS upgrade.
Any help, resources and guidance will be appreciated.
You’re quite far behind - in the 7 years since r75 we’ve entirely deprecated batch enrichment. The good news is that the protocol between trackers and collector hasn’t changed at all.
I think your best bet is to separately set up a stream pipeline with the latest versions, and then migrate traffic over to it in phases, once you’ve verified it end to end.
We have several customers who have gone from very old open-source versions like yourself, to our managed pipelines on the latest version, and this is how they have been handled.
Once you have the latest pipeline set up, and are confident that it can scale (and/or are prepared to over-provision it for the switchover), then it’s relatively simple to change the endpoints across your tracking estate. Of course, downstream processes must switch over too, but the enriched event format and the structure of the data in the DB haven’t actually changed. We just have tooling to make things easier (e.g. automatic table creation for new schemas - assuming you’re using Redshift).
Yes - we are actively working on setting up the stream pipeline.
But I was looking to at least upgrade the legacy pipeline to a version which uses TLS v1.2.
I am currently trying to upgrade to r89, specifically because the elasticity gem is bumped to its latest version there (the old version was using TLS 1.1 to talk to the EMR API).
I have one more question on enrichments in general (pardon my silly questions - I have inherited the legacy Snowplow pipeline and am still trying to wing it). When upgrading, if new enrichments have been added, are they applied by default? And do they affect the final data structure?
Now, because you’re starting from so far behind that version, I’m not sure what other breaking changes there would have been between 75 and 89. However, we have always followed Semver with our releases - which means that if you go through each component (collector, emr-etl-runner, enrich, etc), and compare the versions, a breaking change should be indicated by a major version increase.
If there’s a major version update then any breaking changes should be called out in the release post for the new major version - they’re all categorised in discourse so they should be searchable there.
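To make the Semver check above a bit more concrete, here is a minimal Python sketch that compares the component versions of two releases and lists the ones worth reading release posts for. The function names and any version numbers you feed it are placeholders of mine, not taken from the actual r75 or r89 manifests. One extra assumption worth stating: the Semver spec treats 0.x versions as initial development where anything may change, so the sketch also flags a minor bump when the major version is 0.

```python
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH', ignoring a leading 'v' and any -pre/+build suffix."""
    core = version.strip().lstrip("v").split("-", 1)[0].split("+", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def is_potentially_breaking(old: str, new: str) -> bool:
    """True if going from old to new may include a breaking change under Semver."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v[0] != old_v[0]:
        # A major version increase signals a breaking change.
        return new_v[0] > old_v[0]
    if old_v[0] == 0:
        # Assumption: for 0.x components, treat a minor bump as potentially breaking.
        return new_v[1] > old_v[1]
    return False


def components_to_review(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Names of components present in both releases whose bump may be breaking."""
    return sorted(
        name
        for name in old
        if name in new and is_potentially_breaking(old[name], new[name])
    )
```

You would fill the two dictionaries by hand from the release notes of the version you are on and the version you are targeting, then search for the release posts of whatever `components_to_review` returns.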
I think because 88 → 89 was the first Spark release, it might be worth doing the upgrade in two steps if you’re struggling - to 88 first, then up to 89. At least then you can separate issues to do with previous releases from issues to do with the Spark release.
Each enrichment has its own configuration file, with an enabled boolean setting - examples can be found in this repo. I’m not sure what enrichments might have been enabled by default, but you can disable any that you don’t want this way before going live.
Enrichments can affect the final data structure, but usually this is via adding derived contexts. They normally don’t alter the existing data. The exceptions to this would be the PII enrichment and the IP anonymisation enrichment, which do alter existing values. These are disabled by default.
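If it helps to audit what is switched on before going live, here is a rough Python sketch that scans a directory of enrichment configuration files and reports the enabled flag for each. It assumes each file is a JSON document with the settings under a top-level `data` object containing `name` and `enabled`; please check that against your own files. The two names in `ALTERS_EXISTING_VALUES` are my guess at how the IP anonymisation and PII enrichments are named in their configs, so adjust them if yours differ.

```python
import json
from pathlib import Path
from typing import Dict, List

# Assumed config names for the two enrichments that alter existing values.
ALTERS_EXISTING_VALUES = {"anon_ip", "pii_enrichment_config"}


def summarise_enrichments(config_dir: str) -> List[Dict[str, object]]:
    """Return one row per *.json enrichment config found in config_dir."""
    rows: List[Dict[str, object]] = []
    for path in sorted(Path(config_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            document = json.load(handle)
        # Assumed layout: {"schema": "...", "data": {"name": ..., "enabled": ...}}
        data = document.get("data", {})
        name = data.get("name", path.stem)
        rows.append(
            {
                "file": path.name,
                "name": name,
                # A missing flag is reported as disabled here; verify it manually.
                "enabled": bool(data.get("enabled", False)),
                "alters_existing_values": name in ALTERS_EXISTING_VALUES,
            }
        )
    return rows


def enabled_and_altering(rows: List[Dict[str, object]]) -> List[str]:
    """Names of enabled enrichments that would change existing field values."""
    return [
        str(row["name"])
        for row in rows
        if row["enabled"] and row["alters_existing_values"]
    ]
```

Running `summarise_enrichments` on the enrichments directory you pass to the pipeline gives a quick list to review, and `enabled_and_altering` narrows that down to the ones that would change values your downstream queries already rely on.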