Hello,
I inherited a legacy Snowplow batch pipeline running with the Clojure Collector hosted on AWS Elastic Beanstalk. We recently started running into issues with the Clojure Collector because support for the underlying platform has ended. So, long story short, we need to move off the Clojure Collector, and fast. Unfortunately, setting up the stream pipeline is still in progress and it is not yet ready to handle production traffic.
So I wanted to understand what issues we might face if we used the deprecated CloudFront Collector to replace the Clojure Collector and keep the production pipelines afloat in the meantime.
Hi @Dhruvi, the CloudFront Collector is really just using the log rotation mechanism of CloudFront to capture GET requests sent to the /i (pixel) endpoint.
The main problem I could see is that the format of those logs has changed over time, and the last release that supported CloudFront logs as an ingestion format for the pipeline happened several years ago at this point. So EMR ETL Runner might simply not support the current CloudFront log format.
A better short-term option, if you have to maintain the current batch process in the old architecture, would be to:
1. Set up a Stream Collector
2. Save the raw data that comes out of the Collector to S3 with our S3 Loader in LZO format
This means you keep the existing enrichment + validation + loading stages you have set up but fix the collection point → the format of the Collector data has not changed in a very long time, so even the latest version should still be compatible.
The other options are, of course, to look at spinning up a pipeline with our quick-start examples or, if managing the pipeline is proving troublesome, to go for a hosted option.
I forgot to mention that we have the R90 version of the batch pipeline running in production. That said, I have specifically picked the pixel file from the same version. Could the CloudFront log format still be an issue, given all are from the same version?
I have tried this approach up until getting the raw data to the S3 bucket with the S3 Loader (GZIP format; getting some binary data), but I didn't see any option in EMR ETL Runner to process the LZO format.
> I forgot to mention that we have the R90 version of the batch pipeline running in production. That said, I have specifically picked the pixel file from the same version. Could the CloudFront log format still be an issue, given all are from the same version?
CloudFront is an AWS service and it has evolved over time; the log format it outputs is outside of our control. It could still work, or it could not. You would need to spin up a pipeline with that as the source to validate it!
Note: if you go down this route you cannot use POST requests (only GET requests to /i will work).
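For illustration, every event would need to arrive as a GET against the pixel endpoint, along these lines (the hostname and parameters here are made up):

```
GET https://d1234abcd.cloudfront.net/i?e=pv&p=web&url=https%3A%2F%2Fexample.com%2F
```

POST payloads (e.g. trackers batching events to /com.snowplowanalytics.snowplow/tp2) would not be captured.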
> I have tried this approach up until getting the raw data to the S3 bucket with the S3 Loader (GZIP format; getting some binary data), but I didn't see any option in EMR ETL Runner to process the LZO format.
IIRC both should work. The LZO format yields better performance on EMR, as you can distribute the load more evenly across the cluster. LZO vs GZIP is really just a detail for EMR; the two settings in the config that matter are sketched below.
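As a rough sketch of the relevant config.yml fragment (the exact layout can vary by release, and the bucket path below is a placeholder):

```yaml
aws:
  s3:
    buckets:
      enriched:
        # Placeholder path: point this at wherever your S3 Loader
        # is writing the raw LZO collector payloads
        stream: s3://my-raw-bucket/collector-payloads/

collectors:
  # Thrift is the serialization format the Stream Collector emits
  format: thrift
```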
Here, aws.s3.buckets.enriched.stream is the path where your S3 Loader is dumping the raw LZO collector payloads for processing, and collectors.format is set to thrift (which is the output format from the Collector).