Raw events will be collected and stored on S3 for some time, until the analytics processing module is ready.
So for now I simply want to make sure those events can be read with Java and that they contain all the data we need.
My unarchived log file consists of the following sections:
Thus, you would probably want to have the data enriched before applying analytics.
You might be interested in implementing the Lambda architecture (or at least part of it), as described in How to setup a Lambda architecture for Snowplow. You could deploy EmrEtlRunner and run it with the --skip shred option to produce just the enriched events.
Hi @grzegorzewald. It looks like your code works with base64 records and the Thrift part of recovery.py is unused.
I'm working on decoding successful records, not failed ones, so the format is different. Thank you for the code anyway.
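For context, a minimal Java equivalent of the base64 handling in recovery.py might look like the sketch below. The record string is hypothetical, just to illustrate why that path only applies to the base64-wrapped failed records and not to our raw ones:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64RecordCheck {
    public static void main(String[] args) {
        // Hypothetical base64-wrapped record, one per line in the source file.
        String line = "aGVsbG8gd29ybGQ="; // placeholder, not a real Snowplow record

        // Decode the base64 wrapper to get at the underlying bytes.
        byte[] raw = Base64.getDecoder().decode(line.trim());

        // Failed records may decode to readable text; raw collector payloads
        // would be binary Thrift instead, so this alone is not enough for them.
        System.out.println(new String(raw, StandardCharsets.UTF_8));
    }
}
```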
Hi @ihor,
Thank you for the previous reply. Snowplow has rich functionality for event enrichment and storage, which is why we chose it. At this stage of the project we don't have enough time to put effort into event enrichment and storage, so we decided to go with raw events and the Kinesis-S3 sink.
Could you please take a look at the format of the file I downloaded from S3?
It doesn't look like it contains Thrift records, but rather UTF-8 strings with some byte delimiters.
I've tried to read it with Elephant Bird in this repo, without success. Could you please take a look at this code too?
Hi @vshulga - trying to parse a Snowplow raw event is a solved problem - this is precisely what our Scala Common Enrich library does, and this library is embedded into both our Hadoop Enrich and Stream Enrich applications. Given that you have your raw events in S3 already, I would recommend going with Hadoop Enrich.
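If you still want to sanity-check a single raw record in Java before moving to Hadoop Enrich, a minimal sketch could look like the following. It assumes the record bytes are a Thrift-serialized CollectorPayload (the collector-payload-1 schema) and that the generated Java class is on your classpath; the package name and getters shown are assumptions, so verify them against your generated sources:

```java
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;

// Assumed generated class from the collector-payload-1 Thrift schema;
// check the package/class name against your own generated code.
import com.snowplowanalytics.snowplow.CollectorPayload.thrift.model1.CollectorPayload;

public class RawEventCheck {

    /** Deserializes one record, assuming it is a binary Thrift CollectorPayload. */
    public static CollectorPayload parse(byte[] recordBytes) throws TException {
        CollectorPayload payload = new CollectorPayload();
        TDeserializer deserializer = new TDeserializer(new TBinaryProtocol.Factory());
        deserializer.deserialize(payload, recordBytes);
        return payload;
    }

    public static void main(String[] args) throws Exception {
        // For illustration: one already-extracted record per file.
        byte[] recordBytes = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args[0]));
        CollectorPayload payload = parse(recordBytes);

        // Getter names are assumptions based on the schema fields; adjust as needed.
        System.out.println("collector:   " + payload.getCollector());
        System.out.println("timestamp:   " + payload.getTimestamp());
        System.out.println("path:        " + payload.getPath());
        System.out.println("querystring: " + payload.getQuerystring());
    }
}
```

Note that this only covers a single extracted Thrift record; unpacking the container format the Kinesis-S3 sink writes (the Elephant Bird layer mentioned above) is a separate step, which is part of why Hadoop Enrich is the easier route.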