We are trying to set up a minimalist Snowplow collection pipeline for an internal analytics system.
We already have a system that can take unstructured JSON lines and process them.
What we would like is to set up a simple Track -> Collect -> Parse process.
We have set up the track and collect stages (using the CloudFront collector), and we are trying to understand the simplest way to convert the logs from the CloudFront format into something more manageable.
Which of the Snowplow solutions is the most minimalist one for a “read CloudFront, write JSON” style operation?
One approach you can take is to use Spark on EMR to analyze the enriched data. We provide the Scala Analytics SDK and Python Analytics SDK, which take the data in S3 and transform it into easy-to-work-with JSON, with the different context and event fields as nested properties, so you don’t need to join different tables the way you would in Redshift.
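As a rough illustration of the kind of output the SDKs produce, here is a hand-rolled sketch over a hypothetical subset of fields. The real SDKs handle the full enriched-event schema, proper typing, and the embedded self-describing JSONs, so treat this only as a picture of the shape of the result:

```python
import json

# Illustrative subset of the many tab-separated columns in a Snowplow
# enriched event. These names follow the canonical event model, but the
# selection here is an assumption for the example, not the full schema.
FIELDS = ["app_id", "platform", "collector_tstamp", "event", "user_id", "page_url"]

def enriched_tsv_to_json(line):
    """Turn one enriched-event TSV line into a JSON string,
    dropping empty columns (the SDKs do something similar)."""
    values = line.rstrip("\n").split("\t")
    record = {name: value for name, value in zip(FIELDS, values) if value != ""}
    return json.dumps(record)

example = "web-shop\tweb\t2017-01-01 12:00:00\tpage_view\tuser-42\thttps://example.com/"
print(enriched_tsv_to_json(example))
```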
A tutorial to analyse your Snowplow data in Spark on EMR with Zeppelin can be found here.
Further to using Spark in the Snowplow pipeline, our approach is to use our EventTransformer function, which automatically takes your Snowplow data (including the embedded JSONs) and turns it into a clean JSON format that is then straightforward to convert into a table.
Our intention is to create a standalone EventTransformer function and maintain it as our data structure evolves, so that any Hadoop-based downstream process built on it will continue to work as that data structure evolves. This is much more elegant than manually creating tables in Hive. You can see it in action in this blog post: http://snowplowanalytics.com/blog/2015/12/02/data-modeling-in-spark-exploring-spark-sql/. See the first section in particular (“Loading Snowplow data into Spark”).
If you adopt this approach, spin up the EMR cluster with the --skip shred option passed to EmrEtlRunner, as you do not need shredded entities. You also do not need StorageLoader.
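For reference, the invocation might look something like this (the exact flags and paths depend on your EmrEtlRunner version and configuration, so treat this as a sketch, not a canonical command):

```shell
./snowplow-emr-etl-runner run \
  --config config/config.yml \
  --resolver config/iglu_resolver.json \
  --skip shred
```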
You can refer to this diagram to visualize the steps I’m referring to.
Yet another approach is to use Athena (instead of Spark). Here’s the link to a just-published tutorial: Using AWS Athena to query the ‘good’ bucket on S3.
I may not have been clear. My goal here is not to analyse the data but to prepare it for an internal system that expects files of JSON lines. I would just like to convert a CloudFront log file following Snowplow conventions (analytics parameters in the URI query string, etc.) into a JSON-per-line file containing the good events, and I am trying to find the pipeline that is easiest to set up and maintain for this.
You won’t get the “raw” events in JSON format. Here’s the documentation on the format of the events from the Cloudfront collector: https://github.com/snowplow/snowplow/wiki/Collector-logging-formats#the-cloudfront-logging-format-with-cloudfront-naming-convention. In fact, it’s a format imposed by Amazon: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html.
If your “Parse process” can work with that format, then you don’t really need anything beyond what you already described: Tracker => CloudFront collector => S3 (“raw”) bucket => Parse process
I added S3 to the diagram because the collector has log rotation enabled, pushing the files to S3 on an hourly (or so) basis.
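If the parse process ends up doing the conversion itself, the raw CloudFront format is straightforward to handle: tab-separated lines preceded by `#Version:` and `#Fields:` headers, with the Snowplow tracker payload URL-encoded in the cs-uri-query column. Here is a minimal sketch (the sample line and its column set are invented for illustration; real logs have more columns, and CloudFront’s extra encoding passes may need additional decoding):

```python
import json
from urllib.parse import parse_qsl, unquote

def parse_cloudfront_log(lines):
    """Yield one dict per request line of a CloudFront access log.

    Column names are read from the '#Fields:' header that CloudFront
    writes at the top of each log file. The Snowplow analytics
    parameters (e, p, tv, ...) are decoded out of cs-uri-query.
    """
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#"):
            row = dict(zip(fields, line.split("\t")))
            query = row.get("cs-uri-query", "")
            if query and query != "-":
                row["params"] = dict(parse_qsl(unquote(query)))
            yield row

# Hypothetical sample with a reduced column set, for illustration only.
sample = [
    "#Version: 1.0",
    "#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query",
    "2016-01-01\t12:00:00\tFRA2\t480\t192.0.2.1\tGET\td1.cloudfront.net\t/i\t200\t-\tMozilla%2F5.0\te=pv&p=web&tv=js-2.5.0",
]
for event in parse_cloudfront_log(sample):
    print(json.dumps(event))  # one JSON object per log line
```

Note that this gives you every request, not just the “good” events; validating and enriching the payloads is exactly the part of the pipeline that Snowplow’s enrichment step otherwise does for you.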