Hi @noamgat,
The approach you can take is to use Spark on EMR to analyze the enriched data. We provide the Scala Analytics SDK and Python Analytics SDK, which take the enriched data in S3 and transform it into easy-to-work-with JSON, with the different context and event fields as nested properties, so you don’t need to join separate tables the way you would in Redshift.
A tutorial on analyzing your Snowplow data in Spark on EMR with Zeppelin can be found here.
As for how Spark is used in the Snowplow pipeline, our approach is to use our EventTransformer function, which should automatically take your Snowplow enriched data (including the embedded JSONs) and turn it into a nice JSON format that is then straightforward to convert into a table.
Our intention is to create a standalone EventTransformer function and maintain it as our data structure evolves, so that any Hadoop-based downstream process that starts with it will continue to work through those changes. This should be much more elegant than manually creating tables in Hive. You can see it in action in this blog post: http://snowplowanalytics.com/blog/2015/12/02/data-modeling-in-spark-exploring-spark-sql/. See the first section in particular (Loading Snowplow data into Spark).
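To give a feel for the pattern described in that post, here is a rough sketch of loading enriched events into a Spark DataFrame with the EventTransformer. Treat the details as assumptions: the S3 path is a placeholder, sc is the SparkContext that Zeppelin or spark-shell provides, the import path matches the Scala Analytics SDK at the time of writing (check the SDK README for your version), and the isSuccess/toOption handling assumes the Validation-style return type of the early SDK releases (later releases return an Either, where isRight/toOption are the equivalents).

```scala
import org.apache.spark.sql.SQLContext
import com.snowplowanalytics.snowplow.analytics.scalasdk.json.EventTransformer

val sqlContext = new SQLContext(sc)  // sc: the SparkContext provided by Zeppelin / spark-shell

// Placeholder: the location of your enriched good events in S3
val inDir = "s3://my-snowplow-bucket/enriched/good/*"

val events = sc.textFile(inDir)
  .map(line => EventTransformer.transform(line))  // enriched TSV line -> JSON string (or a failure)
  .filter(_.isSuccess)                            // drop lines that failed to transform
  .flatMap(_.toOption)

// Spark infers the schema, with contexts and self-describing events as nested properties
val df = sqlContext.read.json(events)
df.registerTempTable("events")
```

From there you can query the nested properties directly with Spark SQL, for example sqlContext.sql("SELECT page_urlpath, geo_country FROM events"), or drill into a context field with dot notation, rather than joining shredded tables as you would in Redshift.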
If you adopt this approach, you would spin up the EMR cluster with the --skip shred option passed to EmrEtlRunner, since you do not need the shredded entities. You also do not need to run StorageLoader.
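For illustration only, since the exact flags and file locations depend on your EmrEtlRunner release and setup, the invocation would look something like:

```
./snowplow-emr-etl-runner \
  --config config/config.yml \
  --resolver config/iglu_resolver.json \
  --skip shred
```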
You can refer to this diagram to visualize the steps involved.