Minimal Enrich Setup?

ihor · June 26, 2017, 8:29pm

The approach you can take is to use Spark on EMR to analyze the enriched data. We provide the Scala Analytics SDK and Python Analytics SDK that take the data in S3 and transform it into an easy to work with JSON, with the different context and event fields as nested properties so you don’t need to join different tables the way you would do in Redshift.

A tutorial to analyse your Snowplow data in Spark on EMR with Zeppelin can be found here.

Further to the Spark utilization in Snowplow pipeline, our approach is to use our EventTransformer function, which should automatically take your Snowplow data (including the embedded JSONs) and turn it into a nice JSON format that is then straightforward to convert into a table.

Our intention is to create a standalone EventTransformer function and maintain it as we evolve our data structure so that any Hadoop based downstream process that starts with it will continue to work as that data structure evolves. This should be much more elegant than manually creating tables in Hive. You can see it in action in this blog post: http://snowplowanalytics.com/blog/2015/12/02/data-modeling-in-spark-exploring-spark-sql/. See the first section in particular (Loading Snowplow data into Spark.)

If you adopt this approach, then you spin the EMR cluster with the option --skip shred to EmrEtlRunner as you do not need shredded entities. Also, you do not need to use StorageLoader.

You can refer to this diagram to visualize the steps I’m referring to.

Topic		Replies	Views
Understanding my options for Transforming/Loading For engineers	4	746	October 19, 2022
Best (enrichment) steps to take with old implementation Enrichment	3	1247	June 7, 2018
Collector -> S3 loader Collectors	3	1476	June 7, 2020
Snowplow Analyticss on Azure For engineers	0	1276	June 15, 2018
AWS Athena as an alternative data store For engineers	0	1687	January 11, 2017

Minimal Enrich Setup?

Related topics