Best (enrichment) steps to take with old implementation

BartPersoons · May 29, 2018, 9:06am

Hello all,

I currently have access to an S3 bucket containing raw data collected since the beginning of 2017 (with tracker version 2.6.2). The data has been collected but never processed. I want to focus on enriching the data (no shredding yet) to see what its quality is. Because I don’t have much experience with the enrichment step, I was wondering what the best steps to take are (based on https://github.com/snowplow/snowplow/wiki/setting-up-EmrEtlRunner):

1. Installing EmrEtlRunner. Does it matter which version from http://dl.bintray.com/snowplow/snowplow-generic/ I use?
2. Setting up the YAML file. I can use the sample file (https://github.com/snowplow/snowplow/blob/master/3-enrich/emr-etl-runner/config/config.yml.sample) as input, but does the content of this file depend on the EmrEtlRunner version?

Any other best practices or tips are welcome.

Greetings,
Bart

kazgurs1 · June 4, 2018, 1:45pm

Your intentions are a bit vague here, but:

1. I doubt the EmrEtlRunner version depends on the tracker version.
2. Yes. The EMR AMI version and the software versions in the config depend directly on the EmrEtlRunner version you are running. Each version tag on GitHub has its own example config at 3-enrich/emr-etl-runner/config/config.yml.sample

mike · June 4, 2018, 11:23pm

The enrichment process is independent of the tracking, so you shouldn’t have any issues running EmrEtlRunner on your existing data. It’s worthwhile using the latest version of Spark Enrich, and probably running on a small subset of the data first (such as a single day) rather than the entire period.
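To make the "single day first" idea concrete, here is a minimal Python sketch that picks one day's worth of raw log files out of a list of S3 keys. The key names below are hypothetical (your collector may name files differently), and the sketch simply assumes the ISO date appears somewhere in each key:

```python
from datetime import date


def keys_for_day(keys, day):
    """Return the keys whose name contains the ISO date (YYYY-MM-DD) of `day`."""
    stamp = day.isoformat()
    return [k for k in keys if stamp in k]


# Hypothetical key names, for illustration only.
sample_keys = [
    "raw/E123.2017-03-01-10.aaaa.gz",
    "raw/E123.2017-03-01-11.bbbb.gz",
    "raw/E123.2017-03-02-00.cccc.gz",
]

one_day = keys_for_day(sample_keys, date(2017, 3, 1))
print(one_day)
```

The selected files could then be copied to a separate processing location so that the first run only touches that day.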

In the EmrEtlRunner command you can set up the option to skip shredding (and other steps) so that you can just archive the data on S3.
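As a rough sketch of what that could look like, the Python snippet below only assembles the command line; it does not run anything. The `run` subcommand, the `--config`, `--resolver` and `--skip` options, the step name `shred`, and the file names are assumptions based on recent EmrEtlRunner releases, so check them against the `--help` output of the version you install (older releases did not use the `run` subcommand).

```python
import shlex


def build_runner_command(config_path, resolver_path, skip_steps=(),
                         binary="./snowplow-emr-etl-runner"):
    """Assemble an EmrEtlRunner invocation as an argument list.

    Option names are assumptions; verify them for your EmrEtlRunner version.
    """
    cmd = [binary, "run", "--config", config_path, "--resolver", resolver_path]
    if skip_steps:
        # Skipped steps are passed as one comma-separated value.
        cmd += ["--skip", ",".join(skip_steps)]
    return cmd


# Hypothetical file names; "shred" is the step to skip in this thread's scenario.
cmd = build_runner_command("config.yml", "iglu_resolver.json", skip_steps=["shred"])
print(" ".join(shlex.quote(part) for part in cmd))
```

Depending on the version, steps that follow shredding (such as loading into a database) may need to be skipped as well.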

BartPersoons · June 7, 2018, 2:13pm

I managed to get the ETL job up and running for a day of data without shredding. Thanks @kazgurs1 & @mike for your help!
