Could someone explain this to me in more detail?
https://groups.google.com/forum/#!searchin/snowplow-user/Tobias$20/snowplow-user/qVqjNTDkuS4/uN4Tv3X6IQAJ
Hi Tobias,
Sorry about the cutoff in my original answer! It was probably some sort of copy and paste error.
The idea is: first, run EmrEtlRunner with the --skip elasticsearch option. This skips the Elasticsearch step entirely, leaving your bad rows in S3.
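For reference, the first pass might look something like this; the runner path and config filename are just placeholders for wherever yours actually live:

```bash
# First pass: run the full pipeline but skip the Elasticsearch load,
# so bad rows stay in the S3 bad buckets.
./snowplow-emr-etl-runner \
  --config config/config.yml \
  --skip elasticsearch
```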
Then identify the bad rows bucket(s) you want to load into Elasticsearch and alter your configuration file to use those buckets as sources for the Elasticsearch step:
sources: ["s3://out/enriched/bad/run=2015-01-01-00-00-00", "s3://out/shred/bad/run=2015-01-01-00-00-00"]
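If it helps, here is the same setting spelled out as a YAML list. Where exactly the sources key sits depends on your EmrEtlRunner version and config layout, so treat everything except the list itself as illustrative:

```yaml
# Illustrative placement only: the bucket paths come from the run folders
# you identified above; adjust them to your own run IDs.
sources:
  - "s3://out/enriched/bad/run=2015-01-01-00-00-00"
  - "s3://out/shred/bad/run=2015-01-01-00-00-00"
```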
Then run EmrEtlRunner again, skipping every step except the Elasticsearch step, using --skip staging,s3distcp,enrich,shred,archive_raw.
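And the second pass, again with placeholder paths:

```bash
# Second pass: run only the Elasticsearch step by skipping everything else.
./snowplow-emr-etl-runner \
  --config config/config.yml \
  --skip staging,s3distcp,enrich,shred,archive_raw
```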
Splitting the job in two like this prevents Elasticsearch timeouts from causing the whole job to be reported as failing.
Hope that helps,
Fred