EMR job writes empty files in enriched.bad and shredded.bad buckets

tyomo4ka · March 31, 2017, 3:10am

I noticed some bad records from the real-time pipeline in Elasticsearch. However, when I looked into the batch processing pipeline, I noticed that the EMR job just writes empty files to the enriched.bad and shredded.bad buckets. It looks like this:

Any idea why it might happen?

alex · March 31, 2017, 9:05am

Hey @tyomo4ka - to be getting bads in your RT pipeline but not in batch suggests some difference between the two pipelines in terms of event validation or enrichment.

What messages are you seeing in the bads in your RT pipeline in Elasticsearch?

tyomo4ka · April 3, 2017, 1:09am

Hi @alex - I see messages like this in the RT pipeline’s bad index in Elasticsearch:

{
  "level": "error",
  "message": "error: instance type (string) does not match any allowed primitive type (allowed: [\"integer\"])\n    level: \"error\"\n    schema: {\"loadingURI\":\"#\",\"pointer\":\"/properties/age\"}\n    instance: {\"pointer\":\"/age\"}\n    domain: \"validation\"\n    keyword: \"type\"\n    found: \"string\"\n    expected: [\"integer\"]\n"
}

It’s a pretty obvious issue. I’d expect EmrEtlRunner not to enrich and shred this data, since it doesn’t match the schema.

My problem is that instead of bad events in the enriched.bad bucket in S3, I get these empty files.

P.S.: I also have some empty files in shredded.bad. I guess it might be related to the same issue.
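For anyone following along, the kind of failure in the error message quoted above can be illustrated with a small sketch. This is not the validator Snowplow actually runs. It is a hand-rolled, stdlib-only type check, and the `SCHEMA` and sample events are hypothetical, based only on the `/properties/age` pointer and the `integer` type named in the error:

```python
# Minimal illustration of a JSON Schema "type" check.
# NOT the real Snowplow validator; schema and events are hypothetical.

def json_type(value):
    """Map a Python value to its JSON primitive type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    return type(value).__name__


def type_errors(instance, schema):
    """Return one message per property whose value has the wrong type."""
    errors = []
    for name, rules in schema.get("properties", {}).items():
        expected = rules.get("type")
        if name not in instance or expected is None:
            continue
        found = json_type(instance[name])
        # An integer is also acceptable where "number" is expected.
        if found != expected and not (expected == "number" and found == "integer"):
            errors.append(f"/{name}: found {found}, expected {expected}")
    return errors


# Assumed shape of the relevant part of the schema.
SCHEMA = {"properties": {"age": {"type": "integer"}}}

print(type_errors({"age": "42"}, SCHEMA))  # age sent as a string
print(type_errors({"age": 42}, SCHEMA))    # age sent as an integer
```

An event that sends `age` as a string fails this check, which is the same mismatch the RT pipeline reports.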

alex · April 3, 2017, 12:44pm

Hi @tyomo4ka - that’s odd. Empty files mean no events failed validation in that run. Have you checked all the run folders?
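One way to check every run folder at once is to scan an object listing for anything with a non-zero size. The sketch below is only an illustration: it assumes run folders are named `run=...`, and the keys and sizes in `listing` are made up. Each entry uses the `Key`/`Size` shape that an S3 listing returns.

```python
from collections import defaultdict


def non_empty_by_run(objects):
    """Group the keys of non-empty objects by their run=... folder."""
    result = defaultdict(list)
    for obj in objects:
        if obj["Size"] == 0:
            continue  # skip empty part files
        parts = obj["Key"].split("/")
        # Assumption: the run folder is the path segment starting with "run=".
        run = next((p for p in parts if p.startswith("run=")), "(no run folder)")
        result[run].append(obj["Key"])
    return dict(result)


# Hypothetical listing of an enriched.bad location.
listing = [
    {"Key": "enriched/bad/run=2017-04-02-01-00-00/part-00000", "Size": 0},
    {"Key": "enriched/bad/run=2017-04-02-01-00-00/part-00001", "Size": 0},
    {"Key": "enriched/bad/run=2017-04-03-01-00-00/part-00000", "Size": 512},
]

for run, keys in non_empty_by_run(listing).items():
    print(run, keys)
```

If you use boto3, the entries under `Contents` in a `list_objects_v2` response already have this shape, so they could be passed in directly. An empty result would mean every file in every run folder is empty.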

tyomo4ka · April 10, 2017, 11:56pm

Hi @alex!

Yeah, I did check all the run folders. I was unable to find any non-empty files in the enriched.bad and shredded.bad folders.

In the shredded.good folder I found correctly shredded data, and I can also see correct data in Redshift. The only problem is those empty files.

I use self-describing events, just in case that matters.
