Trouble sending bad rows to amazon elasticsearch service (EsHadoopInvalidRequest)

rbolkey · December 2, 2016, 10:50pm

Hi all!

I’m having trouble sending bad rows to elasticsearch from the emr etl runner.

I’m using r83, and the closest to the root cause that I can find is this:

cascading.tuple.TupleException: unable to sink into output identifier: snowplow/bad_rows
at cascading.tuple.TupleEntrySchemeCollector.collect(TupleEntrySchemeCollector.java:160)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:145)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:95)
at cascading.tuple.TupleEntrySchemeCollector.add(TupleEntrySchemeCollector.java:134)
at cascading.flow.stream.SinkStage.receive(SinkStage.java:90)
at cascading.flow.stream.SinkStage.receive(SinkStage.java:37)
at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:80)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:145)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:133)
at com.twitter.scalding.MapFunction.operate(Operations.scala:59)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:39)
at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:172)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:166)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: null
null
at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:427)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:385)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:363)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:367)
at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:121)
at org.elasticsearch.hadoop.rest.RestClient.esVersion(RestClient.java:513)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:177)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:378)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:149)
at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:867)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:619)
at cascading.tap.hadoop.util.MeasuredOutputCollector.collect(MeasuredOutputCollector.java:69)
at org.elasticsearch.hadoop.cascading.EsHadoopScheme.sink(EsHadoopScheme.java:205)
at cascading.tuple.TupleEntrySchemeCollector.collect(TupleEntrySchemeCollector.java:153)
… 21 more

which I found in the EMR logs for the job here:

s3://…/snowplow/logs/j-25SOPSJS5RE05/containers/application_1480715933846_0001/container_1480715933846_0001_01_000001/syslog.gz

I’ve added the EMR role and EC2 instance profile role IAM roles to the access policy for the elasticsearch service thinking it may be permissions, but it’s a rather cryptic error that hopefully others have seen.

I’ve manually created the index (snowplow), but haven’t found any information about the type mapping, so I’m assuming it’s auto created?

Thanks
rick

Timmycarbone · July 3, 2017, 1:25pm

Hey!

I’m wondering if you found the problem with this issue?
It started to happen to me a couple of days ago and I can’t figure out what’s up. My configuration didn’t change.

I don’t think it’s a permission issue. I’m using an Nginx proxy behind an Elastic IP that is whitelisted. Hitting the proxy on port 8080.

Any news on this?

Thanks
Tim

rbolkey · July 3, 2017, 9:08pm

I gave up. And it’s much easier to use Athena to look at the bad records (there are guides in this forum to walk you through.

kazgurs1 · July 11, 2017, 7:27am

I remember having a similar issue with self-hosted ES, do you have es.nodes.wan.only: set to true in our hadoop es config?

Timmycarbone · August 1, 2017, 1:37pm

It’s set to true yes.

Anyway, I went with Athena too in the end. Here’s the reference for whoever lands here:

Topic		Replies	Views
Sending bad rows to Elasticsearch AWS batch pipeline (Legacy)	2	1413	April 28, 2017
Debugging bad rows in Athena [tutorial] For data modelers & consumers	4	11213	January 22, 2017
Elasticsearch load failed: "java.io.IOException: Not a file: s3://..." Storage targets	1	2543	June 10, 2016
Second job for importing bad rows Troubleshooting	1	1466	June 9, 2016
Process bad rows from Elasticsearch and form them into good rows Troubleshooting	5	3229	May 16, 2017

Trouble sending bad rows to amazon elasticsearch service (EsHadoopInvalidRequest)

Related topics