I am attempting to write bad rows from Hadoop to Elasticsearch, as suggested in https://github.com/snowplow/snowplow/wiki/Common-configuration#elasticsearch. I am not using Amazon Elasticsearch Service but a self-hosted ES cluster on EC2, and I can write to it successfully from that instance via curl. However, I am again running into a proxy issue when trying to write to it from inside the EMR cluster and step. (I assume it's a proxy issue: the step never completes and just hangs, and every other step of the Snowplow pipeline has had to go through a proxy for me.)
My question is: at what point exactly does the EMR runner make the HTTP request to Elasticsearch to index these bad records? I'm trying to pinpoint it so I can test adding the proxy, but it's not clear to me from either the runner lib or the Scala source.
The EmrEtlRunner doesn't communicate with Elasticsearch directly. Instead, for each bad rows bucket, it adds a step to the EMR jobflow to index that bucket, so it's the Hadoop cluster that talks to Elasticsearch. You can see these steps in the EMR console; they run after the enrich and shred steps have finished.
I thought that might be the case going in, and added an EMR bootstrap action up front in config.yml to run a short shell script that sets those proxy environment variables. But the Elasticsearch sink steps still hang for 20 minutes or more on a very small dataset, so I assume a connection error somewhere between EMR and Elasticsearch persists. Is there a point in the Hadoop config where that proxy should be set? Or, at the least, is there a way to add some debugging to those job steps to get some feedback while they hang?
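For reference, the wiring looks roughly like this in the emr section of config.yml (the bucket and script name are placeholders); the script itself just exports http_proxy, https_proxy and no_proxy on each node:

```yaml
aws:
  emr:
    # Placeholder path - the referenced script exports http_proxy/https_proxy/no_proxy
    # on every cluster node before any jobflow steps run.
    bootstrap:
      - "s3://my-snowplow-assets/set-proxy.sh"
```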
I'm currently trying to add the following to the additional_info field of the emr section of config.yml, so that it ends up in the Configurations section of the cluster:
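Roughly what I'm aiming for is EMR's standard Configurations format, using the hadoop-env classification with a nested export classification (shown here in YAML form; the proxy host, port and no_proxy list are just placeholders):

```yaml
- Classification: hadoop-env
  Configurations:
    - Classification: export
      Properties:
        HTTP_PROXY: "http://my-proxy.internal:3128"
        HTTPS_PROXY: "http://my-proxy.internal:3128"
        NO_PROXY: "localhost,169.254.169.254"
```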
If your Elasticsearch cluster is behind a proxy, can you just update the host and port in the config file to point at that proxy?
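Something along these lines in the targets section of config.yml, i.e. substituting the proxy's endpoint for the Elasticsearch node's (hostname and port below are just examples):

```yaml
targets:
  - name: "Our Elasticsearch cluster"
    type: elasticsearch
    host: "my-proxy.internal"   # proxy endpoint instead of the ES node itself
    port: 3128                  # proxy port
    database: snowplow          # the Elasticsearch index to write bad rows to
    table: bad_rows             # the Elasticsearch type
```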
You could try setting es_nodes_wan_only to true in the configuration file - I know you said you weren’t using Amazon Elasticsearch Service but it’s worth a try.
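That is, in the same Elasticsearch entry under targets:

```yaml
# Tells the Elasticsearch connector to talk only to the configured host
# and skip cluster node discovery - often necessary behind NAT or a proxy.
es_nodes_wan_only: true
```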
Have you looked at the cluster logs? There may be useful error information printed, especially in the syslog files.
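The step logs land in the log bucket set in config.yml, one folder per cluster and step (bucket name below is a placeholder):

```yaml
aws:
  s3:
    buckets:
      # EMR writes <cluster-id>/steps/<step-id>/{controller,stderr,stdout,syslog} here
      log: s3://my-snowplow-emr-logs
```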