We’re using the EmrEtlRunner with Spark Enrich, and have just started seeing errors in the enrich step. We’ve been using this same approach for quite a while; we recently had a case where the Shred step failed, but we were able to diagnose that as memory-related and fix it by changing the resources allocated to the cluster. This issue looks different, and I’m having difficulty diagnosing it. The fatal error on the enrich step is this:
Exception in thread "main" org.apache.spark.SparkException: Application application_1559790052086_0002 finished with failed status
Digging further into the S3 log bucket, I was able to find more detailed logs among the application’s container logs (e.g. “j-1UVX9U7OCPUBX/containers/application_1559790052086_0002/container_1559790052086_0002_01_000130/stderr” in my case). That stderr file contains several distinct stack traces. Several look like this:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /local/snowplow/enriched-events/_temporary/0/_temporary/attempt_20190606033148_0001_m_000151_0/part-00151-a20fa6a3-cf3f-4943-b528-bb000f968428-c000.csv could only be replicated to 0 nodes instead of minReplication (=1). There are 4 datanode(s) running and no node(s) are excluded in this operation.
I was able to find some discussion of errors similar to this, but the cases I found seemed to be resolved by updating the EMR AMI, and even then only to a version older than the one we’re already running (we’re on 5.9.0, which matches the example config provided by Snowplow here: https://github.com/snowplow/snowplow/blob/master/3-enrich/emr-etl-runner/config/config.yml.sample#L30).
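For reference, that AMI version comes straight from our config.yml; the relevant fragment (with the rest of the emr section trimmed out here) is just:

aws:
  emr:
    ami_version: 5.9.0   # same value as the linked line of the sample config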
And several like this one, for which I’ve had trouble finding any other discussion that references similar errors:
com.univocity.parsers.common.TextWritingException: Error writing row.
Internal state when error was thrown: recordCount=50, recordData=[{redacted actual data here for now, but it's a csv row}]
Finally, an error like this, which I’m also not finding much discussion on:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /local/snowplow/enriched-events/_temporary/0/_temporary/attempt_20190606033151_0001_m_000174_0/part-00174-a20fa6a3-cf3f-4943-b528-bb000f968428-c000.csv (inode 17358): File does not exist. Holder DFSClient_NONMAPREDUCE_-48490931_116 does not have any open files.
Our current cluster is an r4.8xlarge master, 4x r4.8xlarge core nodes, and 0 task nodes. Does anyone have advice on how to diagnose these errors, or suggestions for other information I can provide to help clarify the problem?
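In case it helps, the jobflow fragment of our config.yml that defines that cluster looks roughly like this (key names as in the sample config linked above; EBS, spot-bid, and other settings left out for brevity):

aws:
  emr:
    # ... ami_version and the other emr settings sit here ...
    jobflow:
      master_instance_type: r4.8xlarge   # 1 master node
      core_instance_count: 4
      core_instance_type: r4.8xlarge     # 4 core nodes
      task_instance_count: 0             # no task nodes

Happy to paste more of the config (with bucket names redacted) if that would help.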