I’m trying to get Event Recovery working following the release last month.
However when I follow the steps in the docs I’m unable to get the step to execute on EMR.
If I supply the MainClass (as shown in the docs) I get the error: Unexpected argument: com.snowplowanalytics.snowplow.event.recovery.Main
If I don’t supply that, I get the error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
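In case it helps, here's roughly how I believe the step should be submitted. This is just my understanding (cluster id, bucket, and jar path are placeholders, and any job-specific arguments would follow the jar path): going through command-runner.jar and spark-submit puts Spark on the classpath, and the main class is passed with --class rather than the step's MainClass field. Is that the right approach?

```shell
# Sketch only: cluster id and s3 paths below are placeholders.
# command-runner.jar + spark-submit provides the Spark classpath
# (avoiding NoClassDefFoundError: org/apache/spark/SparkConf),
# and --class supplies the main class to spark-submit itself.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=CUSTOM_JAR,Name="Event Recovery",ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[spark-submit,--deploy-mode,cluster,--class,com.snowplowanalytics.snowplow.event.recovery.Main,s3://my-bucket/snowplow-event-recovery-spark.jar]'
# (append any job-specific recovery arguments after the jar path in Args)
```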
The cluster was created with the following config:
aws emr create-cluster --release-label emr-5.19.0
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=1,InstanceType=m4.large
--applications Name=Spark Name=Hadoop
--name "Snowplow Event Recovery"
Are there any known issues around this or anything obvious I’m likely to have missed?
That got things moving - the step is running now. I'll try to address that PR later tonight if it's still open.
I’m still having issues though if you can help.
I can see the step running, but it never finishes. All I can see in the available logs are messages saying it's running (in the stderr logs, for some reason): 19/02/19 17:46:17 INFO Client: Application report for application_1550597023778_0001 (state: RUNNING)
And in the controller logs: INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-... INFO Process still running
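To dig further, I pulled the YARN application id out of the stderr log so I can fetch the container logs (where I assume the driver actually writes in cluster deploy mode). Sketching it here in case I'm looking in the wrong place:

```shell
# Extract the YARN application id from a step stderr line
# (sample line copied from the logs above).
line='19/02/19 17:46:17 INFO Client: Application report for application_1550597023778_0001 (state: RUNNING)'
app_id=$(echo "$line" | grep -o 'application_[0-9]*_[0-9]*')
echo "$app_id"
# then, on the master node: yarn logs -applicationId "$app_id"
```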
I'm seeing no output; the output directory hasn't been created, etc.
I'm testing this on a small amount of data, so I wouldn't expect the job to take more than a few minutes to run. Even with an invalid config, I'd expect the job to complete with nothing recovered?
Don't know if it makes a difference, but my input and output are on S3, not HDFS. I assumed that was fine, given that I received an exception about the output already existing when I created the directory beforehand.
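Since the job complains if the output prefix already exists, I've been clearing it between attempts like this (bucket and prefix are placeholders for my real ones):

```shell
# Remove a pre-existing output prefix between recovery attempts,
# since the job fails fast if the output location already exists.
# Bucket/prefix below are placeholders.
aws s3 rm s3://my-recovery-bucket/recovered/ --recursive
```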