A few times, enriched events were not copied to S3 because of this error:
```
Exception in thread "main" java.io.IOException: Error opening job jar: /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar
    at org.apache.hadoop.util.RunJar.run(RunJar.java:160)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.util.zip.ZipException: zip file is empty
    at java.util.zip.ZipFile.open(Native Method)
```
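One way to confirm the cause is to check the jar from the stack trace directly on the EMR master node. This is a hypothetical diagnostic sketch, not anything from Snowplow's tooling; it assumes `unzip` is available on the node:

```shell
# Check whether the jar RunJar complained about is a readable, non-empty
# zip archive. A zero-byte file reproduces the "zip file is empty" error.
jar_ok() {
  [ -s "$1" ] && unzip -l "$1" >/dev/null 2>&1
}

# On the EMR master node:
# jar_ok /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar \
#   || echo "s3-dist-cp.jar is missing, empty, or corrupt"
```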
Re-running EmrEtlRunner with the staging step skipped fixes the issue. I found troubleshooting tips here:
I guess such issues are difficult to settle definitively, but most users presumably drive their Snowplow pipeline from a task scheduler like Jenkins, so re-running manually is not ideal: the pipeline is not self-healing. Has anyone else hit this? Any ideas?
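Until there is a proper fix, one stopgap is to wrap the scheduler's invocation in a retry that skips staging on the second attempt (since the raw logs have already been moved by then). A minimal sketch; the EmrEtlRunner invocation and the `--skip staging` flag follow the standard CLI, but adapt them to your setup:

```shell
# Run the given pipeline command; if it fails, retry once with
# "--skip staging" appended, since staging already completed.
retry_with_skip_staging() {
  "$@" && return 0
  echo "first run failed; retrying with --skip staging" >&2
  "$@" --skip staging
}

# Example invocation (paths/flags are assumptions, adjust as needed):
# retry_with_skip_staging ./snowplow-emr-etl-runner --config config.yml
```

A single retry is deliberate: if the second run also fails, the job should surface as failed in Jenkins rather than loop.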
Here’s my config:
```yaml
emr:
  ami_version: 4.5.0
  region: eu-central-1
  jobflow_role: EMR_EC2_DefaultRole
  service_role: EMR_DefaultRole
  placement:
  ec2_subnet_id: subnet-[...]
  ec2_key_name: my_key
  bootstrap: []
  software:
    hbase:
    lingual:
  jobflow:
    master_instance_type: m4.large
    core_instance_count: 3
    core_instance_type: c3.4xlarge
    task_instance_count: 0
    task_instance_type: c4.large
    task_instance_bid:
  bootstrap_failure_tries: 3
  additional_info:
collectors:
  format: clj-tomcat
enrich:
  job_name: snowplow ETL
  versions:
    hadoop_enrich: 1.8.0
    hadoop_shred: 0.10.0
    hadoop_elasticsearch: 0.1.0
  continue_on_unexpected_error: false
  output_compression: GZIP
```