[enrich] spark: Enrich Raw Events - fails

Hi,
I am new to Snowplow and I am trying to follow the guide at https://www.simoahava.com/analytics/snowplow-full-setup-with-google-analytics-tracking/
I ran into problems when running EmrEtlRunner.

The step with “[enrich] spark: Enrich Raw Events” fails.

My config.yml looks like this:
aws:
  access_key_id: …
  secret_access_key: …
  s3:
    region: eu-west-1
    buckets:
      assets: s3://snowplow-hosted-assets
      jsonpath_assets:
      log: s3://avrang-log
      encrypted: false
      raw:
        in:
          - s3://…
        processing: s3://avrang-data/processing
        archive: s3://avrang-data/archive/raw
      enriched:
        good: s3://avrang-data/enriched/good
        bad: s3://avrang-data/enriched/bad
        errors:
        archive: s3://avrang-data/archive/enriched
      shredded:
        good: s3://avrang-data/shredded/good
        bad: s3://avrang-data/shredded/bad
        errors:
        archive: s3://avrang-data/archive/shredded
    consolidate_shredded_output: false
  emr:
    ami_version: 5.29.0
    region: eu-west-1
    jobflow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole
    placement:
    ec2_subnet_id: subnet-xxxxxxx
    ec2_key_name: xxxxxxx
    security_configuration:
    bootstrap:
    software:
      hbase:
      lingual:
    jobflow:
      job_name: Snowplow ETL
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_bid: 0.015
      core_instance_ebs:
        volume_size: 100
        volume_type: "gp2"
        volume_iops: 400
        ebs_optimized: false
      task_instance_count: 0
      task_instance_type: m1.medium
      task_instance_bid: 0.015
    bootstrap_failure_tries: 3
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:
collectors:
  format: clj-tomcat
enrich:
  versions:
    spark_enrich: 1.18.0
  continue_on_unexpected_error: false
  output_compression: NONE
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.1
    hadoop_elasticsearch: 0.1.0
monitoring:
  tags: {}
  logging:
    level: DEBUG
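
Indentation is significant in this file, so one quick sanity check before handing it to EmrEtlRunner is to confirm it parses as YAML at all. A minimal sketch, assuming Ruby is on the PATH and the file is saved as config.yml:

    # Parse config.yml and print its top-level keys; a malformed file
    # raises a Psych error pointing at the offending line instead.
    ruby -ryaml -e 'puts YAML.load_file("config.yml").keys'

If the file is well-formed, this prints the top-level sections (aws, collectors, enrich, storage, monitoring).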

Controller log:

2020-03-16T08:41:19.994Z INFO Ensure step 3 jar file command-runner.jar
2020-03-16T08:41:19.995Z INFO StepRunner: Created Runner for step 3
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --class com.snowplowanalytics.snowplow.enrich.spark.EnrichJob --master yarn --deploy-mode cluster s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.18.0.jar --input-format clj-tomcat --etl-timestamp 1584346843301 --iglu-config ewogICJzY2hlbWEiOiAiaWdsdTpjb20uc25vd3Bsb3dhbmFseXRpY3MuaWdsdS9yZXNvbHZlci1jb25maWcvanNvbnNjaGVtYS8xLTAtMSIsCiAgImRhdGEiOiB7CiAgICAiY2FjaGVTaXplIjogNTAwLAogICAgInJlcG9zaXRvcmllcyI6IFsKICAgICAgewogICAgICAgICJuYW1lIjogIklnbHUgQ2VudHJhbCIsCiAgICAgICAgInByaW9yaXR5IjogMCwKICAgICAgICAidmVuZG9yUHJlZml4ZXMiOiBbICJjb20uc25vd3Bsb3dhbmFseXRpY3MiIF0sCiAgICAgICAgImNvbm5lY3Rpb24iOiB7CiAgICAgICAgICAiaHR0cCI6IHsKICAgICAgICAgICAgInVyaSI6ICJodHRwOi8vaWdsdWNlbnRyYWwuY29tIgogICAgICAgICAgfQogICAgICAgIH0KICAgICAgfSwKICAgICAgewogICAgICAgICJuYW1lIjogIklnbHUgQ2VudHJhbCAtIEdDUCBNaXJyb3IiLAogICAgICAgICJwcmlvcml0eSI6IDEsCiAgICAgICAgInZlbmRvclByZWZpeGVzIjogWyAiY29tLnNub3dwbG93YW5hbHl0aWNzIiBdLAogICAgICAgICJjb25uZWN0aW9uIjogewogICAgICAgICAgImh0dHAiOiB7CiAgICAgICAgICAgICJ1cmkiOiAiaHR0cDovL21pcnJvcjAxLmlnbHVjZW50cmFsLmNvbSIKICAgICAgICAgIH0KICAgICAgICB9CiAgICAgIH0KICAgIF0KICB9Cn0= --enrichments eyJzY2hlbWEiOiJpZ2x1OmNvbS5zbm93cGxvd2FuYWx5dGljcy5zbm93cGxvdy9lbnJpY2htZW50cy9qc29uc2NoZW1hLzEtMC0wIiwiZGF0YSI6W119 --input-folder hdfs:///local/snowplow/raw-events/* --output-folder hdfs:///local/snowplow/enriched-events/ --bad-folder s3://avrang-data/enriched/bad/run=2020-03-16-09-20-43/'
INFO Environment:
  PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
  LESS_TERMCAP_md=[01;38;5;208m
  LESS_TERMCAP_me=[0m
  HISTCONTROL=ignoredups
  LESS_TERMCAP_mb=[01;31m
  AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
  UPSTART_JOB=rc
  LESS_TERMCAP_se=[0m
  HISTSIZE=1000
  HADOOP_ROOT_LOGGER=INFO,DRFA
  JAVA_HOME=/etc/alternatives/jre
  AWS_DEFAULT_REGION=eu-west-1
  AWS_ELB_HOME=/opt/aws/apitools/elb
  LESS_TERMCAP_us=[04;38;5;111m
  EC2_HOME=/opt/aws/apitools/ec2
  TERM=linux
  runlevel=3
  LANG=en_US.UTF-8
  AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
  MAIL=/var/spool/mail/hadoop
  LESS_TERMCAP_ue=[0m
  LOGNAME=hadoop
  PWD=/
  LANGSH_SOURCED=1
  HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-34F9J63MTL3DW/tmp
  _=/etc/alternatives/jre/bin/java
  CONSOLETYPE=serial
  RUNLEVEL=3
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  previous=N
  UPSTART_EVENTS=runlevel
  AWS_PATH=/opt/aws
  USER=hadoop
  UPSTART_INSTANCE=
  PREVLEVEL=N
  HADOOP_LOGFILE=syslog
  PYTHON_INSTALL_LAYOUT=amzn
  HOSTNAME=ip-172-31-7-85
  HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-34F9J63MTL3DW
  EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
  EMR_STEP_ID=s-34F9J63MTL3DW
  SHLVL=5
  HOME=/home/hadoop
  HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-34F9J63MTL3DW/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-34F9J63MTL3DW/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-34F9J63MTL3DW
INFO ProcessRunner started child process 14816
2020-03-16T08:41:20.000Z INFO HadoopJarStepRunner.Runner: startRun() called for s-34F9J63MTL3DW Child Pid: 14816
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 1 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 66 seconds
2020-03-16T08:42:26.321Z INFO Step created jobs: 
2020-03-16T08:42:26.322Z WARN Step failed with exitCode 1 and took 66 seconds
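
The --iglu-config and --enrichments arguments in the spark-submit command above are base64-encoded JSON, so they can be decoded to check what the job actually received. For example, with GNU coreutils (on macOS, base64 -D):

    # Decode the --enrichments argument from the startExec line above
    echo 'eyJzY2hlbWEiOiJpZ2x1OmNvbS5zbm93cGxvd2FuYWx5dGljcy5zbm93cGxvdy9lbnJpY2htZW50cy9qc29uc2NoZW1hLzEtMC0wIiwiZGF0YSI6W119' | base64 -d

This prints {"schema":"iglu:com.snowplowanalytics.snowplow/enrichments/jsonschema/1-0-0","data":[]}, i.e. no enrichments are enabled; decoding the --iglu-config value the same way shows the standard Iglu Central resolver configuration.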

stderr log:
20/03/16 08:41:28 WARN DependencyUtils: Skip remote jar s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.18.0.jar.
20/03/16 08:41:30 INFO RMProxy: Connecting to ResourceManager at ip-172-31-7-85.eu-west-1.compute.internal/172.31.7.85:8032
20/03/16 08:41:31 INFO Client: Requesting a new application from cluster with 1 NodeManagers
20/03/16 08:41:31 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (2048 MB per container)
20/03/16 08:41:31 INFO Client: Will allocate AM container, with 1024 MB memory including 384 MB overhead
20/03/16 08:41:31 INFO Client: Setting up container launch context for our AM
20/03/16 08:41:31 INFO Client: Setting up the launch environment for our AM container
20/03/16 08:41:31 INFO Client: Preparing resources for our AM container
20/03/16 08:41:32 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
20/03/16 08:41:38 INFO Client: Uploading resource file:/mnt/tmp/spark-244a84da-e796-4949-8d12-a8433e9d0286/__spark_libs__1086151619941629950.zip -> hdfs://ip-172-31-7-85.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1584347434716_0003/__spark_libs__1086151619941629950.zip
20/03/16 08:41:47 INFO ClientConfigurationFactory: Set initial getObject socket timeout to 2000 ms.
20/03/16 08:41:47 INFO Client: Uploading resource s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.18.0.jar -> hdfs://ip-172-31-7-85.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1584347434716_0003/snowplow-spark-enrich-1.18.0.jar
20/03/16 08:41:50 INFO S3NativeFileSystem: Opening 's3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.18.0.jar' for reading
20/03/16 08:41:59 INFO Client: Uploading resource file:/mnt/tmp/spark-244a84da-e796-4949-8d12-a8433e9d0286/__spark_conf__3496232266251664935.zip -> hdfs://ip-172-31-7-85.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1584347434716_0003/__spark_conf__.zip
20/03/16 08:42:00 INFO SecurityManager: Changing view acls to: hadoop
20/03/16 08:42:00 INFO SecurityManager: Changing modify acls to: hadoop
20/03/16 08:42:00 INFO SecurityManager: Changing view acls groups to:
20/03/16 08:42:00 INFO SecurityManager: Changing modify acls groups to:
20/03/16 08:42:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
20/03/16 08:42:08 INFO Client: Submitting application application_1584347434716_0003 to ResourceManager
20/03/16 08:42:08 INFO YarnClientImpl: Submitted application application_1584347434716_0003
20/03/16 08:42:09 INFO Client: Application report for application_1584347434716_0003 (state: ACCEPTED)
20/03/16 08:42:09 INFO Client:
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1584348128456
final status: UNDEFINED
tracking URL: http://ip-172-31-7-85.eu-west-1.compute.internal:20888/proxy/application_1584347434716_0003/
user: hadoop
20/03/16 08:42:10 INFO Client: Application report for application_1584347434716_0003 (state: ACCEPTED)
20/03/16 08:42:11 INFO Client: Application report for application_1584347434716_0003 (state: ACCEPTED)
20/03/16 08:42:12 INFO Client: Application report for application_1584347434716_0003 (state: ACCEPTED)
20/03/16 08:42:13 INFO Client: Application report for application_1584347434716_0003 (state: ACCEPTED)
20/03/16 08:42:14 INFO Client: Application report for application_1584347434716_0003 (state: ACCEPTED)
20/03/16 08:42:15 INFO Client: Application report for application_1584347434716_0003 (state: ACCEPTED)
20/03/16 08:42:16 INFO Client: Application report for application_1584347434716_0003 (state: ACCEPTED)
20/03/16 08:42:17 INFO Client: Application report for application_1584347434716_0003 (state: ACCEPTED)
20/03/16 08:42:18 INFO Client: Application report for application_1584347434716_0003 (state: ACCEPTED)
20/03/16 08:42:19 INFO Client: Application report for application_1584347434716_0003 (state: ACCEPTED)
20/03/16 08:42:20 INFO Client: Application report for application_1584347434716_0003 (state: ACCEPTED)
20/03/16 08:42:21 INFO Client: Application report for application_1584347434716_0003 (state: ACCEPTED)
20/03/16 08:42:22 INFO Client: Application report for application_1584347434716_0003 (state: ACCEPTED)
20/03/16 08:42:23 INFO Client: Application report for application_1584347434716_0003 (state: FAILED)
20/03/16 08:42:23 INFO Client:
client token: N/A
diagnostics: Application application_1584347434716_0003 failed 1 times due to AM Container for appattempt_1584347434716_0003_000001 exited with exitCode: 13
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1584347434716_0003_01_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Container exited with a non-zero exit code 13
For more detailed output, check the application tracking page: http://ip-172-31-7-85.eu-west-1.compute.internal:8088/cluster/app/application_1584347434716_0003 Then click on links to logs of each attempt.
. Failing the application.
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1584348128456
	 final status: FAILED
	 tracking URL: http://ip-172-31-7-85.eu-west-1.compute.internal:8088/cluster/app/application_1584347434716_0003
	 user: hadoop
20/03/16 08:42:23 ERROR Client: Application diagnostics message: Application application_1584347434716_0003 failed 1 times due to AM Container for appattempt_1584347434716_0003_000001 exited with  exitCode: 13
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1584347434716_0003_01_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
	at org.apache.hadoop.util.Shell.run(Shell.java:869)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)


Container exited with a non-zero exit code 13
For more detailed output, check the application tracking page: http://ip-172-31-7-85.eu-west-1.compute.internal:8088/cluster/app/application_1584347434716_0003 Then click on links to logs of each attempt.
. Failing the application.
Exception in thread "main" org.apache.spark.SparkException: Application application_1584347434716_0003 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1149)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1526)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/03/16 08:42:23 INFO ShutdownHookManager: Shutdown hook called
20/03/16 08:42:23 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-244a84da-e796-4949-8d12-a8433e9d0286
20/03/16 08:42:23 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-e79bb151-82a4-4d69-bc5f-ed54908699bd
Command exiting with ret '1'

Any suggestions on solving the problem?

Thanks,
Olzi

@Olzi, I don’t think AMI version 5.29.0 is supported. You also seem to be using very old EC2 instance types for that AMI version, as well as the Clojure collector, which is about to be deprecated. Try using the same AMI as in Simo’s post, 5.9.0, and see if that helps. Do note that since you have already staged the files, you would need to run EmrEtlRunner with --skip staging to continue from that failure.
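
For reference, a resumed run after setting ami_version: 5.9.0 under aws: emr: would look roughly like this; the runner binary and file names are placeholders, not taken from this thread:

    # Resume the pipeline without re-staging the already-moved raw logs
    ./snowplow-emr-etl-runner run \
      --config config.yml \
      --resolver iglu_resolver.json \
      --skip staging

The --skip staging flag tells EmrEtlRunner not to move the raw logs from the in: bucket to processing: again, since the failed run already did that.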

Thanks! I changed the AMI to 5.9.0 and it seems to work! The process has not completed yet; crossing my fingers.
Best regards,
Olzi