The “Spark: Shred Enrich Events” step continues to fail, and I’m having trouble finding the cause. The log files are below. Can someone help me find the root issue?
We are using the Clojure Collector and attempting to write to Redshift.
Shred Enrich Event - Step Details
Status: FAILED
Reason:
Log File: s3://snowplow-emretlrunner-out/log/j-1Z6IDPVQ9FWMQ/steps/s-2UH0K79D2I0ZD/stderr.gz
Details: Exception in thread "main" org.apache.spark.SparkException: Application application_1579119791561_0002 finished with failed status
JAR location: command-runner.jar
Main class: None
Arguments :spark-submit --class com.snowplowanalytics.snowplow.storage.spark.ShredJob --master yarn --deploy-mode cluster s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar --iglu-config ew0KICAic2NoZW1hIjogImlnbHU6Y29tLnNub3dwbG93YW5hbHl0aWNzLmlnbHUvcmVzb2x2ZXItY29uZmlnL2pzb25zY2hlbWEvMS0wLTEiLA0KICAiZGF0YSI6IHsNCiAgICAiY2FjaGVTaXplIjogNTAwLA0KICAgICJyZXBvc2l0b3JpZXMiOiBbDQogICAgICB7DQogICAgICAgICJuYW1lIjogIklnbHUgQ2VudHJhbCIsDQogICAgICAgICJwcmlvcml0eSI6IDAsDQogICAgICAgICJ2ZW5kb3JQcmVmaXhlcyI6IFsgImNvbS5zbm93cGxvd2FuYWx5dGljcyIgXSwNCiAgICAgICAgImNvbm5lY3Rpb24iOiB7DQogICAgICAgICAgImh0dHAiOiB7DQogICAgICAgICAgICAidXJpIjogImh0dHA6Ly9pZ2x1Y2VudHJhbC5jb20iDQogICAgICAgICAgfQ0KICAgICAgICB9DQogICAgICB9LA0KICAgICAgew0KICAgICAgICAibmFtZSI6ICJJZ2x1IENlbnRyYWwgLSBHQ1AgTWlycm9yIiwNCiAgICAgICAgInByaW9yaXR5IjogMSwNCiAgICAgICAgInZlbmRvclByZWZpeGVzIjogWyAiY29tLnNub3dwbG93YW5hbHl0aWNzIiBdLA0KICAgICAgICAiY29ubmVjdGlvbiI6IHsNCiAgICAgICAgICAiaHR0cCI6IHsNCiAgICAgICAgICAgICJ1cmkiOiAiaHR0cDovL21pcnJvcjAxLmlnbHVjZW50cmFsLmNvbSINCiAgICAgICAgICB9DQogICAgICAgIH0NCiAgICAgIH0NCiAgICBdDQogIH0NCn0NCg== --input-folder hdfs:///local/snowplow/enriched-events/* --output-folder hdfs:///local/snowplow/shredded-events/ --bad-folder s3://snowplow-emretlrunner-out/shredded/bad/run=2020-01-15-20-18-25/
Action on failure: Terminate cluster
Controller file
2020-01-15T20:28:21.420Z INFO Ensure step 2 jar file command-runner.jar
2020-01-15T20:28:21.421Z INFO StepRunner: Created Runner for step 2
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --class com.snowplowanalytics.snowplow.storage.spark.ShredJob --master yarn --deploy-mode cluster s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar --iglu-config ew0KICAic2NoZW1hIjogImlnbHU6Y29tLnNub3dwbG93YW5hbHl0aWNzLmlnbHUvcmVzb2x2ZXItY29uZmlnL2pzb25zY2hlbWEvMS0wLTEiLA0KICAiZGF0YSI6IHsNCiAgICAiY2FjaGVTaXplIjogNTAwLA0KICAgICJyZXBvc2l0b3JpZXMiOiBbDQogICAgICB7DQogICAgICAgICJuYW1lIjogIklnbHUgQ2VudHJhbCIsDQogICAgICAgICJwcmlvcml0eSI6IDAsDQogICAgICAgICJ2ZW5kb3JQcmVmaXhlcyI6IFsgImNvbS5zbm93cGxvd2FuYWx5dGljcyIgXSwNCiAgICAgICAgImNvbm5lY3Rpb24iOiB7DQogICAgICAgICAgImh0dHAiOiB7DQogICAgICAgICAgICAidXJpIjogImh0dHA6Ly9pZ2x1Y2VudHJhbC5jb20iDQogICAgICAgICAgfQ0KICAgICAgICB9DQogICAgICB9LA0KICAgICAgew0KICAgICAgICAibmFtZSI6ICJJZ2x1IENlbnRyYWwgLSBHQ1AgTWlycm9yIiwNCiAgICAgICAgInByaW9yaXR5IjogMSwNCiAgICAgICAgInZlbmRvclByZWZpeGVzIjogWyAiY29tLnNub3dwbG93YW5hbHl0aWNzIiBdLA0KICAgICAgICAiY29ubmVjdGlvbiI6IHsNCiAgICAgICAgICAiaHR0cCI6IHsNCiAgICAgICAgICAgICJ1cmkiOiAiaHR0cDovL21pcnJvcjAxLmlnbHVjZW50cmFsLmNvbSINCiAgICAgICAgICB9DQogICAgICAgIH0NCiAgICAgIH0NCiAgICBdDQogIH0NCn0NCg== --input-folder hdfs:///local/snowplow/enriched-events/* --output-folder hdfs:///local/snowplow/shredded-events/ --bad-folder s3://snowplow-emretlrunner-out/shredded/bad/run=2020-01-15-20-18-25/'
INFO Environment:
PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
LESS_TERMCAP_md=[01;38;5;208m
LESS_TERMCAP_me=[0m
HISTCONTROL=ignoredups
LESS_TERMCAP_mb=[01;31m
AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
UPSTART_JOB=rc
LESS_TERMCAP_se=[0m
HISTSIZE=1000
HADOOP_ROOT_LOGGER=INFO,DRFA
JAVA_HOME=/etc/alternatives/jre
AWS_DEFAULT_REGION=us-east-1
AWS_ELB_HOME=/opt/aws/apitools/elb
LESS_TERMCAP_us=[04;38;5;111m
EC2_HOME=/opt/aws/apitools/ec2
TERM=linux
XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
runlevel=3
LANG=en_US.UTF-8
AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
MAIL=/var/spool/mail/hadoop
LESS_TERMCAP_ue=[0m
LOGNAME=hadoop
PWD=/
LANGSH_SOURCED=1
HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-2UH0K79D2I0ZD/tmp
_=/etc/alternatives/jre/bin/java
CONSOLETYPE=serial
RUNLEVEL=3
LESSOPEN=||/usr/bin/lesspipe.sh %s
previous=N
UPSTART_EVENTS=runlevel
AWS_PATH=/opt/aws
USER=hadoop
UPSTART_INSTANCE=
PREVLEVEL=N
HADOOP_LOGFILE=syslog
PYTHON_INSTALL_LAYOUT=amzn
HOSTNAME=ip-100-0-24-154
NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-2UH0K79D2I0ZD
EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
SHLVL=5
HOME=/home/hadoop
HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-2UH0K79D2I0ZD/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-2UH0K79D2I0ZD/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-2UH0K79D2I0ZD
INFO ProcessRunner started child process 9618 :
hadoop 9618 4100 0 20:28 ? 00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --class com.snowplowanalytics.snowplow.storage.spark.ShredJob --master yarn --deploy-mode cluster s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar --iglu-config ew0KICAic2NoZW1hIjogImlnbHU6Y29tLnNub3dwbG93YW5hbHl0aWNzLmlnbHUvcmVzb2x2ZXItY29uZmlnL2pzb25zY2hlbWEvMS0wLTEiLA0KICAiZGF0YSI6IHsNCiAgICAiY2FjaGVTaXplIjogNTAwLA0KICAgICJyZXBvc2l0b3JpZXMiOiBbDQogICAgICB7DQogICAgICAgICJuYW1lIjogIklnbHUgQ2VudHJhbCIsDQogICAgICAgICJwcmlvcml0eSI6IDAsDQogICAgICAgICJ2ZW5kb3JQcmVmaXhlcyI6IFsgImNvbS5zbm93cGxvd2FuYWx5dGljcyIgXSwNCiAgICAgICAgImNvbm5lY3Rpb24iOiB7DQogICAgICAgICAgImh0dHAiOiB7DQogICAgICAgICAgICAidXJpIjogImh0dHA6Ly9pZ2x1Y2VudHJhbC5jb20iDQogICAgICAgICAgfQ0KICAgICAgICB9DQogICAgICB9LA0KICAgICAgew0KICAgICAgICAibmFtZSI6ICJJZ2x1IENlbnRyYWwgLSBHQ1AgTWlycm9yIiwNCiAgICAgICAgInByaW9yaXR5IjogMSwNCiAgICAgICAgInZlbmRvclByZWZpeGVzIjogWyAiY29tLnNub3dwbG93YW5hbHl0aWNzIiBdLA0KICAgICAgICAiY29ubmVjdGlvbiI6IHsNCiAgICAgICAgICAiaHR0cCI6IHsNCiAgICAgICAgICAgICJ1cmkiOiAiaHR0cDovL21pcnJvcjAxLmlnbHVjZW50cmFsLmNvbSINCiAgICAgICAgICB9DQogICAgICAgIH0NCiAgICAgIH0NCiAgICBdDQogIH0NCn0NCg== --input-folder hdfs:///local/snowplow/enriched-events/* --output-folder hdfs:///local/snowplow/shredded-events/ --bad-folder s3://snowplow-emretlrunner-out/shredded/bad/run=2020-01-15-20-18-25/
2020-01-15T20:28:23.475Z INFO HadoopJarStepRunner.Runner: startRun() called for s-2UH0K79D2I0ZD Child Pid: 9618
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 1 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 394 seconds
2020-01-15T20:34:55.613Z INFO Step created jobs:
2020-01-15T20:34:55.613Z WARN Step failed with exitCode 1 and took 394 seconds
Stderr file
Warning: Skip remote jar s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar.
20/01/15 20:28:24 INFO RMProxy: Connecting to ResourceManager at ip-100-0-24-154.ec2.internal/100.0.24.154:8032
20/01/15 20:28:24 INFO Client: Requesting a new application from cluster with 1 NodeManagers
20/01/15 20:28:24 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (5632 MB per container)
20/01/15 20:28:24 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
20/01/15 20:28:24 INFO Client: Setting up container launch context for our AM
20/01/15 20:28:24 INFO Client: Setting up the launch environment for our AM container
20/01/15 20:28:24 INFO Client: Preparing resources for our AM container
20/01/15 20:28:25 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
20/01/15 20:28:27 INFO Client: Uploading resource file:/mnt/tmp/spark-c7351112-d3ef-4f42-abcb-c86e8cabaf6e/__spark_libs__2861957843382159812.zip -> hdfs://ip-100-0-24-154.ec2.internal:8020/user/hadoop/.sparkStaging/application_1579119791561_0002/__spark_libs__2861957843382159812.zip
20/01/15 20:28:28 WARN RoleMappings: Found no mappings configured with 'fs.s3.authorization.roleMapping', credentials resolution may not work as expected
20/01/15 20:28:29 INFO Client: Uploading resource s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar -> hdfs://ip-100-0-24-154.ec2.internal:8020/user/hadoop/.sparkStaging/application_1579119791561_0002/snowplow-rdb-shredder-0.13.1.jar
20/01/15 20:28:29 INFO S3NativeFileSystem: Opening 's3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar' for reading
20/01/15 20:28:30 INFO Client: Uploading resource file:/mnt/tmp/spark-c7351112-d3ef-4f42-abcb-c86e8cabaf6e/__spark_conf__3351045635256236100.zip -> hdfs://ip-100-0-24-154.ec2.internal:8020/user/hadoop/.sparkStaging/application_1579119791561_0002/__spark_conf__.zip
20/01/15 20:28:30 INFO SecurityManager: Changing view acls to: hadoop
20/01/15 20:28:30 INFO SecurityManager: Changing modify acls to: hadoop
20/01/15 20:28:30 INFO SecurityManager: Changing view acls groups to:
20/01/15 20:28:30 INFO SecurityManager: Changing modify acls groups to:
20/01/15 20:28:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
20/01/15 20:28:30 INFO Client: Submitting application application_1579119791561_0002 to ResourceManager
20/01/15 20:28:31 INFO YarnClientImpl: Submitted application application_1579119791561_0002
20/01/15 20:28:32 INFO Client: Application report for application_1579119791561_0002 (state: ACCEPTED)
20/01/15 20:28:32 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1579120110995
final status: UNDEFINED
tracking URL: http://ip-100-0-24-154.ec2.internal:20888/proxy/application_1579119791561_0002/
user: hadoop
20/01/15 20:28:33 INFO Client: Application report for application_1579119791561_0002 (state: ACCEPTED)
20/01/15 20:28:34 INFO Client: Application report for application_1579119791561_0002 (state: ACCEPTED)
20/01/15 20:28:35 INFO Client: Application report for application_1579119791561_0002 (state: ACCEPTED)
20/01/15 20:28:36 INFO Client: Application report for application_1579119791561_0002 (state: ACCEPTED)
20/01/15 20:28:37 INFO Client: Application report for application_1579119791561_0002 (state: RUNNING)
20/01/15 20:28:37 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 100.0.18.190
ApplicationMaster RPC port: 0
queue: default
start time: 1579120110995
final status: UNDEFINED
tracking URL: http://ip-100-0-24-154.ec2.internal:20888/proxy/application_1579119791561_0002/
user: hadoop
20/01/15 20:28:38 INFO Client: Application report for application_1579119791561_0002 (state: RUNNING)
20/01/15 20:28:39 INFO Client: Application report for application_1579119791561_0002 (state: RUNNING)
..... omitted several more of these redundant "state: RUNNING" report lines in order to post
20/01/15 20:29:12 INFO Client: Application report for application_1579119791561_0002 (state: RUNNING)
20/01/15 20:34:54 INFO Client: Application report for application_1579119791561_0002 (state: FINISHED)
20/01/15 20:34:54 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 100.0.18.190
ApplicationMaster RPC port: 0
queue: default
start time: 1579120110995
final status: FAILED
tracking URL: http://ip-100-0-24-154.ec2.internal:20888/proxy/application_1579119791561_0002/
user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1579119791561_0002 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/01/15 20:34:54 INFO ShutdownHookManager: Shutdown hook called
20/01/15 20:34:54 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-c7351112-d3ef-4f42-abcb-c86e8cabaf6e
Command exiting with ret '1'
Here is our config.yml (credentials and bucket names redacted):
aws:
  access_key_id:
  secret_access_key:
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE
      jsonpath_assets:
      log: s3://xxxxxxxxx-emretlrunner-out/log
      encrypted: false
      raw:
        in:
          - s3://elasticbeanstalk-us-east-1-xxxxxxxx/resources/environments/logs/publish/e-xxxxxxx
          - s3://elasticbeanstalk-us-east-1-xxxxxxxx/resources/environments/logs/publish/e-yyyyyyy
        processing: s3://xxxxxxxx-emretlrunner-out/processing
        archive: s3://xxxxxxxxx-emretlrunner-out/archive
      enriched:
        good: s3://xxxxxxxxx-emretlrunner-out/enriched/good
        bad: s3://xxxxxxxxx-emretlrunner-out/enriched/bad
        errors:
        archive: s3://xxxxxxxx-emretlrunner-out/enriched/archive
      shredded:
        good: s3://xxxxxxxxx-emretlrunner-out/shredded/good
        bad: s3://xxxxxxxxx-emretlrunner-out/shredded/bad
        errors:
        archive: s3://xxxxxxxx-emretlrunner-out/shredded/archive
    consolidate_shredded_output: false
  emr:
    ami_version: 5.9.0
    region: us-east-1
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
    placement:
    ec2_subnet_id: subnet-0fxxxxxxxxxxxxxxxxxxxxxx
    ec2_key_name: xxxxx-EmrEtlRunner-Snowplow
    security_configuration:
    bootstrap: [] # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase: # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual: # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: c4.xlarge
      core_instance_count: 3
      core_instance_type: c4.xlarge
      core_instance_bid: 0.09
      core_instance_ebs: # Optional. Attach an EBS volume to each core instance.
        volume_size: 100 # Gigabytes
        volume_type: "gp2"
        volume_iops: 400 # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info: # Optional JSON string for selecting additional features
collectors:
  format: clj-tomcat # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for CloudFront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.18.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.1 # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
We experienced an outage due to Maven Central changing its security policy: https://blog.sonatype.com/central-repository-moving-to-https. We missed this very important update and are right now hot-fixing the scripts that communicate with Maven Central.
We fixed the AMI5 script around 6 PM UTC, except in one region (ap-southeast-2). We have now also fixed all of the AMI4 scripts, and we suspect that your particular setup is significantly outdated. It is the EmrEtlRunner component that is responsible for launching the bootstrap scripts, and we believe that yours runs the AMI4 script. Can you make sure you’re using the latest version of EmrEtlRunner?
Updating EmrEtlRunner should only be necessary if the next run is also unsuccessful, since we updated the AMI4 scripts 10 minutes ago.
@anton - It turns out the issue I reported was related to the EMR instance type being undersized, not the change in Maven’s security policy. This wasn’t obvious from the log files; I guessed at increasing capacity and, luckily, it worked.
To avoid future confusion, are there any resources that provide rough guidelines on sizing EMR instances for EmrEtlRunner?
@datawise, when running your job in Spark it’s important to have your Spark configuration set up properly; it is not just the instance type that matters. There is a rough correlation between the volume of enriched data and the EMR cluster size and Spark configuration you need.
The c4 EC2 family might not be the best choice for a Spark job, as Spark does all of its processing in memory. In other words, you might be better off with the memory-optimized r4 types instead.
You should be able to find lots of examples on this forum of how to configure Spark. For example, assuming you are using 1x r4.4xlarge core node in your EMR cluster, your Spark configuration would look something like this:
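(Illustrative numbers only: they assume a single r4.4xlarge core node with 16 vCPUs and 122 GiB of RAM, with one executor-sized slot reserved for the driver; tune them to your actual data volume. The block goes under emr: -> configuration: in config.yml.)
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
        yarn.nodemanager.vmem-check-enabled: "false"
        yarn.nodemanager.resource.memory-mb: "116736" # assumed YARN memory on an r4.4xlarge; check your AMI's yarn-site defaults
        yarn.scheduler.maximum-allocation-mb: "116736"
      spark:
        maximizeResourceAllocation: "false" # size the executors explicitly below instead
      spark-defaults:
        spark.dynamicAllocation.enabled: "false"
        spark.executor.instances: "4" # 4 executors + 1 driver, 3 cores each, leaving a core for YARN/OS
        spark.executor.cores: "3"
        spark.executor.memory: "19G"
        spark.yarn.executor.memoryOverhead: "3072"
        spark.driver.cores: "3"
        spark.driver.memory: "19G"
        spark.yarn.driver.memoryOverhead: "3072"
        spark.default.parallelism: "48" # roughly 4x the total executor cores
Turning maximizeResourceAllocation off and sizing the executors explicitly is usually easier to reason about than letting EMR pick a single maximal executor per node.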
No problem, @Tejas_Behra. To solve my EMR problem I used two r4.2xlarge instances. Here is the working config.yml:
aws:
  access_key_id: xxxxxxxxxxxxxxx
  secret_access_key: xxxxxxxxxxxxxxx
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE
      jsonpath_assets:
      log: s3://xxxxxxxx/log
      encrypted: false
      raw:
        in:
          - s3://xxxxxxxxxx/resources/environments/logs/publish/e-bma4hpmtqn
        processing: s3://xxxxxxxxxx/processing
        archive: s3://xxxxxxxxxx/archive
      enriched:
        good: s3://xxxxxxxxxx/enriched/good
        bad: s3://xxxxxxxxxx/enriched/bad
        errors:
        archive: s3://xxxxxxxxxx/enriched/archive
      shredded:
        good: s3://xxxxxxxxxx/shredded/good
        bad: s3://xxxxxxxxxx/shredded/bad
        errors:
        archive: s3://xxxxxxxxxx/shredded/archive
    consolidate_shredded_output: false
  emr:
    ami_version: 5.9.0
    region: us-east-1
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
    placement:
    ec2_subnet_id: xxxxxxxxxxxxxxxx
    ec2_key_name: RF-EmrEtlRunner-Snowplow
    security_configuration:
    bootstrap: [] # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase: # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual: # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: r4.2xlarge
      core_instance_count: 2
      core_instance_type: r4.2xlarge
      core_instance_bid: 0.17
      core_instance_ebs: # Optional. Attach an EBS volume to each core instance.
        volume_size: 100 # Gigabytes
        volume_type: "gp2"
        volume_iops: 400 # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info: # Optional JSON string for selecting additional features
collectors:
  format: clj-tomcat # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for CloudFront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.18.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.1 # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production