It looks like the EMR AMI version (release label) in your configuration file is 4.5.0 (that’s the first error you are seeing), but 4.5.0 isn’t available in ap-south-1; it is, however, available in the other Asia Pacific regions.
If you spin EMR up in one of those regions it should work fine. Otherwise it may be possible to bump the EMR version in the config to 4.6.1, but I’m not too sure whether that would work.
Hi @mike, thanks for the quick response. Can you please tell me how I can check which EMR AMI versions are available in each of the regions Amazon provides? Is there a document from Amazon, or do we have to check manually?
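One option, assuming you have a reasonably recent AWS CLI that includes the emr list-release-labels subcommand, is to ask each region directly what it offers. The regions below are just examples, and I’m not certain how far back the listing goes for the older 4.x AMI-style releases, so treat the Amazon EMR release guide in the AWS documentation as the authoritative reference:

# List the EMR release labels a given region advertises
aws emr list-release-labels --region ap-south-1
aws emr list-release-labels --region ap-southeast-1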
The cluster is now launching successfully in the ap-south-1 region, but the job is still not running: it fails with a bootstrap failure. I have not specified any bootstrap step in my config file. I have not configured the storage part yet, so my config is missing those entries.
Config.yml file:
aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: *********
  secret_access_key: ****************
  s3:
    region: ap-south-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://udmd-d-storage/udmd-d-etl/logs
      raw:
        in: # Multiple in buckets are permitted
          - s3://elasticbeanstalk-ap-south-1-872626332308/resources/environments/logs/publish/e-3g6bah32p3 # e.g. s3://my-in-bucket
        processing: s3://udmd-d-storage/udmd-d-etl
        archive: s3://udmd-d-storage/udmd-d-archive # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://udmd-d-storage/udmd-d-enriched/enriched/good # e.g. s3://my-out-bucket/enriched/good
        bad: s3://udmd-d-storage/udmd-d-enriched/enriched/bad # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://udmd-d-storage/udmd-d-enriched/enriched/errors # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://udmd-d-storage/udmd-d-archive/enrich/good # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://udmd-d-storage/udmd-d-enriched/shredded/good # e.g. s3://my-out-bucket/shredded/good
        bad: s3://udmd-d-storage/udmd-d-enriched/shredded/bad # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://udmd-d-storage/udmd-d-enriched/shredded/errors # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://udmd-d-storage/udmd-d-archive/shredded/good # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 4.6.1 # Don't change this
    region: ap-south-1 # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
    ec2_subnet_id: subnet-91c2e2db # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: DemoEnricherKeyPair
    # bootstrap: [] # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase: # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual: # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: c4.large
      core_instance_count: 2
      core_instance_type: c4.large
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    additional_info: # Optional JSON string for selecting additional features
collectors:
  format: clj-tomcat # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  job_name: Snowplow ETL # Give your job a name
  versions:
    hadoop_enrich: 1.8.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.9.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
  targets:
    - name: "My Redshift database"
      type: redshift
      host: ADD HERE # The endpoint as shown in the Redshift console
      database: ADD HERE # Name of database
      port: 5439 # Default Redshift port
      ssl_mode: disable # One of disable (default), require, verify-ca or verify-full
      table: atomic.events
      username: ADD HERE
      password: ADD HERE
      maxerror: 1 # Stop loading on first error, or increase to permit more load errors
      comprows: 200000 # Default for a 1 XL node cluster. Not used unless --include compupdate specified
    - name: "My Elasticsearch database"
      type: elasticsearch
      host: ADD HERE # The Elasticsearch endpoint
      database: ADD HERE # Name of index
      port: 9200 # Default Elasticsearch port - change to 80 if using Amazon Elasticsearch Service
      sources: # Leave blank to write the bad rows created in this run to Elasticsearch, or explicitly provide an array of bad row buckets like ["s3://my-enriched-bucket/bad/run=2015-10-06-15-25-53"]
      ssl_mode: # Not required for Elasticsearch
      table: ADD HERE # Name of type
      username: # Not required for Elasticsearch
      password: # Not required for Elasticsearch
      es_nodes_wan_only: false # Set to true if using Amazon Elasticsearch Service
      maxerror: # Not required for Elasticsearch
      comprows: # Not required for Elasticsearch
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: clojureCollectorDem-env # e.g. snowplow
    collector: ec2-52-66-165-150.ap-south-1.compute.amazonaws.com # e.g. d3rkrsqld9gmqf.cloudfront.net
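One thing worth double-checking, since subnets are region-specific: the subnet in ec2_subnet_id has to live in the same region as emr.region, or EMR will refuse to launch the cluster. A quick way to verify, using a generic AWS CLI lookup rather than anything Snowplow-specific:

# Confirm the subnet exists in ap-south-1 and note its availability zone
aws ec2 describe-subnets --subnet-ids subnet-91c2e2db --region ap-south-1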
@alex @mike, can you please help me understand which bootstrap steps the job is trying to invoke? Could this be a region issue? In the EMR console, below is the bootstrap step which is failing.
It’s my 5th day and I am not able to set up Snowplow. Can anyone please help me with this issue? It would be a great help. I have tried this in 3 regions but have not been able to run it a single time.
This is the error log for the EU region, running the R75 build and getting the error:
OK, I didn’t read the full thing. You are using ami_version 4.6.1? In the EU region, ami_version 4.5.0 (that’s what we are currently using, and I think it’s the official one) should work fine; the rest of the config looks OK to me.
I am receiving the same error. Now I am using ami_version: 4.5.0 and the rest is the same:
hadoop_enrich: 1.8.0 # Version of the Hadoop Enrichment process
hadoop_shred: 0.9.0 # Version of the Hadoop Shredding process
hadoop_elasticsearch: 0.1.0
with the r83-bald-eagle release.
I appreciate your help, @ecoron.
Hmm, I just tested the same release for an upgrade a few days ago and it works fine. But can you check in the AWS EMR console which step it is failing on? It could be that some passed arguments or values exceed the maximum allowed length (https://groups.google.com/forum/#!topic/mrjob/mX00_EElZoY).
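If the console view is hard to read, the AWS CLI can pull up the same step information; the cluster and step ids below are placeholders you would replace with your own:

# List the steps on the cluster and see which one failed
aws emr list-steps --cluster-id j-XXXXXXXXXXX

# Show the arguments and state-change reason for a particular step
aws emr describe-step --cluster-id j-XXXXXXXXXXX --step-id s-XXXXXXXXXXX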
The cluster is not launching, so I am unable to see the logs. The runner is failing before the cluster launches.
Are my hadoop_enrich and hadoop_shred versions correct, @ecoron?
I am running this:
./snowplow-emr-etl-runner --config ~/snowplow-master/3-enrich/emr-etl-runner/config/config.yml.sample --resolver ~/snowplow-master/3-enrich/config/iglu_resolver.json --enrichments ~/snowplow-master/3-enrich/config/enrichments/ --skip elasticsearch,staging
I think there is some mismatch between the release build, the Hadoop enrich/shred jar versions and the AMI version, but I am not sure. Waiting for your reply.
Is there something else customized, like the enrichments or the resolver?
You should see some console output like:
DEBUG -- : Staging raw logs...
...
DEBUG -- : Waiting a minute to allow S3 to settle (eventual consistency)
DEBUG -- : Initializing EMR jobflow
DEBUG -- : EMR jobflow j-XXXXXXXXXXX started, waiting for jobflow to complete...
Basically, for the Snowplow pipeline to run, we have to deploy an array of hosted assets to a public S3 bucket in each AWS region.
Unfortunately for you, we hadn’t yet set this up for Mumbai (ap-south-1). I have made a ticket for this now:
And we are now running the sync process. If all goes well, the sync should be complete in about an hour, and you can try again then.
Apologies for the confusion. It looks like you are the first Snowplow user in ap-south-1 - so let us know if you encounter any further problems downstream.
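Once the sync completes, one way to sanity-check it is to list the regional bucket. I’m assuming here that the regional copies follow a snowplow-hosted-assets-<region> naming convention, so treat the exact bucket name as an assumption rather than confirmed:

# The hosted-assets buckets are public, so --no-sign-request avoids needing credentials
aws s3 ls s3://snowplow-hosted-assets-ap-south-1/ --no-sign-request --region ap-south-1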
I have started the same job multiple times and am getting the same error. Can you please help?
P.S.: this is the standard error I am getting in the EMR console:
Exception in thread "main" cascading.flow.FlowException: step failed: (3/3) ...d/run=2016-10-03-07-25-44, with job id: job_1475479683156_0003, please see cluster logs for failure messages
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:221)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Hi @deepak - you’ll need to dig through the logs (there’s a wiki link shared in the error message that should help) to figure out why the job is failing after 7 minutes in Hadoop Enrich. Let us know what you find out!
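As a rough sketch of where to dig, assuming EMR wrote its logs under the log bucket from the config above and used the usual <log-uri>/<cluster-id>/ layout (the cluster id below is a placeholder):

# See everything EMR recorded for the failed cluster
aws s3 ls s3://udmd-d-storage/udmd-d-etl/logs/j-XXXXXXXXXXX/ --recursive

# Pull down the step logs; stderr and syslog for the Hadoop Enrich step live under steps/<step-id>/
aws s3 cp s3://udmd-d-storage/udmd-d-etl/logs/j-XXXXXXXXXXX/steps/ ./step-logs --recursive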