Facing error while executing the EMR ETL command

I am trying to execute the following command:
./snowplow-emr-etl-runner run -c /home/etl_runner/config/snowplow_config.yml -r /home/etl_runner/config/iglu_resolver.json -t /home/etl_runner/targets/ --debug
My targets directory contains my postgres.json file, which specifies the database connection details.
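For reference, the target file follows the standard self-describing JSON layout for a PostgreSQL storage target, roughly like this (values are redacted placeholders, and the exact schema version may differ in your setup):

{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/postgresql_config/jsonschema/1-0-1",
  "data": {
    "name": "PostgreSQL enriched events storage",
    "host": "<postgres host>",
    "database": "<database name>",
    "port": 5432,
    "sslMode": "DISABLE",
    "username": "<username>",
    "password": "<password>",
    "schema": "atomic",
    "purpose": "ENRICHED_EVENTS"
  }
}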
However, while executing the above command I am getting the error "Data files not archived".

Below is the error I am receiving. Can someone please help me fix this?
The same setup works fine in another AWS region.

D, [2020-01-06T09:09:50.629000 #21952] DEBUG -- : Initializing EMR jobflow
D, [2020-01-06T09:09:57.773000 #21952] DEBUG -- : EMR jobflow j-112455263 started, waiting for jobflow to complete...
I, [2020-01-06T09:12:00.362000 #21952] INFO -- : No RDB Loader logs
F, [2020-01-06T09:12:00.790000 #21952] FATAL -- :

Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-112455263 failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
Snowplow ETL: TERMINATED_WITH_ERRORS [VALIDATION_ERROR] ~ elapsed time n/a [ - 2020-01-06 09:11:15 UTC]

    1. Elasticity S3DistCp Step: Shredded S3 -> Shredded Archive S3: CANCELLED ~ elapsed time n/a [ - ]
    2. Elasticity S3DistCp Step: Enriched S3 -> Enriched Archive S3: CANCELLED ~ elapsed time n/a [ - ]
    3. Elasticity Custom Jar Step: Load PostgreSQL enriched events storage Storage Target: CANCELLED ~ elapsed time n/a [ - ]
    4. Elasticity S3DistCp Step: Raw Staging S3 -> Raw Archive S3: CANCELLED ~ elapsed time n/a [ - ]
    5. Elasticity S3DistCp Step: Shredded HDFS _SUCCESS -> S3: CANCELLED ~ elapsed time n/a [ - ]
    6. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
    7. Elasticity Spark Step: Shred Enriched Events: CANCELLED ~ elapsed time n/a [ - ]
    8. Elasticity Custom Jar Step: Empty Raw HDFS: CANCELLED ~ elapsed time n/a [ - ]
    9. Elasticity S3DistCp Step: Enriched HDFS _SUCCESS -> S3: CANCELLED ~ elapsed time n/a [ - ]
    10. Elasticity S3DistCp Step: Enriched HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
    11. Elasticity Spark Step: Enrich Raw Events: CANCELLED ~ elapsed time n/a [ - ]
    12. Elasticity S3DistCp Step: Raw S3 -> Raw HDFS: CANCELLED ~ elapsed time n/a [ - ]
    13. Elasticity S3DistCp Step: Raw s3://snowplow-logs/ -> Raw Staging S3: CANCELLED ~ elapsed time n/a [ - ]
    14. Elasticity Setup Hadoop Debugging: CANCELLED ~ elapsed time n/a [ - ]):
      uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:659:in `run'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
      uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:109:in `run'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
      uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:41:in `<main>'
      org/jruby/RubyKernel.java:979:in `load'
      uri:classloader:/META-INF/main.rb:1:in `<main>'
      org/jruby/RubyKernel.java:961:in `require'
      uri:classloader:/META-INF/main.rb:1:in `(root)'
      uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in

Hi @Vraj - Could you provide us with your config file, with all sensitive information removed?

This error can occur when the instance type you requested to spin up the EMR cluster is not available in your availability zone (or even your region), or when you have reached the limit on the number of EC2 instances running concurrently.

As stated here: [Snowplow::EmrEtlRunner::EmrExecutionError]
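To confirm what the VALIDATION_ERROR actually was, you can also ask EMR directly for the state-change reason of the terminated cluster (substitute your own jobflow ID):

aws emr describe-cluster --cluster-id j-112455263 \
  --query 'Cluster.Status.StateChangeReason'

The Message field in that output usually spells out whether it was an instance-type/capacity problem, an EC2 limit, or something else such as a missing IAM role or subnet.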

Hi Jenni,
Please find the config file below
+++++++++++++++++++++++++
aws:

Credentials can be hardcoded or set in environment variables

access_key_id: <% accesskey %>
secret_access_key: <% secretkey %>
s3:
region: us-east-1
buckets:
assets: s3://snowplow-hosted-assets
jsonpath_assets:
log: s3://snowplow-data/etl_logs
encrypted: false # Whether the buckets below are encrypted using server side encryption (SSE-S3)
raw:
in:
- s3://snowplow-logs/ # will be wherever the cloudfront logs are going
processing: s3://snowplow-data/processing
archive: s3//snowplow-data/archive/raw
enriched:
good: s3://snowplow-data/enriched/good
bad: s3://snowplow-data/enriched/bad
errors: # Leave blank unless :continue_on_unexpected_error: set to true below
archive: s3://snowplow-data/archive/enriched # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
shredded:
good: s3://snowplow-data/shredded/good
bad: s3://snowplow-data/shredded/bad
errors:
archive: s3://snowplow-data/archive/shredded
consolidate_shredded_output: false
emr:
ami_version: 5.9.0
region: us-east-1
jobflow_role: EMR_EC2_DefaultRole
service_role: EMR_DefaultRole
placement:
ec2_subnet_id: # Set this if running in VPC. Leave blank otherwise
ec2_key_name: snowplow
security_configuration:
bootstrap:
software:
hbase:
lingual:
# Adjust your Hadoop cluster below
jobflow:
job_name: Snowplow ETL # Give your job a name
master_instance_type: m4.large
core_instance_count: 2
core_instance_type: m4.large
core_instance_ebs:
volume_size: 100 # Gigabytes
volume_type: "gp2"
volume_iops: 400 # Optional. Will only be used if volume_type is "io1"
ebs_optimized: false # Optional. Will default to true
task_instance_count: 0 # Increase to use spot instances
task_instance_type: m4.large
task_instance_bid: # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
configuration:
yarn-site:
yarn.resourcemanager.am.max-attempts: “1”
spark:
maximizeResourceAllocation: “true”
additional_info: # Optional JSON string for selecting additional features
collectors:
format: 'tsv/com.amazon.aws.cloudfront/wd_access_log' # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
versions:
spark_enrich: 1.18.0 # Version of the Spark Enrichment process
continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
versions:
rdb_loader: 0.14.0
rdb_shredder: 0.13.1 # Version of the Spark Shredding process
hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
tags: {} # Name-value pairs describing this job
logging:
level: DEBUG # You can optionally switch to INFO for production
++++++++++++++++++++++

Hi @Vraj - sorry to be a bother. Would you mind posting it with the YAML formatting preserved? Sometimes the issue can stem from that. The easiest way would be to post it in a code block.
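If you want to rule out a plain YAML parsing problem locally first, any YAML parser will do; for example, with Ruby available, something like:

ruby -ryaml -e 'puts YAML.load_file("/home/etl_runner/config/snowplow_config.yml").keys'

should print the top-level keys (aws, collectors, enrich, storage, monitoring) if the file is structurally valid.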

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: <% accesskey %>
  secret_access_key: <% secretkey %>
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets 
      jsonpath_assets: 
      log: s3://snowplow-data/etl_logs
      encrypted: false # Whether the buckets below are encrypted using server side encryption (SSE-S3)
      raw:
        in:
          - s3://snowplow-logs/ # will be wherever the cloudfront logs are going
        processing: s3://snowplow-data/processing
        archive: s3//snowplow-data/archive/raw
      enriched:
        good: s3://snowplow-data/enriched/good
        bad: s3://snowplow-data/enriched/bad
        errors:     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://snowplow-data/archive/enriched # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://snowplow-data/shredded/good
        bad: s3://snowplow-data/shredded/bad
        errors: 
        archive: s3://snowplow-data/archive/shredded
    consolidate_shredded_output: false
  emr:
    ami_version: 5.9.0
    region: us-east-1
    jobflow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole
    placement: 
    ec2_subnet_id:  # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: snowplow
    security_configuration: 
    bootstrap: []
    software:
      hbase:
      lingual:
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: m4.large
      core_instance_count: 2
      core_instance_type: m4.large
      core_instance_ebs:   
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m4.large
      task_instance_bid: # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: 'tsv/com.amazon.aws.cloudfront/wd_access_log' # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.18.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.1        # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production

Hi @Jenni: please find the above file in YAML format.

@Vraj, the logs indicate the EMR cluster has actually started - it is not an issue with a wrong or unavailable instance type. There is a validation issue somewhere else.

Could you check the EMR cluster logs for more details (via the AWS console) in case some extra info is presented there? What version of EmrEtlRunner are you using?
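If the console doesn't show much, the CLI view of the steps can sometimes be quicker (using the jobflow ID from your run):

aws emr list-steps --cluster-id j-112455263 \
  --query 'Steps[].[Name,Status.State]' --output table

For a cluster-level VALIDATION_ERROR the steps will typically all just show CANCELLED, so the cluster's StateChangeReason (via aws emr describe-cluster) is usually the more telling place to look.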

@ihor: Sure, I will check the EMR logs. I am using the snowplow_emr_r102_afontova_gora.zip package version. Also, the same setup and version is working fine in another account.
Why am I facing this issue here then? Any idea?

Hi @ihor: I have checked the logs and couldn't find anything more detailed. They just say the steps got cancelled, and no reason has been provided.
The logs look something like this:
Cancellation request has succeeded for cluster step s-2KK7M4M0RHRX6 (Elasticity Setup Hadoop D…) in Amazon EMR cluster j-25461455G198 (Snowplow ETL) at 2020-01-06 10:54 UTC, and the step is now cancelled.

I was able to resolve this issue. It was due to an IAM role issue, as the cluster was not picking up the correct role.
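In case it helps anyone else hitting the same VALIDATION_ERROR: before re-running, it is worth checking that the default EMR roles referenced in the config actually exist in the affected account, for example:

aws iam get-role --role-name EMR_DefaultRole
aws iam get-instance-profile --instance-profile-name EMR_EC2_DefaultRole

# if either is missing, the defaults can be (re)created with:
aws emr create-default-roles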
Thanks