Iglu Configuration Issues

I am running EmrEtlRunner R88.

Snowplow ETL: TERMINATED_WITH_ERRORS [VALIDATION_ERROR] ~ elapsed time n/a [ - 2018-11-27 13:30:22 -0800]
 - 1. Elasticity S3DistCp Step: Raw S3 Staging -> S3 Archive: CANCELLED ~ elapsed time n/a [ - ]
 - 2. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
 - 3. Elasticity Scalding Step: Shred Enriched Events: CANCELLED ~ elapsed time n/a [ - ]
 - 4. Elasticity S3DistCp Step: Enriched HDFS _SUCCESS -> S3: CANCELLED ~ elapsed time n/a [ - ]
 - 5. Elasticity S3DistCp Step: Enriched HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
 - 6. Elasticity Scalding Step: Enrich Raw Events: CANCELLED ~ elapsed time n/a [ - ]
 - 7. Elasticity S3DistCp Step: Raw S3 -> HDFS: CANCELLED ~ elapsed time n/a [ - ]):
    uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:500:in `run'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
    uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:74:in `run'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
    uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
    uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:39:in `<main>'
    org/jruby/RubyKernel.java:973:in `load'
    uri:classloader:/META-INF/main.rb:1:in `<main>'
    org/jruby/RubyKernel.java:955:in `require'
    uri:classloader:/META-INF/main.rb:1:in `(root)'
    uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in `<main>'

That is the full error message I get.

I am using the generic iglu_resolver.json file without any changes to it:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },
      {
        "name": "Iglu Central - GCP Mirror",
        "priority": 1,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://mirror01.iglucentral.com"
          }
        }
      }
    ]
  }
}

Here is my config.yml:

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: <%= ENV['AWS_ACCESS_KEY'] %>
  secret_access_key: <%= ENV['AWS_SECRET_KEY'] %>
  s3:
    region: us-west-2
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://samuel-web-track-logs/logs
      raw:
        in:                  # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
          - s3://samuel-web-track-logs         # e.g. s3://my-old-collector-bucket
        processing: s3://samuel-web-track-logs/processing_data
        archive: s3://samuel-web-track-logs/archive_data    # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://samuel-web-track-logs/enriched/good       # e.g. s3://my-out-bucket/enriched/good
        bad: s3://samuel-web-track-logs/enriched/bad        # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://samuel-web-track-logs/enriched/errors    # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://samuel-web-track-logs/enriched/archive    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://samuel-web-track-logs/shredded/good       # e.g. s3://my-out-bucket/shredded/good
        bad: s3://samuel-web-track-logs/shredded/bad        # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://samuel-web-track-logs/shredded/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://samuel-web-track-logs/shredded/archive    # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 4.5.0
    region: us-west-2        # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement:      # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: xxx_samuel
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: cloudfront # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  job_name: Snowplow ETL # Give your job a name
  versions:
    hadoop_enrich: 1.8.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.11.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: atwork # e.g. snowplow
    collector: xxx.cloudfront.net # e.g. d3rkrsqld9gmqf.cloudfront.net

The documentation on this subject is very sparse and vague to me. I would appreciate any help I can get. Thanks!

For validation errors, you should be able to see more information in the AWS EMR console about why the validation error occurred (permissions, invalid instance types, etc.).

Okay, it says there is no default VPC found.

I figured out this issue. In the config file I had defined an EC2 key name (ec2_key_name) without setting a subnet ID (ec2_subnet_id) or placement, so EMR could not find a default VPC to launch into. I found the subnet ID on the EC2 Dashboard under the instance details. Hope this helps someone. Thank you.
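In case it helps, this is roughly what that part of my emr section looks like after the fix; the subnet ID below is just a placeholder, so use the one shown for your own instance in the EC2 Dashboard:

    emr:
      # ... other emr settings unchanged ...
      placement:                                # Leave blank when running in a VPC
      ec2_subnet_id: subnet-0123456789abcdef0   # Placeholder - replace with your own VPC subnet ID
      ec2_key_name: xxx_samuel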
