(ArgumentError) AWS EMR API Error (ValidationException): Size of step parameter length exceeded the maximum allowed

I’m using a CloudFront collector and a JavaScript tracker. I have all the enrichments mentioned here in the docs (except the weather enrichment). My iglu_resolver.json file is also the same as the one in the link. Now, when I try to run EmrEtlRunner, I get the error described below.

My config.yml file is as follows (note that I don’t care about data modelling or loading the output data into Redshift/Postgres for now; I only want the enriched output data stored in S3):

aws:
  access_key_id: hidden
  secret_access_key: hidden
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets
      jsonpath_assets:
      log: s3://my-internal-bucket/snowplow-logs/
      encrypted: false
      raw:
        in:
          - s3://snowplow-logs-demo-my_name
        processing: s3://snowplow-my_name-processing/raw
        archive: s3://snowplow-enrichment-archive-my_name/archived/raw
      enriched:
        good: s3://snowplow-enrichment-archive-my_name/cloudfront/enriched/good
        bad: s3://snowplow-enrichment-archive-my_name/cloudfront/enriched/bad
        errors:
        archive: s3://snowplow-enrichment-archive-my_name/enriched/archive
      shredded:
        good: s3://snowplow-enrichment-archive-my_name/shredded/good
        bad: s3://snowplow-enrichment-archive-my_name/shredded/bad
        errors:
        archive: s3://snowplow-enrichment-archive-my_name/shredded/archive
    consolidate_shredded_output: false
  emr:
    ami_version: 5.9.0
    region: us-east-1
    jobflow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole
    placement: us-east-1a
    ec2_subnet_id: subnet-12345
    ec2_key_name: my_key_name
    security_configuration:
    bootstrap: []
    software:
      hbase:
      lingual:
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL
      master_instance_type: m4.large
      core_instance_count: 2
      core_instance_type: m4.large
      core_instance_ebs:
        volume_size: 100
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m4.large
      task_instance_bid: 0.015
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: cloudfront
enrich:
  versions:
    spark_enrich: 1.18.0
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.1        # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {}
  logging:
    level: DEBUG # Optionally switch to INFO for production
  snowplow:
    method: get
    collector: sdfsdfsd.cloudfront.net
    app_id: snowplow-demo
    protocol: http
    port: 80

The error is as follows:

ubuntu@ip-172-99-99-99:~$ ./snowplow-emr-etl-runner run -c config.yml -n enrichments -r iglu_resolver.json 
uri:classloader:/gems/avro-1.8.1/lib/avro/schema.rb:350: warning: constant ::Fixnum is deprecated
D, [2019-10-25T16:13:20.358510 #19182] DEBUG -- : Initializing EMR jobflow
ArgumentError: AWS EMR API Error (ValidationException): Size of step parameter length exceeded the maximum allowed.
                    submit at uri:classloader:/gems/elasticity-6.0.14/lib/elasticity/aws_session.rb:44
              run_job_flow at uri:classloader:/gems/elasticity-6.0.14/lib/elasticity/emr.rb:302
                       run at uri:classloader:/gems/elasticity-6.0.14/lib/elasticity/job_flow.rb:176
                       run at uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:738
                   send_to at uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43
                 call_with at uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76
  block in redefine_method at uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138
                       run at uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:138
                   send_to at uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43
                 call_with at uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76
  block in redefine_method at uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138
                    <main> at uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:41
                      load at org/jruby/RubyKernel.java:994
                    <main> at uri:classloader:/META-INF/main.rb:1
                   require at org/jruby/RubyKernel.java:970
                    (root) at uri:classloader:/META-INF/main.rb:1
                    <main> at uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1
ERROR: org.jruby.embed.EvalFailedException: (ArgumentError) AWS EMR API Error (ValidationException): Size of step parameter length exceeded the maximum allowed.

@kev5, unfortunately, the AWS EMR API has a restriction on the number of characters that can be submitted in a step. Could you try to minify (remove spaces, newline characters, etc.) all of the enrichment configuration files and try again?
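If it helps, here is a minimal Python sketch that compacts every JSON file in a directory in place. The enrichments directory name is only an assumption; point it at wherever your enrichment JSONs (and iglu_resolver.json) live, and keep backups, since it overwrites the files:

# minify_json.py - compact all JSON files in a directory (path is an assumption)
import json
import sys
from pathlib import Path

target_dir = Path(sys.argv[1] if len(sys.argv) > 1 else "enrichments")

for path in target_dir.glob("*.json"):
    data = json.loads(path.read_text())
    # separators=(",", ":") drops the spaces json.dumps would otherwise insert
    path.write_text(json.dumps(data, separators=(",", ":")))
    print("minified", path)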

@ihor Thanks, that worked. I had to minify all the enrichment JSONs along with iglu_resolver.json for it to work. However, I took a look at the enriched results in S3:

enriched:
        good: s3://snowplow-enrichment-archive-my_name/cloudfront/enriched/good
        bad: s3://snowplow-enrichment-archive-my_name/cloudfront/enriched/bad

The data in enriched/good is 0 bytes (the run=2019-10-25-21-01-49 folder), while the data in enriched/bad/run=2019-10-25-21-01-49 looks like the attached image.

What might have gone wrong here?

@kev5, once the job has completed successfully, the good data (files) is archived into the enriched:archive folder, so enriched:good becomes empty. Events that failed validation against the corresponding JSON schemas go into the enriched:bad bucket. That does not necessarily mean there is an implementation problem, but it is advisable to examine your bad data to confirm that.
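To inspect the bad rows, one option is to copy the bad bucket locally (e.g. with aws s3 cp --recursive) and tally the error messages. A rough sketch, assuming the usual batch-enrich bad row shape of one JSON object per line with an errors array (the local folder name bad-rows is just an assumption):

# summarise_bad_rows.py - tally error messages from locally downloaded bad rows
import gzip
import json
from collections import Counter
from pathlib import Path

bad_dir = Path("bad-rows")  # assumed local copy of enriched:bad/run=.../
counts = Counter()

for path in bad_dir.glob("*.gz"):
    with gzip.open(path, "rt") as f:
        for raw in f:
            row = json.loads(raw)
            # each bad row carries the offending payload plus a list of errors;
            # errors may be plain strings or objects with a "message" field
            for err in row.get("errors", []):
                msg = err.get("message", "") if isinstance(err, dict) else err
                counts[msg.splitlines()[0] if msg else "(empty message)"] += 1

for msg, n in counts.most_common(10):
    print(n, msg)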

Oh okay. Thanks for responding, @ihor. One last question: I’m currently using snowplow_emr_r112_baalbek from here. What should I expect to update (e.g. config.yml, iglu_resolver.json, any JSON in the enrichments/ folder, etc.) if I need to upgrade to a newer version, say r116? Is there any documentation for this?

@ihor it would be great to get your feedback on my last question. Thanks!

@kev5, we have an Upgrade Guide you can follow. You would have to review all the releases between your current version and the one you want to upgrade to, as each could introduce something that has to be taken into account.