Error while running StorageLoader

shashi · September 20, 2017, 12:05pm

We are using EmrEtlRunner from the following release:
snowplow_emr_r88_angkor_wat.zip

When running the following command:
./snowplow-storage-loader --config 4-storage/config/emretlrunner.yml --resolver 4-storage/config/resolver.json --targets 4-storage/config/targets/ --skip analyze

I get the following error:

	Unexpected error: JSON instance is not self-describing (schema property is absent):
	 {"$schema":"http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#","description":"Snowplow PostgreSQL storage configuration","self":{"vendor":"com.snowplowanalytics.snowplow.storage","name":"postgresql_config","format":"jsonschema","version":"1-0-0"},"type":"object","properties":{"name":{"type":"PostgreSQL enriched events storage"},"host":{"type":"localhost"},"database":{"type":"snowplow"},"port":{"type":"integer","minimum":1,"maximum":65535},"sslMode":{"type":"DISABLE","enum":["DISABLE","REQUIRE","VERIFY_CA","VERIFY_FULL"]},"schema":{"type":"atomic"},"username":{"type":"power_user"},"password":{"type":"hadoop"},"purpose":{"type":"ENRICHED_EVENTS","enum":["ENRICHED_EVENTS"]}},"additionalProperties":false,"required":["name","host","database","port","sslMode","schema","username","password","purpose"]}
	uri:classloader:/gems/iglu-ruby-client-0.1.0/lib/iglu-client/resolver.rb:92:in `get_schema_key'
	uri:classloader:/gems/iglu-ruby-client-0.1.0/lib/iglu-client/resolver.rb:66:in `parse'
	uri:classloader:/storage-loader/lib/snowplow-storage-loader/config.rb:55:in `get_config'
	uri:classloader:/storage-loader/bin/snowplow-storage-loader:31:in `<main>'
	org/jruby/RubyKernel.java:977:in `load'
	uri:classloader:/META-INF/main.rb:1:in `<main>'
	org/jruby/RubyKernel.java:959:in `require'
	uri:classloader:/META-INF/main.rb:1:in `(root)'
	uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in `<main>'

Below is the config.yml file:

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: xxxxxxxxxxx
  secret_access_key: xxxxxxxxxxxxxxxxxx
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://unilogregion1/logs
      raw:
        in:                  # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
          - s3://ilogregion1         # e.g. s3://my-old-collector-bucket
        processing: s3://ilogregion1/raw/processing
        archive: s3://ilogregion1/raw/archive   # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://ilogregion1/enriched/good        # e.g. s3://my-out-bucket/enriched/good
        bad: s3://ilogregion1/enriched/bad       # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://ilogregion1/enriched/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://ilogregion1/enriched/archive    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://ilogregion1/shredded/good        # e.g. s3://my-out-bucket/shredded/good
        bad: s3://ilogregion1/shredded/bad        # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://ilogregion1/shredded/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://ilogregion1/shredded/archive     # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 5.5.0
    region: us-east-1a       # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement: us-east-1a     # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id:  # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: snowplow
    bootstrap: []           # Set this to specify custom boostrap actions. Leave empty otherwise
    software:
      hbase: "0.92.0"               # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual: "1.1"              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: thrift # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.9.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.12.0
    rdb_shredder: 0.12.0        # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  #snowplow:
    #method: get
    #app_id: unilog # e.g. snowplow
    #collector: 172.31.38.39:8082 # e.g. d3rkrsqld9gmqf.cloudfront.net

Below is my resolver file:

{
	"$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
	"description": "Snowplow PostgreSQL storage configuration",
	"self": {
		"vendor": "com.snowplowanalytics.snowplow.storage",
		"name": "postgresql_config",
		"format": "jsonschema",
		"version": "1-0-0"
	},
	"type": "object",
	"properties": {
		"name": {
			"type": "PostgreSQL enriched events storage"
		},
		"host": {
			"type": "localhost"
		},
		"database": {
			"type": "snowplow"
		},
		"port": {
			"type": "integer",
			"minimum": 1,
			"maximum": 65535
		},
		"sslMode": {
			"type": "DISABLE",
			"enum": ["DISABLE", "REQUIRE", "VERIFY_CA", "VERIFY_FULL"]
		},
		"schema": {
			"type": "atomic"
		},
		"username": {
			"type": "power_user"
		},
		"password": {
			"type": "hadoop"
		},
		"purpose": {
			"type": "ENRICHED_EVENTS",
			"enum": ["ENRICHED_EVENTS"]
		}
	},
	"additionalProperties": false,
	"required": ["name", "host", "database", "port", "sslMode", "schema", "username", "password", "purpose"]
}

Please help me: is there any change I have to make in my config file or command?
How do I run StorageLoader with the snowplow_emr_r91_stonehenge_rc9 version?

BenFradet · September 21, 2017, 11:30am

The resolver is supposed to point to a schema registry (e.g. https://github.com/snowplow/snowplow/blob/master/3-enrich/config/iglu_resolver.json) not be a schema for a specific type of event itself.
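
To illustrate the difference, here is a minimal Python sketch. It assumes that "self-describing" means a top-level `schema` string plus a `data` payload, which is what the error message points at. The helper `is_self_describing` is hypothetical and not part of iglu-ruby-client, and the resolver values shown (`cacheSize`, the single Iglu Central repository) are illustrative, so check them against the linked iglu_resolver.json.

```python
import json


def is_self_describing(instance):
    """Return True if the object has a top-level string "schema" and a "data" key."""
    return (
        isinstance(instance, dict)
        and isinstance(instance.get("schema"), str)
        and "data" in instance
    )


# Resolver configuration in the shape of the linked iglu_resolver.json
# (values are illustrative assumptions, not copied from the thread).
resolver = json.loads("""
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": ["com.snowplowanalytics"],
        "connection": {"http": {"uri": "http://iglucentral.com"}}
      }
    ]
  }
}
""")

# Abbreviated copy of the file originally passed to --resolver:
# it has "$schema" and "self", but no top-level "schema" or "data".
schema_definition = {
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "self": {
        "vendor": "com.snowplowanalytics.snowplow.storage",
        "name": "postgresql_config",
        "format": "jsonschema",
        "version": "1-0-0",
    },
    "type": "object",
}

print(is_self_describing(resolver))           # True
print(is_self_describing(schema_definition))  # False
```

The second object is a JSON Schema definition (it describes what a PostgreSQL target config should look like), which is why the loader rejects it as a resolver.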

shashi · September 21, 2017, 2:39pm

Hi @BenFradet, thanks for the reply.
I ran it with the iglu_resolver.json file, using the command below.

./snowplow-storage-loader --config 4-storage/config/emretlrunner.yml --resolver 4-storage/config/iglu_resolver.json --targets 4-storage/config/targets/ --skip analyze

I am getting the error below:

Error in [redshift.json] The property '#/roleArn' was not of a minimum string length of 20 Shutting down

The --targets directory, 4-storage/config/targets/, contains 5 JSON files (shown in the attached screenshot).

thank you
shashi

anton · September 21, 2017, 5:58pm

Hi @shashi,

Make sure you have a correct roleArn in your redshift.json configuration. It must look something like arn:aws:iam::719197435995:role/RedshiftLoadRole and, by its format definition, cannot be shorter than 20 characters. Most likely you have just RedshiftLoadRole, but you need to provide the full ARN.
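
As a quick local sanity check before rerunning the loader, something like the following Python sketch could be used. The 20-character minimum comes from the error message. The regular expression is an assumption about the usual `arn:aws:iam::<account id>:role/<name>` shape and does not cover other AWS partitions. `check_role_arn` is a hypothetical helper, not part of Snowplow.

```python
import re

# Minimum length reported by the validation error for '#/roleArn'.
MIN_LENGTH = 20

# Assumed shape of an IAM role ARN in the standard "aws" partition:
# arn:aws:iam::<12-digit account id>:role/<role name or path>
ROLE_ARN_PATTERN = re.compile(r"^arn:aws:iam::\d{12}:role/\S+$")


def check_role_arn(value):
    """Return a list of problems found with a roleArn value (empty if none)."""
    problems = []
    if len(value) < MIN_LENGTH:
        problems.append(
            "too short: %d characters, need at least %d" % (len(value), MIN_LENGTH)
        )
    if not ROLE_ARN_PATTERN.match(value):
        problems.append("does not look like arn:aws:iam::<account id>:role/<name>")
    return problems


print(check_role_arn("RedshiftLoadRole"))
print(check_role_arn("arn:aws:iam::719197435995:role/RedshiftLoadRole"))
```

The bare role name fails both checks, while the full ARN from the example above passes with an empty list.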
