# Credentials can be hardcoded or set in environment variables
access_key_id: XXX
secret_access_key: XXX
region: us-west-2
assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
log: s3://canvas-snowplow-logs/etl-logs
- s3://canvas-snowplow-logs # Multiple in buckets are permitted
processing: s3://canvas-snowplow-logs/processing
archive: s3://canvas-snowplow-logs/archive # e.g. s3://my-archive-bucket/in
good: s3://canvas-snowplow-logs/enriched/good # e.g. s3://my-out-bucket/enriched/good
bad: s3://canvas-snowplow-logs/enriched/bad # e.g. s3://my-out-bucket/enriched/bad
errors: s3://canvas-snowplow-logs/enriched/errors # Leave blank unless continue_on_unexpected_error: set to true below
archive: s3://canvas-snowplow-logs/enriched/archive # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
good: s3://canvas-snowplow-logs/shredded/good # e.g. s3://my-out-bucket/shredded/good
bad: s3://canvas-snowplow-logs/shredded/bad # e.g. s3://my-out-bucket/shredded/bad
errors: s3://canvas-snowplow-logs/shredded/errors # Leave blank unless continue_on_unexpected_error: set to true below
archive: s3://canvas-snowplow-logs/shredded/archive # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
ami_version: 3.6.0 # Don't change this
region: us-west-2 # Always set this
jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
placement: us-west-2a # Set this if not running in VPC. Leave blank otherwise
ec2_subnet_id: # Set this if running in VPC. Leave blank otherwise
ec2_key_name: canvasSnowplowAnalytics
bootstrap: [] # Set this to specify custom bootstrap actions. Leave empty otherwise
hbase: # To launch on cluster, provide version, "0.92.0", keep quotes
lingual: "1.1" # To launch on cluster, provide version, "1.1", keep quotes
# Adjust your Hadoop cluster below
master_instance_type: m1.medium
core_instance_count: 2
core_instance_type: m1.medium
task_instance_count: 0 # Increase to use spot instances
task_instance_type: m1.medium
task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
format: cloudfront # Or 'clj-tomcat' for the Clojure Collector, or 'thrift' for Thrift records, or 'tsv/' for Cloudfront access logs
job_name: Snowplow canvas ETL # Give your job a name
hadoop_enrich: 1.5.1 # Version of the Hadoop Enrichment process
hadoop_shred: 0.7.0 # Version of the Hadoop Shredding process
hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
continue_on_unexpected_error: false # Set to 'true' (and set out_errors: above) if you don't want any exceptions thrown from ETL
output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
- name: "Canvas snowplow database"
type: redshift
host: XXXX # The endpoint as shown in the Redshift console
database: logs # Name of database
port: XXX # Default Redshift port
username: canvas
password: XXXX
maxerror: 1 # Stop loading on first error, or increase to permit more load errors
comprows: 200000 # Default for a 1 XL node cluster. Not used unless --include compupdate specified
ssl_mode: disable
tags: {} # Name-value pairs describing this job
level: DEBUG # You can optionally switch to INFO for production
method: get
app_id: "Canvas snowplow" # e.g. snowplow
Ah - you have your processing bucket inside your in bucket. Never do this, it creates a circular reference. There is a warning about this in the documentation:
Important 2: do not put your raw:processing inside your raw:in bucket, or your enriched:good inside your raw:processing, or you will create circular references which EmrEtlRunner cannot resolve when moving files.
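As a rough illustration of that warning, here is a sketch of a bucket layout in which raw:processing no longer sits inside raw:in, and enriched:good does not sit inside raw:processing. The `aws: s3: buckets:` nesting is an assumption based on the usual EmrEtlRunner sample config, since the pasted config lost its indentation, and the `canvas-snowplow-etl` bucket name is hypothetical; substitute whatever bucket you actually create.

```yaml
aws:
  s3:
    buckets:
      raw:
        in:
          - s3://canvas-snowplow-logs                     # in bucket from the pasted config, unchanged
        processing: s3://canvas-snowplow-etl/processing   # hypothetical separate bucket, not under raw:in
        archive: s3://canvas-snowplow-etl/archive         # hypothetical
      enriched:
        good: s3://canvas-snowplow-etl/enriched/good      # hypothetical; not under raw:processing
```

The only property that matters here is that none of these paths is nested inside another one that EmrEtlRunner moves files to or from; any layout with that property should avoid the circular reference described in the warning.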
One more thing: my job failed in the processing stage, and there are a few more logs in the raw logs folder. I want both sets of logs to be moved to the database. Can you guide me on what changes I have to make in the config file to successfully move these logs to the database?