This is the enriched/good folder after my latest run. Why is this happening? The command I run from the command line is:

    ./r90-emr-etl-runner run --c config90.yml --r iglu_resolver.json --skip rdb_load

The --skip rdb_load is there because I have built a custom script to copy the part-0000* files from S3 over to GCP.
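For reference, that copy script amounts to something like the sketch below. The destination bucket is a placeholder, and it assumes gsutil is configured with AWS credentials (e.g. in ~/.boto) so it can read s3:// URLs directly:

```bash
# Pull only the part files of each enriched run from S3 across to GCS.
# gs://my-gcs-bucket is a placeholder destination; -m parallelises the copy.
# The run=* segment assumes the run=TIMESTAMP subfolders EmrEtlRunner writes.
gsutil -m cp "s3://xx/enriched/good/run=*/part-0000*" gs://my-gcs-bucket/enriched/good/
```

Using cp with a wildcard rather than rsync -r keeps the _SUCCESS markers out of the transfer, since they don't match part-0000*. For completeness, here is my config90.yml: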
aws:
  # Credentials can be hardcoded or set in environment variables (see the export sketch after this config)
  access_key_id: <%= ENV['AWS_ACCESS'] %>
  secret_access_key: <%= ENV['AWS_SECRET'] %>
  s3:
    region: us-west-2
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://xx-logs/logs
      raw:
        in: # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
          - s3://xx-logs # e.g. s3://my-new-collector-bucket
        processing: s3://xx/processing_data
        archive: s3://xx/archive_data # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://xx/enriched/good # e.g. s3://my-out-bucket/enriched/good
        bad: s3://xx/enriched/bad # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://xx/enriched/errors # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://xx/enriched/archive # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://xxt/shredded/good # e.g. s3://my-out-bucket/shredded/good
        bad: s3://xx/shredded/bad # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://xx/shredded/errors # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://xx/shredded/archive # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 5.5.0
    region: us-west-2 # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
    placement: # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-xxx # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: xx
    bootstrap: [] # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase:
      lingual: # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs: # Optional. Attach an EBS volume to each core instance.
        volume_size: 100 # Gigabytes
        volume_type: "gp2"
        volume_iops: 400 # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info: # Optional JSON string for selecting additional features
collectors:
  format: cloudfront # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.9.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.12.0
    rdb_shredder: 0.12.0 # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: xx # e.g. snowplow
    collector: xx.cloudfront.net
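One thing worth noting for anyone comparing configs: because access_key_id and secret_access_key are ERB tags rather than literal values, both environment variables have to be exported in the shell that launches the runner. A minimal sketch with placeholder values:

```bash
# config90.yml expands <%= ENV['AWS_ACCESS'] %> and <%= ENV['AWS_SECRET'] %>
# when it is loaded, so export them before invoking the runner (placeholders):
export AWS_ACCESS="AKIA................"
export AWS_SECRET="........................................"
./r90-emr-etl-runner run --c config90.yml --r iglu_resolver.json --skip rdb_load
```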
I am using the r90 version of the emr-etl-runner.

[screenshot of the enriched/good folder]