Hi Snowplow,
I have a question regarding EmrEtlRunner. I successfully got the raw data files from the tracker enriched and parsed. I noticed that the raw files include a header describing what each field is, like:
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields
In the enriched version, there is no description of each field. I am using the r88 version and the generic iglu_resolver.json file. I am trying to decipher what each field is, and have looked at the http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0 and iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1 schemas.
They look close but not exact; are there any other schemas that might match? I'm guessing some of the fields are left out, or it's not going to be an exact match, for a couple of reasons, right?
Indeed, quite a few fields are expected to be left out. I assume your enriched data is in TSV format. A missing value is indicated by nothing between adjacent tabs.
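If it helps, here is a rough way to eyeball one enriched line against field names. This is only a sketch: the names below are the first few columns of the canonical atomic event model, and part-00000 stands in for one of your enriched files, so adjust both to your release and layout.

```bash
# Sketch: label the columns of one enriched TSV event.
# Field names are assumed from the canonical atomic event model; extend or correct the list as needed.
head -n 1 part-00000 | awk -F'\t' '
  BEGIN {
    cnt = split("app_id platform etl_tstamp collector_tstamp dvce_created_tstamp event event_id txn_id", name, " ")
  }
  {
    for (i = 1; i <= NF; i++)
      printf "%-22s %s\n", (i <= cnt ? name[i] : "field_" i), $i
  }'
```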
Another question: how do I properly set the target? I want my target to be a separate S3 bucket, from which I plan on getting the data into my data warehouse. For example, here is my script for running EmrEtlRunner: ./r88-emr-etl-runner --c configr88.yml --r iglu_resolver.json, and I'm trying to do something like adding --t s3://target-snowplow. Do I need to point it at a file? Obviously, setting the target explicitly this way is not going to work. Is it one of the JSON schemas in 4-storage? I guess I'm wondering whether I should make my own JSON file to do this.
@morris206, if you do not intend to load the data into any other target (Redshift, Snowflake, Postgres, Elasticsearch), then you do not need the -t option at all. The last step in the ETL is archiving the data to S3. You might need to skip the data load step, though. Also, do you need the files/events in S3 as raw, enriched, and/or shredded?
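For example, reusing the flags from your own r88 command, something along these lines should run the pipeline end to end without loading any storage target. This is a sketch only; the load step is named load in r88.

```bash
# Sketch: the r88 invocation from above, with the Redshift/Postgres load step skipped.
./r88-emr-etl-runner --c configr88.yml --r iglu_resolver.json --skip load
```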
Without the --t option, I cannot run the runner more than once, because there are files left in the enriched and shredded folders, etc., and it will not let me run it again. This is really my only issue. If I can get past this and get it to run again, I can take it from there.
Thanks for the help. The r88 version is working fine; is there a reason why I should upgrade? The help text lists:
-x {staging,enrich,shred,elasticsearch,archive_raw,rdb_load,consistency_check,analyze,load_manifest_check,archive_enriched,archive_shredded,staging_stream_enrich}, --skip  skip the specified step(s)
I don't see an option for a target, unless it's named something else.
What I really want is to link it back into an S3 bucket, not skip it. Do you know of a way to do this? The reason is that I want to keep it as batches rather than real-time; this is the way my company and I decided we wanted to do it. Ideally I would keep it in the AWS pipeline this way, and from the target bucket write a custom script to get the data into BigQuery. Thank you.
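A rough, untested sketch of what I have in mind for that script (bucket, dataset, and table names are placeholders; it assumes gsutil has AWS credentials configured so it can read s3:// paths, and that the BigQuery table already exists):

```bash
# Rough sketch: copy one archived enriched run from S3 to GCS, then load the TSV into BigQuery.
RUN="run=2017-09-01-12-00-00"   # example run folder name
gsutil -m rsync -r "s3://my-archive-bucket/enriched/archive/$RUN" "gs://my-gcs-staging/enriched/$RUN"
bq load --source_format=CSV --field_delimiter='\t' --quote='' \
  my_dataset.events "gs://my-gcs-staging/enriched/$RUN/part-*"
```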
Morris
@morris206, you are missing the point here. You do not need targets at all if you do not load the data (step load for r88 and rdb_load for later versions) into Redshift (or other data stores, as per my earlier comments). Having the data in S3 is achieved by archiving it (see step 12 of the diagram, archive_enriched). In any case, if the data is not archived, you cannot run EmrEtlRunner again.
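As a quick check on your side, the archive locations from your config should contain a run=... folder for every batch that completed; something like this (bucket names are your placeholders) will show whether anything has actually been archived:

```bash
# Sketch: list the archive locations; each successful run should show up as a run=... prefix.
aws s3 ls s3://xxx/enriched/archive/
aws s3 ls s3://xxx/shredded/archive/
```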
I upgraded to the r90 version because it gives the option of skipping rdb_load, but it's still not letting me run. Could you be a bit more explicit about what you're trying to say?
What part am I missing where the data doesn't automatically get archived?
@morris206, you cannot run EmrEtlRunner if the files have not been archived. You need to follow the dataflow diagram to recover the pipeline correctly (see instructions under the diagram - for R90 the diagram is here). Once you have recovered the pipeline you need to use the appropriate skip options to keep the batch pipeline running.
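If you prefer to clean up by hand rather than resume individual steps, you can also move the leftover files into the archive locations yourself, roughly as below. This is a sketch using the placeholder buckets from your config and an example run folder name; double-check the paths against the diagram before moving anything.

```bash
# Manual recovery sketch: empty the in-flight locations so EmrEtlRunner will accept a new run.
aws s3 mv s3://xxx/enriched/good/run=2017-09-01-12-00-00/ \
          s3://xxx/enriched/archive/run=2017-09-01-12-00-00/ --recursive
aws s3 mv s3://xxx/shredded/good/run=2017-09-01-12-00-00/ \
          s3://xxx/shredded/archive/run=2017-09-01-12-00-00/ --recursive
# If raw files are still sitting in processing, archive that batch too:
aws s3 mv s3://xxx/processing_data/ \
          s3://xxx/archive_data/run=2017-09-01-12-00-00/ --recursive
```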
What is your current pipeline architecture and what are you trying to achieve?
Right now, I have the runner working properly; I just upgraded to r90 to have the --skip rdb_load option. From here, I want to be able to schedule the runner to run every 2 hours, because we want the final product to be a batch process, not a real-time process. So my only concern at this point is being able to get it to run. Obviously the data is not being archived properly, because I cannot run EmrEtlRunner more than once. This is my main concern. I am eventually going to write a script to get the data into our Google environment; that's all. So, once again, I just want to be able to run the runner more than once. I get that the data needs to be archived. How can I achieve this? Here is my config file (and a sketch of the schedule I have in mind after it):
```
aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: <%= ENV['AWS_ACCESS_KEY'] %>
  secret_access_key: <%= ENV['AWS_SECRET_KEY'] %>
  s3:
    region: us-west-2
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://xxx/logs
      raw:
        in: # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
          - s3://xxx # e.g. s3://my-new-collector-bucket
        processing: s3://xxx/processing_data
        archive: s3://xxx/archive_data # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://xxx/enriched/good # e.g. s3://my-out-bucket/enriched/good
        bad: s3://xxx/enriched/bad # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://xxx/enriched/errors # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://xxx/enriched/archive # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://xxx/shredded/good # e.g. s3://my-out-bucket/shredded/good
        bad: s3://xxx/shredded/bad # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://xxx/shredded/errors # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://xxx/shredded/archive # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 5.5.0
    region: us-west-2 # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
    placement: # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-xxx # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: xxx_track
    bootstrap: [] # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase: # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual: # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs: # Optional. Attach an EBS volume to each core instance.
        volume_size: 100 # Gigabytes
        volume_type: "gp2"
        volume_iops: 400 # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info: # Optional JSON string for selecting additional features
collectors:
  format: cloudfront # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.9.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.12.0
    rdb_shredder: 0.12.0 # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: atwork # e.g. snowplow
    collector: xxx.cloudfront.net
```
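For the 2-hour schedule mentioned above, this is roughly what I am planning (paths and the skip list are placeholders and may need adjusting):

```bash
# Sketch of a crontab entry: run EmrEtlRunner every 2 hours, skipping the load into Redshift,
# and append the output to a log file for debugging.
0 */2 * * * cd /opt/snowplow && ./r90-emr-etl-runner --c config.yml --r iglu_resolver.json --skip rdb_load >> /var/log/snowplow-etl.log 2>&1
```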
@morris206, have you followed the instructions on the dataflow wiki? Have you recovered your pipeline?
As long as any of the processing, enriched/good, or shredded/good buckets is not empty, you cannot run EmrEtlRunner. Depending on where the files are present, use the appropriate recovery step (or even archive/move the files manually). Once they are empty, you need to ensure your CLI command is correct for your scenario. If you do not intend to use Redshift, the skip options for your EmrEtlRunner would be --skip shred,rdb_load,archive_shredded (after you have recovered the pipeline from its current state). Do, please, review the wiki I pointed to in order to understand the steps to skip.
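Concretely, once those buckets are empty, a command along these lines (mirroring the flags from your earlier invocation; file names are whatever you use) should let the batch run repeat without a storage target:

```bash
# Sketch of the recurring r90 run with no Redshift load: shredding, loading, and shredded archiving are skipped.
./r90-emr-etl-runner --c config.yml --r iglu_resolver.json --skip shred,rdb_load,archive_shredded
```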