We have implemented the Snowplow pipeline to the point where “stream-enrich-kinesis” writes TSV files into the S3 loader bucket. We have now decided to shred our enriched events into separate entities using the RDB Shredder. It seems that the RDB Shredder is part of the EmrEtlRunner batch process; however, it is also mentioned here that it can be run manually. I have the following questions regarding the Shredder setup:
- Does running the Shredder manually actually mean using this Snowplow hosted asset?
- What is the difference between what “stream-enrich-kinesis” does and the enrichment step of EmrEtlRunner?
- Can we actually run EmrEtlRunner (in case it is necessary for the shredding job) within a Fargate instance? If not, what is the recommended implementation?
- How should the EmrEtlRunner config file be set up for the whole s3 block (below) in our case? We have only one bucket, which collects the good enriched events. Are all the buckets in the block required? Are they all outputs of the Shredder? If not, which one is required for which step?
s3:
  region: ADD HERE
  buckets:
    assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
    jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
    log: ADD HERE
    encrypted: false # Whether the buckets below are encrypted using server side encryption (SSE-S3)
    raw:
      in: # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
        - ADD HERE # e.g. s3://my-old-collector-bucket
        - ADD HERE # e.g. s3://my-new-collector-bucket
      processing: ADD HERE
      archive: ADD HERE # e.g. s3://my-archive-bucket/raw
    enriched:
      good: ADD HERE # e.g. s3://my-out-bucket/enriched/good
      bad: ADD HERE # e.g. s3://my-out-bucket/enriched/bad
      errors: ADD HERE # Leave blank unless :continue_on_unexpected_error: set to true below
      archive: ADD HERE # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
    shredded:
      good: ADD HERE # e.g. s3://my-out-bucket/shredded/good
      bad: ADD HERE # e.g. s3://my-out-bucket/shredded/bad
      errors: ADD HERE # Leave blank unless :continue_on_unexpected_error: set to true below
      archive: ADD HERE # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  consolidate_shredded_output: true # Whether to combine files when copying from hdfs to s3
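
To keep track of which entries in the block we still have not filled in, here is a minimal sketch in Python. It assumes the s3 block has already been loaded into a plain dict (for example with a YAML parser). The helper name `find_placeholders`, the region value and the filled-in bucket path are made up for illustration and are not part of Snowplow. It only reports what is still set to the placeholder; it says nothing about which buckets are actually required.

```python
# Hypothetical helper (not part of Snowplow): list every entry in the s3 block
# that still holds the "ADD HERE" placeholder from the sample config.

PLACEHOLDER = "ADD HERE"


def find_placeholders(node, path=""):
    """Return the dotted paths of all values still equal to the placeholder."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}.{key}" if path else key
            found.extend(find_placeholders(value, child_path))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found.extend(find_placeholders(value, f"{path}[{index}]"))
    elif node == PLACEHOLDER:
        found.append(path)
    return found


# Hand-written, partial mirror of the s3 block above; the values are
# assumptions for illustration only.
s3_block = {
    "region": "eu-west-1",  # assumed region
    "buckets": {
        "assets": "s3://snowplow-hosted-assets",
        "log": "ADD HERE",
        "raw": {"in": ["ADD HERE"], "processing": "ADD HERE"},
        "enriched": {"good": "s3://my-out-bucket/enriched/good"},
    },
}

if __name__ == "__main__":
    for entry in find_placeholders(s3_block):
        print(entry)
```

With the partial dict above, this would list `buckets.log`, `buckets.raw.in[0]` and `buckets.raw.processing` as still unset.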