After setting up the S3 Loader to get data from the enriched stream into S3, I am trying to set up the next step: loading the data into my final destination, AWS Redshift.
While reading the articles I got confused by three things: RDB Loader, Storage Loader, and EmrEtlRunner. Could you please help me understand what these are, and which one is right for me? The GitHub tutorial keeps switching back and forth between the three and I got lost.
@AllenWeieiei, RDB Loader is the new name for Storage Loader. You need to set up a batch job orchestrated by EmrEtlRunner. The RDB Loader will be part of this data processing (specify it in the EmrEtlRunner configuration file). To set up the job, you can start your journey from https://github.com/snowplow/snowplow/wiki/1-Installing-EmrEtlRunner. Do bear in mind there are two ways of processing the data, depending on whether you load streamed data from the Raw stream or from the Enriched stream. The latter enables Stream Enrich mode for EmrEtlRunner. The diagram depicting both modes is here.
@AllenWeieiei, no, you do not need to install RDB Loader, but you do need to provide the Redshift configuration. The RDB Loader app will be pulled into the EMR cluster by EmrEtlRunner from our hosted assets location during pipeline job execution.
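For illustration, a Redshift storage target is described in a separate self-describing JSON config file that you point EmrEtlRunner at. The sketch below is an assumption: the host, credentials, and values are placeholders, and the exact schema version and field names vary between Snowplow releases, so check the storage-targets wiki page for your release before copying it.

```
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/4-0-0",
  "data": {
    "name": "AWS Redshift enriched events storage",
    "host": "my-cluster.example.us-east-1.redshift.amazonaws.com",
    "database": "snowplow",
    "port": 5439,
    "username": "storageloader",
    "password": "CHANGE-ME",
    "schema": "atomic",
    "maxError": 1,
    "purpose": "ENRICHED_EVENTS"
  }
}
```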
I have installed EmrEtlRunner on my server and now need to do the configuration.
For the S3 part, I have some questions:
s3:
  region: us-east-1
  buckets:
    assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
    jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
    log: ADD HERE
    encrypted: false # Whether the buckets below are encrypted using server side encryption (SSE-S3)
    raw:
      in: # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
        - ADD HERE # e.g. s3://my-old-collector-bucket
        - ADD HERE # e.g. s3://my-new-collector-bucket
      processing: ADD HERE
      archive: ADD HERE # e.g. s3://my-archive-bucket/raw
    enriched:
      good: ADD HERE # e.g. s3://my-out-bucket/enriched/good
      bad: ADD HERE # e.g. s3://my-out-bucket/enriched/bad
      errors: ADD HERE # Leave blank unless :continue_on_unexpected_error: set to true below
      archive: ADD HERE # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
    shredded:
      good: ADD HERE # e.g. s3://my-out-bucket/shredded/good
      bad: ADD HERE # e.g. s3://my-out-bucket/shredded/bad
      errors: ADD HERE # Leave blank unless :continue_on_unexpected_error: set to true below
      archive: ADD HERE # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  consolidate_shredded_output: false # Whether to combine files when copying from hdfs to s3
Now I have two Kinesis streams containing raw data from the collector (good and bad), three streams containing enriched data (good, bad, pii), and one containing S3 Loader failures. For S3, I have one bucket as the destination of the S3 Loader, which loads data from the enriched-good stream into S3.
For the EmrEtlRunner YAML configuration file, we need raw (in, processing, archive), enriched (good, bad, errors, archive), shredded (good, bad, errors, archive), and consolidate_shredded_output.
How do I give them values? Do I create a bucket for each of these and assign them in the config?
@AllenWeieiei, since you load the enriched stream data to S3, I would suggest running EmrEtlRunner in Stream Enrich mode. That implies an extra bucket in your configuration file, namely enriched:stream (the bucket you sink your enriched stream data to), and you do not need the raw:in bucket at all.
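To illustrate (bucket names below are placeholders, not anything you must use): in Stream Enrich mode the enriched section of the config gains a stream entry pointing at the bucket your S3 Loader sinks to, alongside the good/bad/archive buckets EmrEtlRunner uses for its own output:

```
    enriched:
      stream: s3://my-s3-loader-sink # bucket the S3 Loader writes enriched stream data to
      good: s3://my-out-bucket/enriched/good
      bad: s3://my-out-bucket/enriched/bad
      archive: s3://my-archive-bucket/enriched
```

The raw in/processing buckets are only needed in the classic (batch) mode, where EmrEtlRunner picks up raw collector payloads itself.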
enriched:stream is the bucket where I put my enriched events via the S3 Loader, so what are enriched good and bad? I don't have any buckets for them; those events are just in streams. Also, I have not set up any process to shred data yet. Do I need to do something about this before running EmrEtlRunner?
@ihor, especially the subnet ID: I will have the key name when I create the key pair. For the subnet ID, do I have to create a subnet in the VPC and use its ID?