RDB Loader, Storage Loader, EmrEtlRunner

Hi Team,

After setting up the S3 Loader to get data from the enriched stream into S3, I am trying to set up the next step: loading the data into my final destination, AWS Redshift.

While reading the articles I got confused. Could you please help me understand what these three things are: RDB Loader, Storage Loader, and EmrEtlRunner? Which one is the right one for me? The GitHub tutorial keeps going back and forth between the three, and I got lost.

Thank you so much!

@AllenWeieiei, RDB Loader is the new name for Storage Loader. You need to set up a batch job orchestrated by EmrEtlRunner. RDB Loader will be part of this data processing (you specify it in the EmrEtlRunner configuration file). To set up the job for this task you can start your journey from https://github.com/snowplow/snowplow/wiki/1-Installing-EmrEtlRunner. Do bear in mind there are two ways of processing the data, depending on whether you load the streamed data from the Raw stream or from the Enriched stream. The latter enables Stream Enrich mode for EmrEtlRunner. The diagram depicting both modes is here.

Thanks @ihor.

So I need to install RDB Loader and EmrEtlRunner, then use EmrEtlRunner to orchestrate RDB Loader to get the whole process working?

@AllenWeieiei, no, you do not need to install RDB Loader, but you would need to provide the Redshift configuration. The RDB Loader app will be pulled into the EMR cluster by EmrEtlRunner from our hosted assets location during pipeline job execution.

Check out EmrEtlRunner usage here: https://github.com/snowplow/snowplow/wiki/2-Using-EmrEtlRunner. Here is a sample EmrEtlRunner configuration file: https://github.com/snowplow/snowplow/blob/release/r115-morgantina/3-enrich/emr-etl-runner/config/config.yml.sample (note it might differ from version to version), and here is a sample target configuration file: https://github.com/snowplow/snowplow/blob/release/r115-morgantina/4-storage/config/targets/redshift.json.
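For reference, the Redshift target is a small self-describing JSON file along these lines (just a sketch based on the linked sample; the schema version and exact field names vary between releases, so copy the redshift.json sample for your version rather than this):

{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/4-0-0",
  "data": {
    "name": "AWS Redshift enriched events storage",
    "host": "ADD HERE",
    "database": "ADD HERE",
    "port": 5439,
    "sslMode": "DISABLE",
    "username": "ADD HERE",
    "password": "ADD HERE",
    "schema": "atomic",
    "maxError": 1,
    "compRows": 20000,
    "purpose": "ENRICHED_EVENTS"
  }
}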

@ihor
Great! Thank you!

Hi, @ihor

I have installed EmrEtlRunner on my server and now need to do the configuration.

For the S3 part, I have some questions:

s3:
  region: us-east-1
  buckets:
    assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
    jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
    log: ADD HERE
    encrypted: false # Whether the buckets below are encrypted using server side encryption (SSE-S3)
    raw:
      in: # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
        - ADD HERE # e.g. s3://my-old-collector-bucket
        - ADD HERE # e.g. s3://my-new-collector-bucket
      processing: ADD HERE
      archive: ADD HERE # e.g. s3://my-archive-bucket/raw
    enriched:
      good: ADD HERE # e.g. s3://my-out-bucket/enriched/good
      bad: ADD HERE # e.g. s3://my-out-bucket/enriched/bad
      errors: ADD HERE # Leave blank unless :continue_on_unexpected_error: set to true below
      archive: ADD HERE # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
    shredded:
      good: ADD HERE # e.g. s3://my-out-bucket/shredded/good
      bad: ADD HERE # e.g. s3://my-out-bucket/shredded/bad
      errors: ADD HERE # Leave blank unless :continue_on_unexpected_error: set to true below
      archive: ADD HERE # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  consolidate_shredded_output: false # Whether to combine files when copying from hdfs to s3

Now I have 2 Kinesis streams containing raw data from the collector (good and bad), 3 streams containing enriched data (good, bad, pii), and 1 containing S3 Loader processing failures. For S3 buckets, I have one as the destination of the S3 Loader, which loads data from the enriched-good stream into S3.

For the EmrEtlRunner YAML configuration file, we need raw (in, processing, archive), enriched (good, bad, errors, archive), shredded (good, bad, errors, archive), and consolidate_shredded_output.
How do I give them values? Do I create a bucket for each of these and fill them in?

Thank you! Much appreciated!

@AllenWeieiei, since you load the enriched stream data to S3, I would suggest running EmrEtlRunner in Stream Enrich mode. That implies an extra bucket in your configuration file, namely enriched:stream (the bucket you sink your enriched stream data to), and you do not need the raw:in bucket at all.

You can follow this guide explaining the configuration file in detail: https://github.com/snowplow/snowplow/wiki/Common-configuration.

Hi @ihor.

How do I switch it to Stream Enrich mode? Is there an entry in the configuration file that controls it?

Did you mean using a stream for enriched instead of an S3 bucket?

Thanks!

@AllenWeieiei, the mere presence of an additional enriched:stream bucket in the configuration file puts EmrEtlRunner into Stream Enrich mode.
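For illustration (the bucket name below is made up), it is the stream key under enriched that EmrEtlRunner looks for:

enriched:
  stream: s3://my-enriched-sink # made-up name: the bucket your S3 Loader sinks the enriched stream to
  # ...the other enriched buckets stay as in the sample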

@ihor, ohhh, thanks!

enriched:stream is the bucket where I put my enriched events via the S3 Loader from the stream, so what are enriched good and bad? I don't have any buckets for them; those are just in a stream. Also, I have not set up any process to shred the data yet; do I need to do something about this before running EmrEtlRunner?

thank you again!

@AllenWeieiei, you need to create the other buckets. They are required during batch processing: https://github.com/snowplow/snowplow/wiki/Common-configuration#s3. RDB Shredder, like RDB Loader, is specified in the configuration file and will be pulled down from the assets location: https://github.com/snowplow/snowplow/blob/release/r115-morgantina/3-enrich/emr-etl-runner/config/config.yml.sample#L72-L73.
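To make that concrete, here is a sketch of how the buckets could look in Stream Enrich mode (all bucket names below are invented, so use your own; the raw section is omitted since, as mentioned above, raw:in is not needed in this mode), plus the storage versions lines that tell EmrEtlRunner which RDB Shredder and RDB Loader to pull from the hosted assets:

# under aws: -> s3: as in the sample
buckets:
  assets: s3://snowplow-hosted-assets
  jsonpath_assets:
  log: s3://my-snowplow-etl/logs
  encrypted: false
  enriched:
    stream: s3://my-enriched-sink              # where your S3 Loader writes the enriched stream
    good: s3://my-snowplow-data/enriched/good
    bad: s3://my-snowplow-data/enriched/bad
    errors:
    archive: s3://my-snowplow-archive/enriched
  shredded:
    good: s3://my-snowplow-data/shredded/good
    bad: s3://my-snowplow-data/shredded/bad
    errors:
    archive: s3://my-snowplow-archive/shredded
consolidate_shredded_output: false

storage:
  versions:
    rdb_loader: ADD HERE    # take the version from the sample config for your release
    rdb_shredder: ADD HERE  # same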

@ihor Thanks,

I have another question about the EC2 key pair. In the instructions on GitHub there is a link, https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-3x.html, that is supposed to cover this. When I click it, it is about clusters, applications, and so on, about Amazon EMR.

Are there any specific instructions on how to set the key pair up?

Thank you!

@ihor, especially the subnet ID. For the key name, I will have it once I create the key pair. For the subnet ID, do I have to create a subnet in the VPC and use its ID?

@AllenWeieiei
You can find the instructions on how to create an EC2 Key Pair here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair
For creating and configuring your VPC subnet ID, you can find instructions here: https://docs.aws.amazon.com/vpc/latest/userguide/working-with-vpcs.html#AddaSubnet
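Both values then go into the emr section of the EmrEtlRunner config, roughly like this (a sketch with made-up values; the other emr settings stay as in the sample for your release):

aws:
  emr:
    region: us-east-1
    jobflow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole
    placement:                               # leave blank when running in a VPC
    ec2_subnet_id: subnet-0123456789abcdef0  # the subnet you created in your VPC
    ec2_key_name: my-emr-keypair             # the name of the EC2 key pair you created
    # ...the rest as in the sample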


Thank you so much!