After setting up the S3 Loader to get data from the enriched stream into S3, I am trying to set up the next step: loading the data into my final destination, AWS Redshift.
While reading the articles I got confused by three things: RDB Loader, Storage Loader, and EmrEtlRunner. Could you please help me understand what these are, and which one is right for me? The GitHub tutorial keeps switching back and forth between the three and I got lost.
@AllenWeieiei, RDB Loader is the new name for Storage Loader. You need to set up a batch job orchestrated by EmrEtlRunner. The RDB Loader will be part of this data processing (specify it in the EmrEtlRunner configuration file). To set up the job, you can start your journey from https://github.com/snowplow/snowplow/wiki/1-Installing-EmrEtlRunner. Do bear in mind there are two ways of processing the data, depending on whether you load streamed data from the Raw stream or from the Enriched stream. The latter enables Stream Enrich mode for EmrEtlRunner. The diagram depicting both modes is here.
@AllenWeieiei, no, you do not need to install RDB Loader, but you do need to provide the Redshift configuration. The RDB Loader app will be pulled into the EMR cluster by EmrEtlRunner from our hosted assets location during pipeline job execution.
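For illustration, a Redshift storage target is described in a separate self-describing JSON config file that you point EmrEtlRunner at. The sketch below is an assumption: the host, credentials, and values are placeholders, and the exact schema version and field names vary between Snowplow releases, so check the storage-targets wiki page for your release before copying it.

```
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/4-0-0",
  "data": {
    "name": "AWS Redshift enriched events storage",
    "host": "my-cluster.example.us-east-1.redshift.amazonaws.com",
    "database": "snowplow",
    "port": 5439,
    "username": "storageloader",
    "password": "CHANGE-ME",
    "schema": "atomic",
    "maxError": 1,
    "purpose": "ENRICHED_EVENTS"
  }
}
```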
I have installed EmrEtlRunner on my server and now need to do the configuration.
For the S3 part, I have some questions:
s3:
  region: us-east-1
  buckets:
    assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
    jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
    log: ADD HERE
    encrypted: false # Whether the buckets below are encrypted using server side encryption (SSE-S3)
    raw:
      in: # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
        - ADD HERE # e.g. s3://my-old-collector-bucket
        - ADD HERE # e.g. s3://my-new-collector-bucket
      processing: ADD HERE
      archive: ADD HERE # e.g. s3://my-archive-bucket/raw
    enriched:
      good: ADD HERE # e.g. s3://my-out-bucket/enriched/good
      bad: ADD HERE # e.g. s3://my-out-bucket/enriched/bad
      errors: ADD HERE # Leave blank unless :continue_on_unexpected_error: set to true below
      archive: ADD HERE # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
    shredded:
      good: ADD HERE # e.g. s3://my-out-bucket/shredded/good
      bad: ADD HERE # e.g. s3://my-out-bucket/shredded/bad
      errors: ADD HERE # Leave blank unless :continue_on_unexpected_error: set to true below
      archive: ADD HERE # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  consolidate_shredded_output: false # Whether to combine files when copying from hdfs to s3
Now I have two Kinesis streams containing raw data from the collector (good and bad), three streams containing enriched data (good, bad, pii), and one containing S3 Loader failures. For S3, I have one bucket as the destination of the S3 Loader, which loads data from the enriched-good stream into S3.
For the EmrEtlRunner YAML configuration file, we need raw (in, processing, archive), enriched (good, bad, errors, archive), shredded (good, bad, errors, archive), and consolidate_shredded_output.
How do I give them values? Do I create a bucket for each of these and assign them in the config?
@AllenWeieiei, since you load the enriched stream data to S3, I would suggest running EmrEtlRunner in Stream Enrich mode. That implies an extra bucket in your configuration file, namely enriched:stream (the bucket you sink your enriched stream data to), and you do not need the raw:in bucket at all.
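To illustrate (bucket names below are placeholders, not anything you must use): in Stream Enrich mode the enriched section of the config gains a stream entry pointing at the bucket your S3 Loader sinks to, alongside the good/bad/archive buckets EmrEtlRunner uses for its own output:

```
    enriched:
      stream: s3://my-s3-loader-sink # bucket the S3 Loader writes enriched stream data to
      good: s3://my-out-bucket/enriched/good
      bad: s3://my-out-bucket/enriched/bad
      archive: s3://my-archive-bucket/enriched
```

The raw in/processing buckets are only needed in the classic (batch) mode, where EmrEtlRunner picks up raw collector payloads itself.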
enriched:stream is the bucket where I put my enriched events via the S3 Loader, so what are enriched good and bad? I don't have any buckets for them; those events are just in streams. Also, I have not set up any process to shred data yet. Do I need to do something about this before running EmrEtlRunner?
@ihor, especially the subnet ID: I will have the key name when I create the key pair. For the subnet ID, do I have to create a subnet in the VPC and use its ID?