But it’s not very clear whether I need to run both the RDB Stream Shredder and the RDB Loader (I guess yes) or just the RDB Stream Shredder. And should the RDB Stream Shredder and the RDB Loader share the same hocon file (since it contains the config for both shredding and loading) and be started with the same args?
Yes, the shredder and the loader are two distinct apps, and you need to run them both.
They use the same config file.
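For illustration, here is a minimal sketch of what that shared config.hocon might look like with the stream shredder. The field names follow the R35-era reference config, and all values are placeholders, not a tested setup:

```hocon
{
  # Human-readable pipeline name (placeholder)
  "name": "acme-pipeline",
  "region": "eu-central-1",

  # SQS queue the shredder uses to tell the loader what to load (assumed name)
  "messageQueue": "rdb-loader.fifo",

  # Shredder section: the "stream" type reads straight from Kinesis, no EMR
  "shredder": {
    "type": "stream",
    "input": {
      "type": "kinesis",
      "streamName": "enriched-events",
      "region": "eu-central-1",
      "position": "LATEST"
    },
    # How often a shredded folder is emitted
    "windowing": "10 minutes",
    "output": {
      "path": "s3://acme-shredded/good/",
      "compression": "GZIP"
    }
  },

  # Loader section: Redshift connection details (placeholders)
  "storage": {
    "type": "redshift",
    "host": "redshift.amazonaws.com",
    "database": "snowplow",
    "port": 5439,
    "roleArn": "arn:aws:iam::123456789012:role/RedshiftLoadRole",
    "schema": "atomic",
    "username": "loader",
    "password": "secret",
    "jdbc": { "ssl": true }
  }
}
```

Both apps would then be pointed at this one file, each reading only the sections relevant to it.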
The loader needs very few resources, and a t3a.micro is enough. For the shredder, the size of the EMR cluster depends on the amount of data that you have. You can find more details about how to run it here.
Isn’t the Stream Shredder supposed not to use EMR? From the announcement:
Unlike the existing Spark Shredder, the Stream Shredder reads data directly from the enriched Kinesis stream and does not use Spark (nor EMR) - it’s a plain JVM application, like Stream Enrich or S3 Loader.
Reading directly from Kinesis means that the Shredder can bypass the long and error-prone S3DistCp staging/archiving steps. Another benefit is that it no longer works with a bounded dataset and can “emit” shredded folders based only on a specified frequency.
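To make the “specified frequency” concrete: in the stream shredder’s config this corresponds to the windowing setting, as in the sketch above (field name assumed from the R35-era config; the value is just an example):

```hocon
# Assumed fragment: how often the stream shredder closes a window
# and emits a shredded folder
"shredder": {
  "windowing": "5 minutes"
}
```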
Sorry, I missed the fact that you were using the new streaming shredder. In that case you indeed don’t need EMR. But please bear in mind that this component is not production-ready yet.
Just to add to @BenB’s comment about why it is not production-ready:
- It does not scale horizontally: you cannot have more than one streaming shredder running at the same time. The shredder sends an SQS message to tell the loader when to load, but this arrangement breaks if multiple shredders try to send the same SQS message.
- It cannot do cross-batch deduplication of events.
- We just have not battle-tested it in a high-throughput pipeline yet.
The streaming shredder will certainly be a core part of Snowplow architecture in the future, just not yet.
@BenB @istreeter Small update just to share that I’ve been running the Stream Shredder in production for 1 month now with no issues so far. I’m running a single instance shredding 3 million events/day on average (peaking at 6 million).
Hi @guillaume - thanks for the update, that’s great to hear!
We have many ideas for where we can take the streaming shredder in the future, such as enabling it to scale beyond a single instance, writing out to different file formats, and porting it to run on other clouds, not just AWS. They’re all just ideas at the moment, so it’s good to hear you’re finding it valuable in its first incarnation.
Hi @guillaume / @BenB / @istreeter - can anyone redirect me to the correct stream-shredder documentation? I am new to Snowplow and a bit lost in the documentation.
Thanks
You can find the docs here. Note that the RDB shredder has been renamed to the RDB transformer. For the moment, you may still see references to “shredder” in some docs, but we’re working on updating this.
Hi @stanch - Thank you for the quick reply.
I have a few more questions about setting up rdb_loader 5.7.1 (transformer + loader).
As I understand it, the transformer takes input from a Kafka topic of enriched data and dumps it to an S3 bucket; after that, it puts a message on a Kafka topic for the loader. But I don’t see any configuration in the Redshift loader config here to read messages from a Kafka topic.
Not at the moment — the Kafka Transformer asset only reads from and writes to Kafka… For Redshift we use Kinesis on AWS, even though I understand that’s not what you want.
Hi @Dhruvi, actually it is probably possible to get the Redshift loader consuming from a Kafka topic. To be completely honest, the reason it’s not documented is that we have never tested that configuration. But you are welcome to give it a try; there’s no reason it shouldn’t work.
In your config file for the redshift loader, try adding this block:
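Something along these lines (a minimal sketch mirroring the Kafka settings used by the transformer assets; the key names messageQueue, bootstrapServers and topicName are assumptions to check against the 5.7.1 reference config):

```hocon
# Assumed shape, untested: point the loader at the topic where the
# transformer publishes its "shredding complete" messages
"messageQueue": {
  "type": "kafka",
  # Kafka broker(s) to connect to (placeholder address)
  "bootstrapServers": "localhost:9092",
  # Topic the transformer writes completion messages to (placeholder name)
  "topicName": "loader-messages"
}
```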