Can you run multiple instances of the S3 Loader (Kinesis)?

I am pretty new to Snowplow and am working on migrating our S3 Loader to Fargate. I understand the loader uses DynamoDB to keep track of its position (for writing to S3), but I was wondering if it’s even possible to run multiple instances of the S3 Loader. I tried spinning up several instances, but I think each instance was using the same DynamoDB table, which was causing issues. Is it possible to run multiple S3 Loaders, and if so, are there any suggestions on how to do it correctly?

Hi Michael, did you use a different kinesis.appName in the config for each loader?

I hadn’t yet, but a fellow developer had mentioned that. That sounds like my next step. Quick dumb question: if each S3 Loader is writing to the same S3 bucket, the strategy is TRIM_HORIZON, and they are using different DynamoDB tables, how does each instance of the S3 Loader avoid writing out the same records? Thanks again for the response, Jeroen.

Aha, multiple loaders writing to the same target. In that case you indeed need to use the same application name. I’m using multiple enrichers, for example, and they share the same appName. They also consume from Kinesis, so the process is probably the same.

I haven’t tried it with multiple S3 loaders, but I guess it should also work, same as for the enrichers.
Another thing that might block you is not having enough shards: each shard is processed by only one loader, so each additional loader needs a shard to claim.

Hope this helps.


I just found out another dev was using the same DynamoDB table with another S3 loader, so that was my issue.

Hi all!

Just to add some clarity on how the Kinesis applications in the Snowplow pipeline work. Each app (S3 Loader, ES Loaders, Stream Enrich) leverages the KCL (Kinesis Client Library) under the hood.

Each app is capable of scaling horizontally as long as the instances point to the same Kinesis stream and use the same DynamoDB lease table - this lease table tracks all the shards in the stream that the application is consuming and ensures that work is spread evenly amongst the available workers. The table name in turn is controlled by the “app name” you define in the configuration files.

To sum up:

  • If you want to have more compute available to process 1 Kinesis Stream into 1 Destination - use the same app name and it will scale out horizontally.

  • If you instead want to have that same stream loaded to different destinations you need to use a different app name so that it has a different lease table and different checkpointing on the stream.
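To illustrate, here is a sketch of the relevant fragment of a loader’s HOCON configuration. The appName and other values are made up, and the exact key layout varies between loader versions, so check the reference config that ships with your version:

```hocon
# Hypothetical S3 Loader config fragment - only appName matters for this discussion.
# Loaders that should share the work of one stream -> one destination use the SAME
# appName (and therefore the same DynamoDB lease table); loaders feeding different
# destinations from the same stream need DIFFERENT appNames.
kinesis {
  # KCL application name; also used as the DynamoDB lease table name
  appName = "s3-loader-raw-archive"
  initialPosition = "TRIM_HORIZON"
  region = "us-east-1"
}
```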

Hope this helps!


That’s very good information. Thanks, josh.

Are there any documents on scaling this stack out (using Kinesis)? We are currently working on several strategies around CPU utilization and traffic, but it’s still a work in progress.

No worries!

There is scattered documentation but nothing very concrete. Internally we scale the consumers based on CPU usage and have found this to be the most reliable metric to scale on. Adding step-scaling rules to handle spikes effectively can also help.

For Kinesis you really have just two options:

  1. Develop an auto-scaling solution for it - there is nothing out of the box here, so it does require a bit of investment; or
  2. Monitor and set CloudWatch alarms on byte GET / PUT rates to the stream and scale up / down by hand as needed.

Depending on your traffic variance you might get away with just a static shard allocation for your stream without needing to do anything else.
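If you do go with a static allocation, a back-of-the-envelope sizing is straightforward. The per-shard ingest limits (1 MiB/s and 1,000 records/s) come from the AWS Kinesis documentation; the function name and headroom factor below are just illustrative:

```python
import math

# Per-shard Kinesis ingest limits, per AWS documentation:
# 1 MiB/s of data and 1,000 records/s per shard.
SHARD_BYTES_PER_SEC = 1024 * 1024
SHARD_RECORDS_PER_SEC = 1000

def shards_needed(peak_bytes_per_sec, peak_records_per_sec, headroom=1.25):
    """Estimate a static shard count with some headroom for spikes."""
    by_bytes = peak_bytes_per_sec * headroom / SHARD_BYTES_PER_SEC
    by_records = peak_records_per_sec * headroom / SHARD_RECORDS_PER_SEC
    # The binding constraint is whichever limit you hit first.
    return max(1, math.ceil(max(by_bytes, by_records)))

# e.g. a stream peaking at 3 MiB/s and 2,500 records/s
print(shards_needed(3 * 1024 * 1024, 2500))  # -> 4
```

With 25% headroom, 3 MiB/s needs ceil(3.75) = 4 shards, which also comfortably covers the record rate.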

One thing to keep in mind is that multiple servers cannot process the same shard, so your ability to process data is dependent on having enough shards in Kinesis. Essentially, having more servers than shards will not yield better performance, as you cannot split the work any further.
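To make that concrete, here is a toy round-robin lease assignment - not the actual KCL lease-balancing algorithm, just an illustration of why a fourth loader on a three-shard stream has nothing to do (all names are made up):

```python
from collections import defaultdict

def assign_leases(shards, workers):
    """Toy round-robin lease assignment: each shard is leased by exactly one
    worker, so with more workers than shards some workers get no lease."""
    assignment = defaultdict(list)
    for i, shard in enumerate(shards):
        assignment[workers[i % len(workers)]].append(shard)
    return dict(assignment)

shards = ["shardId-000", "shardId-001", "shardId-002"]
workers = ["loader-a", "loader-b", "loader-c", "loader-d"]
leases = assign_leases(shards, workers)
print(leases)  # loader-d holds no lease: a 4th instance adds no throughput here
```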