I am a little confused about when to use Kinesis streams and when to load data into S3 in our pipeline.
In the docs for the Scala Stream Collector, it says we have to provide two Kinesis streams: one for good events and one for events that are too big for a Kinesis stream or in the wrong format. But how would the latter go through a Kinesis stream if they are too big?
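For context, the two-stream requirement looks roughly like this in the collector's HOCON config. This is a hedged sketch, not a definitive config: the stream names (`raw-good`, `raw-bad`), region, and exact key layout are assumptions and vary between collector versions, so check the reference config shipped with the version you deploy:

```hocon
collector {
  interface = "0.0.0.0"
  port = 8080

  streams {
    # Stream for well-formed events (hypothetical name)
    good = "raw-good"
    # Stream for oversized or malformed payloads (hypothetical name)
    bad = "raw-bad"

    sink {
      # Older collector versions select the sink with an "enabled" key
      enabled = kinesis
      region = "eu-west-1"
    }
  }
}
```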
We want to persist our raw event data to S3 as a safety fallback.
2a) Where should we run the s3-loader? On the same instance as the collector? Or do we need a separate autoscaling group for this?
2b) Could we just dump the content of the raw event Kinesis stream into S3 directly via Kinesis Firehose? Or would the data end up in the wrong format?
After StreamEnrich, we need to load the enriched data into S3 since that seems to be the input for the Snowflake transformer & loader. Do we need a separate instance here running the s3-loader?
Events that are too large are generally unrecoverable. There is a new (soon to be released) bad row format that attempts to better address this.
You could use Kinesis Analytics to pipe your raw stream into a Kinesis Firehose that is configured to send data to S3, if you're concerned about having only the enriched events in S3. But I think many would consider that overkill if you have proper testing of your event payloads against something like snowplow-mini.
You can run the s3-loader on the same EC2 instance as the collector if you want. Some users run many of the pipeline components in containers, alongside other things on those hosts.
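To make the s3-loader setup concrete, here is a rough sketch of its HOCON config for sinking the raw stream to S3. Everything here is illustrative: the stream and bucket names are hypothetical, and the exact keys differ between s3-loader versions, so treat the reference config for your version as authoritative. Note that raw (Thrift) records are typically written as LZO, while enriched TSV is usually written as gzip:

```hocon
source = "kinesis"
sink   = "kinesis"   # where *failed* writes go, not the main output

aws {
  accessKey = "iam"  # use instance-profile credentials
  secretKey = "iam"
}

kinesis {
  initialPosition = "LATEST"
  region = "eu-west-1"
  appName = "s3-loader-raw"      # hypothetical DynamoDB checkpoint table name
}

streams {
  inStreamName  = "raw-good"     # hypothetical raw stream
  outStreamName = "s3-loader-bad"
  buffer {
    byteLimit   = 1048576
    recordLimit = 500
    timeLimit   = 60000
  }
}

s3 {
  region = "eu-west-1"
  bucket = "my-raw-events-bucket"  # hypothetical
  format = "lzo"                   # use "gzip" for enriched TSV
}
```

A second s3-loader instance (or container) with `inStreamName` pointing at the enriched stream and `format = "gzip"` would cover the Snowflake transformer's input.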