Filtered topic streams

Hello there

We have begun implementing Snowplow as our main consumer events pipeline at work.
Really happy with how far you can go with out-of-the-box configurations and guides. Kudos to the team and everyone involved.

One of the requirements we are fielding from the business right now is this - individual teams want a filtered view of the events they care about - in realtime, of course.

The context is that we are using a ‘fat’ pipeline to process all the consumer events. Everything gets sunk into destinations like S3 and Google BigQuery - and hence filtering the batched data is a solved problem.
We would love to have a similar solution for a “filtered realtime view” of events flowing through.

I’d love to get your thoughts/experiences around this topic.

If you only need a subset of events (and you don’t want to read off the main stream) one option is to use Kinesis Tee or a similar Lambda/processor to split the fat stream into individual topics, each filtered according to a condition.
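As a rough sketch, the splitting Lambda could look something like this in Python - the target stream name and the app_id condition are placeholders for whatever each team actually needs:

```python
# Minimal sketch of a filtering Lambda on the fat enriched stream (Kinesis
# trigger). Stream name and filter condition are illustrative placeholders.
import base64
import boto3

kinesis = boto3.client("kinesis")
TARGET_STREAM = "enriched-team-a"  # hypothetical per-team stream


def wanted(tsv_line):
    # app_id is the first column of the Snowplow enriched event TSV;
    # keep only one team's application here.
    return tsv_line.split("\t", 1)[0] == "team-a-app"


def handler(event, context):
    out = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        if wanted(payload):
            out.append({
                "Data": payload.encode("utf-8"),
                "PartitionKey": record["kinesis"]["partitionKey"],
            })
    if out:
        # put_records takes at most 500 records per call and can partially
        # fail, so chunk and retry failures in production.
        kinesis.put_records(StreamName=TARGET_STREAM, Records=out)
```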

If you’re using Lambda one thing you’ll have to be wary of is scaling these streams based on the read/write requirements for each (it’s easier to just set a conservatively high number of shards).


Thanks @mike for sharing your thoughts.
We landed on ideas and conclusions similar to the ones you mention above.

I’ll keep you posted on how we progress.

One of the ideas we are bouncing around in the team is to consume from the fat stream into multiple Kinesis Firehoses, each filtering with a ‘processor’ Lambda that decides which records to ‘keep’ and delivers them into S3 in micro-batches. Individual teams could then listen to S3 notifications and consume the data as they see fit.
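For reference, the ‘processor’ here would be a standard Firehose transformation Lambda - a rough sketch (the filter condition is just a stand-in for a team’s actual criteria):

```python
# Sketch of a Firehose 'processor' Lambda. Firehose passes a batch of
# base64-encoded records; each must be returned tagged "Ok" (keep) or
# "Dropped" (filter out). The event-name check is a placeholder condition.
import base64


def handler(event, context):
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        keep = "\tpage_view\t" in payload  # hypothetical team filter
        output.append({
            "recordId": record["recordId"],
            "result": "Ok" if keep else "Dropped",
            "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```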

Does anyone see any immediate/obvious pitfalls with this approach?

The reason we are thinking about Firehose is to ensure there’s no back-pressure on the rest of the pipeline.

Hey @mike, I was told Kinesis Tee has been deprecated: https://github.com/snowplow/kinesis-tee/pull/32#issuecomment-503890710

What are the alternatives? Kinesis Analytics? A variation of the S3/ES Kinesis sinks that writes to a new Kinesis stream?

Any suggestions?

I didn’t realise Kinesis Tee had been deprecated, but it looks like Gordon, the underlying library it used to deploy Lambda functions via CloudFormation, also hasn’t seen any updates in a few years.

Depending on what you’re trying to achieve, both Kinesis Analytics and Lambda functions are sound options. I don’t think I’d bother too much with the existing S3 / ES sinks unless there’s logic there that you’d like to reuse.

For Lambdas - the advantage is that you can write these in whatever language you’d like, and chances are you can incorporate your filtering (or transforms) within the Lambda function itself. Depending on your volumes or latency requirements, however, this may not necessarily be cost-effective or desirable.

For Kinesis Analytics - this is probably the option I would go down today, although I’d avoid the SQL flavour. Kinesis Analytics SQL is still a bit of a pain: it desperately wants your data to be homogeneous TSV or JSON, and the enriched event being a hybrid of the two makes it quite difficult to work with. It’s significantly easier if you convert the event to JSON upstream (say, using the Analytics SDK in a Lambda function), but that’s extra work.
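For example, that upstream conversion is only a few lines with the Python Analytics SDK - a sketch, assuming it runs in a Lambda over the enriched stream:

```python
# Sketch: enriched TSV -> JSON using the Snowplow Python Analytics SDK
# (snowplow-analytics-sdk on PyPI), e.g. inside a Lambda on the enriched
# stream, so Kinesis Analytics only ever sees homogeneous JSON.
import json

from snowplow_analytics_sdk.event_transformer import transform
from snowplow_analytics_sdk.snowplow_event_transformation_exception import (
    SnowplowEventTransformationException,
)


def to_json(enriched_tsv_line):
    try:
        # transform() returns a dict representation of the enriched event
        return json.dumps(transform(enriched_tsv_line))
    except SnowplowEventTransformationException as e:
        for message in e.error_messages:  # one message per bad field
            print(message)
        return None
```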

The introduction of Flink support in Kinesis Analytics makes this a bit easier - if you’re comfortable with Java / Scala you can simply consume the enriched Kinesis stream, transform the enriched event into JSON using the Analytics SDK (if required), and then use a filter / transform function to get your desired output. Once you’ve got that output you should be able to write it out to whatever your target is.
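To illustrate the shape of such a job, here’s a rough PyFlink sketch with a stubbed-in source - on Kinesis Analytics you’d wire up the Kinesis consumer connector instead, and the same filter / map shape applies in Java / Scala:

```python
# Shape sketch of the filtering job in PyFlink. The in-memory source and
# the app_id condition are placeholders; a real job would read from the
# enriched Kinesis stream via the Kinesis connector.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Stand-in for the enriched Kinesis stream source.
events = env.from_collection([
    "team-a-app\tweb\t2019-07-01 12:00:00.000",
    "other-app\tweb\t2019-07-01 12:00:01.000",
])

filtered = events.filter(
    lambda tsv: tsv.split("\t", 1)[0] == "team-a-app"  # keep one app_id
)
# A .map(...) step here could convert the TSV to JSON before the sink.

filtered.print()
env.execute("filtered-enriched-stream")
```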

If you want to make things a little trickier you could also write the job in Apache Beam rather than Flink, but then you’d have to run it on a persistent EMR cluster, which isn’t ideal.