Query on Snowplow Enrichment Configuration

Hi Team,
We need to configure the default enrichment so that events are split based on app_id and pushed into their respective streams.

Current Approach:
We currently run the default Snowplow collector and enricher, both configured as per the documentation. The enricher pushes events into two streams.

After Enrichment:
Good Events → Good Events Stream
Bad Events → Bad Events Stream

The problem we are facing: we run the pipeline for two applications, and the splitting of events is currently done in a splitter Lambda, which pushes events into the respective application streams. We are considering an alternative: split the events during the enrichment phase of the Snowplow enricher, so that events are sent to the respective streams without the Lambda. Since the enricher auto-scales, the volume of incoming events would not be an issue.

What we are planning:

Good Events App_1 → Good Events App_1 Stream
Good Events App_2 → Good Events App_2 Stream
Good Events App_n → Good Events App_n Stream
Bad Events App_1 → Bad Events App_1 Stream
Bad Events App_2 → Bad Events App_2 Stream
Bad Events App_n → Bad Events App_n Stream

Is this feasible, and if so, how can we proceed with this idea? It would be really helpful to know more about it!

Thanks !


Would love to know if you made any progress on this!

Or if others have thoughts on it.

Hi @Prasanth_Rosario,

Apologies you didn’t get an answer sooner - I’m not sure how this one slipped through the cracks.

It’s not possible for enrichment to split events like this - the design of the pipeline is to collect events from many different sources into one unified events stream. You would typically identify events from different apps via app_id.

Typically, the use cases we’ve come across that require this kind of separation have also come with requirements to separate concerns completely - and so have required distinct pipelines (e.g. for purposes of geographical compliance).

So, as I see it, the best available option to split events out is to do it post-enrichment - for example using an AWS Lambda to consume the enriched stream.
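As a minimal sketch of that post-enrichment approach (assuming Python with boto3; the stream-naming scheme and the `target_stream` helper are illustrative assumptions, not from this thread): a Lambda triggered by the enriched Kinesis stream reads each record, takes app_id from the first tab-separated field of the canonical enriched event TSV, and forwards the record to a per-app stream.

```python
import base64


def target_stream(enriched_tsv: str, good: bool = True) -> str:
    """Map one enriched event (TSV) to an illustrative per-app stream name.

    In the Snowplow canonical enriched event, app_id is the first
    tab-separated field; fall back to 'unknown' if it is empty.
    """
    app_id = enriched_tsv.split("\t", 1)[0] or "unknown"
    prefix = "good-events" if good else "bad-events"
    return f"{prefix}-{app_id}"


def handler(event, context):
    """Lambda entry point for a Kinesis trigger (event shape per the
    standard Kinesis Lambda integration)."""
    import boto3  # deferred so target_stream stays testable without AWS

    kinesis = boto3.client("kinesis")
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        kinesis.put_record(
            StreamName=target_stream(payload),
            Data=payload.encode("utf-8"),
            PartitionKey=payload.split("\t", 1)[0] or "unknown",
        )
```

The splitting logic is kept in a small pure function so the routing can be unit-tested without AWS credentials.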

It’s also worth considering whether it is actually a hard requirement that the data from different apps lands in different streams. In a typical real-time use case, what I have normally seen for this type of requirement is a single app that consumes the enriched stream and decides what to do with the data based on a value like app_id. In a more typical analytics use case, it’s normally relatively straightforward to handle this kind of thing in SQL at the data modeling stage.
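To make the single-consumer pattern concrete, here is a hedged sketch (the handler names and the event shape are illustrative assumptions): one consumer reads the enriched stream and dispatches each event on its app_id, with a fallback for unrecognised values.

```python
# All names below are illustrative; the only idea taken from the thread is
# "one consumer decides what to do based on a value like app_id".
def app_1_handler(event):
    return f"app_1 handled {event['event_id']}"


def app_2_handler(event):
    return f"app_2 handled {event['event_id']}"


def unknown_handler(event):
    return f"no route for app_id={event['app_id']}"


HANDLERS = {"app_1": app_1_handler, "app_2": app_2_handler}


def dispatch(event):
    # Route a single enriched event based on its app_id value
    return HANDLERS.get(event["app_id"], unknown_handler)(event)
```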

I hope that’s helpful, and apologies again for not having gotten you an answer sooner.

Best,


As Colm has mentioned, this is difficult to achieve (without reading off Kinesis and using something like Lambda).

Technically you could do a bit of a fan-out during the enrichment phase (using the API / JavaScript enrichment), but that doesn’t solve your bad-events issue - so I agree with Colm here: either do this post-enrichment, or perform it earlier rather than later.

On the “earlier” side, you can do this with a single load balancer that maps paths to different collectors, which can make things a little simpler.

Good Events App_1 → Good Events App_1 Stream
Good Events App_2 → Good Events App_2 Stream
Good Events App_n → Good Events App_n Stream
Bad Events App_1 → Bad Events App_1 Stream
Bad Events App_2 → Bad Events App_2 Stream
Bad Events App_n → Bad Events App_n Stream

This is doable (and easier in GCP where you can create subscriptions to specific app ids, rather than processing every message) but is trickier to achieve in AWS.

In part this is because bad events can happen at multiple parts of the pipeline - including the collector where there isn’t yet a clear representation of app_id until enrich happens.

From a scaling point of view this can also be tricky in a “one stream per app_id” model. Each stream needs to be scaled independently, and the enricher (or whatever is splitting the events) needs to be able to gracefully retry on each Kinesis stream, which can potentially increase latency in the enrichment process. It is manageable, but with more streams comes more complexity and edge handling:

How are app_ids pattern-matched?
What if a stream no longer exists?
What happens to events with app_ids that don’t match a stream name?
Can surge protection (SQS) be easily reconciled with this model?
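One piece of that edge handling can be sketched directly (the patterns and stream names here are made up for illustration): match each app_id against a list of stream-name rules, and route anything unmatched to a catch-all stream rather than dropping it.

```python
import re

# Illustrative rules mapping app_id patterns to stream names; events whose
# app_id matches no rule fall through to a catch-all stream instead of
# being dropped or failing the enrichment step.
STREAM_RULES = [
    (re.compile(r"^app_1$"), "good-events-app-1"),
    (re.compile(r"^app_2$"), "good-events-app-2"),
]
FALLBACK_STREAM = "good-events-unmatched"


def resolve_stream(app_id: str) -> str:
    for pattern, stream in STREAM_RULES:
        if pattern.match(app_id):
            return stream
    return FALLBACK_STREAM
```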
