We are currently setting up a Snowplow pipeline to ingest events from multiple sources, and we have different SQL enrichments to do based on the type (schema) of the event.
Is it advisable to to setup multiple pipelines instead for different types of events for scalability/efficiency of enrichment (since some enrichment is optional for some events)?
We are using the OSS version of Snowplow, and the Terraform template seems to suggest that we can only enable one SQL enrichment for one pipeline. Therefore, we are also exploring to use custom Javascript enrichment, to query different SQLs based on the schema of the events. Another alternative we see is to use dbt/manually creating views instead to model our data after loading the data (and would lose real-time enrichment and risk of having data in our operational database changed before enrichment is done). Which one would you suggest and are there any better alternatives?
And if using Javascript enrichment, how do we keep a connection pool so that we could reduce overhead of reopening connections?
This is feasible but that doesn’t sound like a viable solution as it would be more expensive to run and more work to operate. It’s better to think about ways to process all the different sources within a same pipeline.
That’s correct, it’s not possible to run several SQL enrichments yet.
Javascript enrichment is executed on the main compute thread pool, therefore it is not recommended to use it to execute blocking calls to external services. What you could do instead is to use the API enrichment and have your HTTP service perform the right SQL query depending on the event schema.
It depends if you want the value attached to the event to be the one that is in the database at the time the event is getting enriched or at the time the model is being run.
I’ll just add some context from working with other users before on similar stuff - I think it’s a good idea to evaluate whether you need this to happen in the pipeline itself.
In my experience, quite often the use case behind a user aiming to use the SQL enrichment is just as well served in data modeling after loading to the database.
If you DO have a requirement for this enrichment in real-time, and just one SQL enrichment won’t cut it, then it may be worth considering if hosting an API service would fit the bill instead.