I’m exploring Snowplow and was wondering if it’s possible to deploy it without storage backends such as Postgres or Redshift. My goal is to use Snowplow’s streaming functionality and load the events directly into Iceberg tables via the Lake Loader.
Is there a way to configure Snowplow to bypass the storage layer while maintaining its event processing pipeline for this use case? Any advice or best practices for this setup would be greatly appreciated!
Yes, that’s a supported architecture, and actually a fairly common one!
To explain a bit: what you’re referring to as ‘Snowplow’ here is what we generally call a ‘Snowplow pipeline’. It’s actually a set of applications and cloud components that connect together (the branding doesn’t lend itself well to that concept, so it can seem like ‘Snowplow’ is a single application). Collector → stream → enrich → stream → loader are the main pieces, but there’s a lot more involved.
So what you’re after fits into that typical architecture, and you’ll be looking to use the Lake Loader.
There’s a lot of infrastructure involved, though. The easiest way to get started and get your head around it all is to sign up for the Community Edition and use that as your jumping-off point. The quickstart guide in the docs will give you a sense of what’s involved in setting that up.
The distributions with a Databricks destination use the Lake Loader, but configured for a different lake format than the one you’re after. So to set things up for Iceberg, you’d need to put a bit of effort into amending that configuration (Iceberg is a supported format; we just don’t have a quickstart for it at present).
This will likely be far simpler than starting from scratch, though! If I may suggest what I think is the probable path of least friction, I would:
1. Sign up for the Community Edition and set up one of the standard pipelines, with a supported destination, to get started.
2. Set up some tracking, see things working end to end, and explore. At this point my aim would be first to confirm that tracking works end to end and get familiar with things, and second to validate my tracking use case.
3. Look into the quickstart code for how Databricks is set up, and work on configuring the Iceberg format (there’s a rough sketch of what that change might look like below).
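To give a rough sense of what step 3 involves: the Lake Loader’s table format is set in its HOCON config (in recent versions via the `output.good` block, if I recall correctly), so moving from the Databricks/Delta setup to Iceberg is largely a matter of swapping that block out. The sketch below is only illustrative, assuming an S3 location and a Glue catalog; the exact key names and allowed values vary by Lake Loader version and catalog type, so treat everything here as a placeholder and check the Lake Loader configuration reference for your version rather than copying it verbatim.

```hocon
{
  "output": {
    "good": {
      # Illustrative sketch only: key names are assumptions and may differ
      # between Lake Loader versions; check the configuration reference.

      # Swap the Databricks/Delta output type for Iceberg
      "type": "Iceberg"

      # Hypothetical lake location for the table data and metadata
      "location": "s3://my-snowplow-lake/events"

      # Catalog the Iceberg table is registered in (Glue shown as an example;
      # the available catalog types depend on your cloud and loader version)
      "catalog": {
        "type": "Glue"
      }

      # Hypothetical database and table names
      "database": "snowplow"
      "table": "events"
    }
  }
}
```

The rest of the pipeline (collector, the streams, enrich) stays exactly as the quickstart sets it up; only the loader’s output configuration changes, which is why building on a working quickstart pipeline first tends to be the smoother route.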
If you’ve already gotten familiar with things via steps 1 and 2, you’ll be in a much better position to debug step 3, and you’ll know a lot more about the possible explanations for anything unexpected in the part where you’re deviating from the quickstart.
Of course it’s up to you how to approach things, but I hope this suggestion is helpful!