Deploying Snowplow Without Storage Backend

Hi everyone,

I’m exploring Snowplow and was wondering if it’s possible to deploy it without a storage backend (e.g., Postgres or Redshift). My goal is to utilize Snowplow’s streaming functionality and load the events directly into Iceberg tables via the Lake Loader.

Is there a way to configure Snowplow to bypass the storage layer while maintaining its event processing pipeline for this use case? Any advice or best practices for this setup would be greatly appreciated!

Hi @Yiannis_Gkoufas!

Yes, that’s a supported architecture, and actually a fairly common one!

To explain a bit - what you’re referring to as ‘Snowplow’ here is what we generally call a ‘Snowplow pipeline’. It’s actually a set of applications and cloud components which connect together (the branding doesn’t lend itself well to that concept, so it can seem like ‘Snowplow’ is a single application). Collector → stream → enrich → stream → loader are the main pieces, but there’s more involved than that.
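
To make the ‘set of applications’ point a bit more concrete, here’s a minimal sketch of how an event enters the pipeline. In practice you’d use one of the official trackers rather than raw HTTP, and the collector hostname here is hypothetical - but it shows that the collector is just an HTTP service at the front, with enrich and the loader consuming the streams behind it.

```python
# Minimal sketch: send a page view to a (hypothetical) collector endpoint using
# the Snowplow tracker protocol. Everything downstream - enrich, Lake Loader -
# just reads from the streams the collector writes to.
import time
import uuid

import requests

COLLECTOR = "https://collector.example.com"  # hypothetical endpoint

params = {
    "e": "pv",                            # event type: page view
    "p": "srv",                           # platform: server-side
    "tv": "manual-0.0.1",                 # tracker name/version (free text for this sketch)
    "aid": "iceberg-poc",                 # app id - your choice
    "eid": str(uuid.uuid4()),             # event id
    "dtm": str(int(time.time() * 1000)),  # device-created timestamp (ms)
    "url": "https://example.com/hello",
    "page": "Hello",
}

# GET /i is the collector's pixel endpoint; a 200 means the event was accepted
# into the raw stream, where enrich and the loader pick it up.
resp = requests.get(f"{COLLECTOR}/i", params=params, timeout=5)
print(resp.status_code)
```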

So what you’re after fits into that typical architecture, and you’ll be looking to use the Lake Loader.

There’s a lot of infrastructure involved though - the easiest way to get started and get your head around it all is to sign up for the community edition, and use that as your jumping off point. The quickstart guide docs will give you a sense of what’s involved in setting that up.

The distributions with a Databricks destination use the Lake Loader - but configured for a different lake format than the one you’re after. So to set things up for Iceberg, you’d need to put a bit of effort into amending that configuration (Iceberg is a supported format, we just don’t have a quickstart for it at present).

This will likely be far simpler than starting from scratch though! If I may suggest what I think is the probable path of least friction - I would:

    1. Sign up for the community edition and set up one of the standard pipelines, with a supported destination, to get started.
    2. Set up some tracking, see things working end-to-end, and explore. At this point my aim would be first to get familiar with things, and second to validate my tracking use case.
    3. Look into the quickstart code for how the Databricks destination is set up, and work on switching the output to Iceberg (see the sketch just after this list for a quick way to check events are landing).
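
Once you get to step 3, something like the sketch below is a quick way to confirm events are actually landing in the lake. It assumes PyIceberg, an Iceberg table registered in an AWS Glue catalog, and the usual atomic.events naming - adjust the catalog settings and names to whatever your Lake Loader config says.

```python
# Sanity check for step 3: read a few rows back from the Iceberg table the
# Lake Loader writes to. Catalog type, database and table names are assumptions.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default", **{"type": "glue"})  # uses your default AWS credentials
table = catalog.load_table("atomic.events")

# Pull a handful of events and eyeball the standard atomic fields
rows = table.scan(
    selected_fields=("event_id", "app_id", "event_name", "collector_tstamp"),
    limit=10,
).to_pandas()
print(rows)
```

If the table loads and you can see fresh collector_tstamp values, the whole chain (collector → stream → enrich → stream → Lake Loader → Iceberg) is doing its job.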

If you’ve gotten familiar with things via steps 1 and 2, you’ll be in a much better position to debug step 3 - and you’ll know a lot more about the possible explanations for anything unexpected in the part where you’re deviating from the quickstart.

Of course up to you how to approach things, but I hope this suggestion is helpful!

Awesome! Thanks so much!
Signed up for the community version and will try to get it to a point where I can work on step 3 and follow up here.