Snowplow and the Apache Iceberg Ecosystem

Hello Snowplow,

We are currently using Snowplow with Snowflake as the CDW of choice and are quite happy with the results.

But we are increasingly thinking about building a central data lake where we would centralize all our data sources and our main transformation layer. This is made possible by some of the new features that the Apache Iceberg table format brings to the lake (like schema evolution, full DML support and so on).

Currently, with RDB Loader, consumption of Snowplow data is mainly focused on the DWH (whether that is BigQuery, Snowflake or Redshift), but I wanted to ask what kind of approach we would have to take if we wanted to build a full Snowplow data modelling stack on S3/Iceberg. Basically, that means having the atomic events table in S3 with some engine behind it (Presto? Spark? Something else?) that would process this data incrementally to build sessions, users and page view tables, all staying in S3.
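
To make it more concrete, here is a rough sketch of the kind of incremental job I have in mind, assuming Spark with the Iceberg runtime on the classpath and a Hadoop catalog on S3. The table and column names are simplified placeholders, not the full atomic schema:

```python
# Rough sketch only: incrementally build a derived sessions table from the
# atomic events table, both stored as Iceberg tables on S3.
# Assumes the Iceberg Spark runtime jar is available on the classpath.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("snowplow-iceberg-modelling")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Only pick up events that arrived since the last run (incremental processing).
# In practice this watermark would come from a state/manifest table.
last_processed = "2023-01-01 00:00:00"
new_events = (
    spark.table("lake.atomic.events")
    .where(F.col("collector_tstamp") > F.lit(last_processed))
)

# Aggregate the new events into per-session rows.
new_sessions = (
    new_events
    .groupBy("domain_sessionid")
    .agg(
        F.min("collector_tstamp").alias("session_start"),
        F.max("collector_tstamp").alias("session_end"),
        F.count("event_id").alias("event_count"),
    )
)
new_sessions.createOrReplaceTempView("new_sessions")

# Iceberg's full DML support lets us MERGE the increment into the derived table.
spark.sql("""
    MERGE INTO lake.derived.sessions t
    USING new_sessions s
    ON t.domain_sessionid = s.domain_sessionid
    WHEN MATCHED THEN UPDATE SET
        t.session_end = s.session_end,
        t.event_count = t.event_count + s.event_count
    WHEN NOT MATCHED THEN INSERT *
""")
```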

I would love to hear about the experience of any users who have tried this, or about Snowplow's future direction in this area.

Thanks!
Anthony


This is definitely something we’re looking into. Some customers are doing this already (a data warehouse on Parquet using Athena / Presto), but I haven’t seen anyone using Iceberg yet (though it’s not a stretch to go from Parquet to Iceberg).

@stanch probably has some thoughts on this and can share if it’s on the current roadmap or not.

Hello Mike,
Thanks for the initial feedback! Definitely interested in taking the discussion further 🙂

Hi @Anthony_Gianastasio

I have a lot of interest in this topic, and I’ve been playing around with some initial ideas of how we could support this.

Most data warehouses now have some level of support for Iceberg or Delta as external tables. My hope is that we will build a Snowplow pipeline that writes events directly into Iceberg/Delta tables, using Parquet as the underlying file format. This could work out much cheaper for Snowplow users, because it wouldn’t require any warehouse compute to load the events.
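
Just to illustrate the shape of the idea (this is only a minimal sketch, not our loader code, and it assumes Spark with an Iceberg Hadoop catalog on S3 and a pre-created table), the loader would essentially append each micro-batch of enriched events straight into an Iceberg table backed by Parquet files:

```python
# Minimal sketch only: append a micro-batch of enriched events into an
# Iceberg table whose data files are Parquet on S3. Not the real loader.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("snowplow-iceberg-loader-sketch")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Pretend this DataFrame holds one micro-batch of enriched events;
# in reality it would come from the enriched stream.
batch = spark.createDataFrame(
    [("e1", "s1", "2023-01-01 00:00:00")],
    ["event_id", "domain_sessionid", "collector_tstamp"],
)

# Iceberg writes Parquet data files by default, so no warehouse compute is
# involved in loading. Assumes lake.atomic.events was created up front.
batch.writeTo("lake.atomic.events").append()
```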

Once the events are written to Iceberg/Delta, users could configure their data warehouse to query the events directly from the Parquet files via an external table. My initial tests show that this setup performs very well for typical queries of Snowplow data.

You mentioned that you currently use Snowflake for Snowplow events, but then you only mentioned Presto and Spark as possible engines for querying Iceberg. Would you also consider using Snowflake as the query engine? Or do you think you would move away from Snowflake, if we enabled loading to Iceberg? Snowflake’s support for Iceberg is still in private preview, but I think it’s likely to be GA soon.

Also, is there a reason you asked about Iceberg, rather than Delta? I think either format could work well for Snowplow events; the main difference is in the level of support from different warehouses.

I should say… even though I am very excited about this idea, we haven’t officially started working on it yet. But I think there is a very high chance that Snowplow moves towards loading Iceberg and/or Delta as a primary destination in the next phase of our roadmap.


Hello Ian,
Thank you so much for your feedback!

Our initial idea was to store everything as Iceberg tables in S3 and then choose the engine best suited to each task: Snowflake for batch BI use cases reading the modelled Iceberg tables directly, Dremio for large-scale self-service analytics, and Spark jobs for some of the transformations. I think Iceberg’s interoperability would allow this kind of flexibility, and having the right tool for each task would mean far fewer data copies and movements, which in the end should translate into major cost savings!
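
As a hypothetical sketch of what I mean (the catalog and table names are made up), a Spark transformation job would read the same modelled Iceberg table that Snowflake or Dremio would query for BI and self-service analytics, since all engines can point at the same Iceberg metadata on S3 with no data copied between them:

```python
# Sketch only: the same Iceberg sessions table on S3 consumed from Spark.
# Snowflake / Dremio would be pointed at the same catalog and metadata files,
# so no data needs to be copied between engines.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("multi-engine-read-sketch")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Spark side: a heavier transformation over the modelled sessions table.
daily = spark.sql("""
    SELECT date_trunc('day', session_start) AS day, count(*) AS sessions
    FROM lake.derived.sessions
    GROUP BY 1
""")
daily.show()
```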

I’ll be more than happy to keep this discussion moving forward as you make progress on analysing and integrating this feature into the Snowplow roadmap!

Have a great day,
Anthony
