Data modeling in real time

Hey Snowplow Community!

I’m working on a project where I need to ingest and model data in near-real-time, aiming for less than a minute from data collection to fully modeled data in my Snowflake warehouse.

I’m using dbt with the snowplow-web package and custom models that rely on the “snowplow_web_base_events_this_run” table.

The Challenge:
I’m looking for the best way to keep my “snowplow_web_base_events_this_run” table continuously updated with fresh data. I’m intrigued by Snowflake’s “dynamic table” feature, but I’m not sure how to integrate it with the snowplow-web dbt package.

Context:

  • Data Warehouse: Snowflake
  • For this discussion, let’s assume data collection is instantaneous and focus on the transformation aspect.
  • dbt Newbie Alert: I’m relatively new to dbt, so detailed advice would be greatly appreciated.

Looking forward to your insights and suggestions. Thanks!
Leonard

I’ll do a bit more research on dynamic tables next week, but my initial understanding of dynamic tables and that sort of near-real-time processing is that, in general, our packages are not well suited to it, and adapting them to be so would be very difficult.

The core reason for this is our sessionisation logic. While most individual derived tables use the standard incremental materialization, the reason we have snowplow_web_base_events_this_run is that it contains not just new events but also all events in the sessions that have new events (with some caveats). This core logic is why we strongly recommend people use our packages for derived processing at a session level. At an event level, though, a basic incremental model would work fine. Because of this logic and the manifest table approach, I’m not convinced the package would easily work with dynamic tables (although I may be wrong after some more time spent understanding them), and it is probably still inefficient for a real-time use case.
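
For illustration, here’s roughly what a basic event-level incremental model could look like in dbt on Snowflake - this is just a sketch, not the package logic, and it assumes you have an `atomic.events` dbt source defined and use `collector_tstamp` as the processing watermark:

```sql
-- Minimal sketch of an event-level incremental model (not the Snowplow package logic).
-- Assumes an `atomic.events` dbt source and the standard collector_tstamp column.
{{
  config(
    materialized='incremental',
    unique_key='event_id'
  )
}}

select
  event_id,
  collector_tstamp,
  event_name,
  domain_sessionid,
  page_urlpath
from {{ source('atomic', 'events') }}

{% if is_incremental() %}
  -- only scan events newer than what this model has already processed
  where collector_tstamp > (select max(collector_tstamp) from {{ this }})
{% endif %}
```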

It depends somewhat on what your exact use case is for needing near-real-time data modelling. I’ll put some personal bias out here and say that in nearly every project I’ve seen where someone wanted near-real-time modelled data, they were actually better off using the raw events stream, or gained no benefit from the low latency. But if you do need it, you may be better off running a much simpler process over a small volume of recent data, and keeping the higher-latency processing from the package for historic information.

As an example, say you do need some sessionisation and you are trying to identify sessions of users with abandoned carts (a classic example, although one where sub-minute latency is not really needed). Instead of trying to do lookups and manage manifests, you could just process all events in the last 2 hours, find sessions with a valid email, items in their cart, and no events for the last 5 minutes - you can then assume those users have abandoned their cart and trigger something like an email to them. It won’t be perfect, but you’ll catch the vast majority of cases. The only additional thing you’d have to manage on top is excluding sessions you’ve already contacted, but that again depends on your exact use case, and you’d only have to store those within that same 2-hour window (and your email tool may even have rules to avoid sending the same email within a certain time, so in this specific example you might not need to track it at all).
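
To make that concrete, here’s a rough sketch of what that “recent window” query could look like against the raw events table - the email and cart-related fields are purely illustrative placeholders, not columns from the standard Snowplow schema:

```sql
-- Sketch of the "recent window" approach: no manifests, just the last 2 hours of events.
-- user_email and the cart/purchase event names are illustrative placeholders, not part of
-- the standard Snowplow schema.
with recent_events as (
  select *
  from atomic.events
  where collector_tstamp >= dateadd('hour', -2, current_timestamp())
),

sessions as (
  select
    domain_sessionid,
    max(user_email)                        as user_email,   -- hypothetical enriched field
    count_if(event_name = 'add_to_cart')   as items_added,  -- hypothetical event name
    count_if(event_name = 'transaction')   as purchases,
    max(collector_tstamp)                  as last_event_tstamp
  from recent_events
  group by domain_sessionid
)

select *
from sessions
where user_email is not null
  and items_added > 0
  and purchases = 0
  -- no activity for the last 5 minutes => treat the cart as abandoned
  and last_event_tstamp < dateadd('minute', -5, current_timestamp());
```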

It’s probably a little more complex than you were hoping for. While we may support dynamic tables in the future (as dbt-snowflake is looking to add support for them in general), I don’t believe the core logic of our packages is well suited to low-latency processing. dbt in general is also a batch tool; again, you may be better off using the raw events if possible, via something like Snowbridge and a Spark-based transformation tool.

If you can share more details about your specific use case we might be able to offer some additional suggestions?

Thank you for your quick response!

I can start by giving you a bit of detail on my use case. I’m looking to collect web analytics and display them within my product. Specifically, I have content creators who will create web pages and publish them. Visitors will connect to these web pages, and we’ll collect the time they spend on each page and each section of the page (thanks to pings inspired by activity tracking, cf. this thread). Then, I want to provide content creators with graphs on the time spent by each visitor on each page/section.

Indeed, I think my use case doesn’t really require session data. I admit I used Snowplow’s dbt package because I understood it to be the recommended approach. I assumed, among other things, that the package handled edge cases such as event de-duplication.

So, would you recommend that I skip the dbt package and use the event table directly? If so, what are the edge cases handled by the package that I should be careful to manage manually?

This specific bit has the same issues as sessionization. You need to count all the page pings and aggregate that up to the pageview. It can be done in real time but it will be more complex and potentially expensive.

Do you really need this information in hard real-time? It gets significantly less complex at timescales of minutes instead of seconds.
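
To show where the re-aggregation comes in, this is roughly what the per-page-view rollup looks like against the raw events - the web_page context column name and the 10-second ping heartbeat are assumptions about the Snowflake loader and the tracker’s activity tracking configuration, so check them against your own setup:

```sql
-- Rough engaged-time estimate per page view, straight from the raw events.
-- The web_page context column name and the 10-second heartbeat are assumptions
-- about the Snowflake loader and the tracker's activity tracking configuration.
with page_events as (
  select
    contexts_com_snowplowanalytics_snowplow_web_page_1[0]:id::varchar as page_view_id,
    page_urlpath,
    event_name,
    collector_tstamp
  from atomic.events
  where event_name in ('page_view', 'page_ping')
    and collector_tstamp >= dateadd('hour', -2, current_timestamp())
)

select
  page_view_id,
  page_urlpath,
  min(collector_tstamp)                    as first_event_tstamp,
  -- one ping per heartbeat interval
  count_if(event_name = 'page_ping') * 10  as approx_engaged_time_s
from page_events
group by page_view_id, page_urlpath;
```

Every new page ping that lands changes the aggregate for its page view, which is why this gets re-computed rather than simply appended.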

Hi @Leonard_Henriquez

Thanks for bringing up your case. I recently started researching solutions for similar use cases and came across Materialize and Tinybird.

I haven’t worked with either of them yet, but both seem promising - especially Tinybird. You can stream and load data into Tinybird (a ClickHouse-based SaaS) to develop real-time data products; e.g. Vercel is using it to provide real-time analytics to their customers. Details in this story: Vercel Customer Story · Tinybird

Since you just need page pings and maybe page views, the following architecture might be worth trying: Snowplow → Snowbridge (filtered to only e.g. page views/pings, forwarded over HTTP) → Tinybird → your app.

PS: You know your product best, but maybe think about whether there is actually a need for real-time data in your use case. Maybe it would be enough to provide engagement data to your customers with a 15-minute delay?

Hope that helps.
David

Hi @davidher_mann !

Thanks for your advice! I’ve also come across Materialize and Tinybird in my benchmarking efforts. Both solutions look incredible. However, Snowflake has introduced features that allow for similar types of materialization. Also, having Snowplow’s data in my main data warehouse is quite practical, as I can easily join it with all my business data already available on Snowflake. I’d prefer not to introduce a new moving part unless absolutely necessary.

Regarding the delay, I don’t require actual real-time processing with sub-second latency, but I do need near-real-time capabilities (under a few minutes). This is important because we market to our users the benefit of being able to react almost instantly to new activity.

@Simon_Rumble , could you clarify how my requirements might be easier to meet at timescales of minutes rather than seconds?

Using an SQL database at, let’s say, 30-second resolution, you’re going to be re-running your query at least every 30 seconds. How big a window is that going to re-read and re-count? What’s that costing? Running every 5 minutes is going to be quite a bit cheaper, and is data that’s about five minutes stale really a problem? It’s a cost/benefit calculation you need to make.

There are alternative approaches of course, but then you’re introducing new complexity, as you said new moving parts.

So I’ve been reading more on dynamic tables, and my current understanding is that they are (in incremental mode) basically SQL on top of CDC from the source tables. No article, Medium post, blog, or doc I can find even has an example that aggregates data using a GROUP BY (though it is supported) in an incremental way - that’s a big worry for me, and aligns with what I thought, which is that they aren’t really suitable for aggregation unless you know your next batch will contain all the relevant records for that group. You could technically do aggregation in an incremental way (even for division, but not distincts), but you run the risk that if any update goes wrong you might not have a good fallback state (less likely when using dynamic tables outside of dbt; dbt has quite a poor mechanism for failures and rollbacks).
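
For reference, this is the sort of thing I mean - an illustrative dynamic table doing a GROUP BY over the events table. The warehouse name and target lag are placeholders, and whether Snowflake refreshes it incrementally or falls back to a full recompute depends on the query shape, so that’s worth verifying:

```sql
-- Illustrative dynamic table doing a GROUP BY aggregation over the events table.
-- Warehouse name and target lag are placeholders; whether the refresh can run incrementally
-- or falls back to a full recompute depends on the query, so verify against your own setup.
create or replace dynamic table page_ping_counts
  target_lag = '1 minute'
  warehouse  = transform_wh
as
select
  domain_sessionid,
  page_urlpath,
  count_if(event_name = 'page_ping') as ping_count,
  max(collector_tstamp)              as last_activity
from atomic.events
group by domain_sessionid, page_urlpath;
```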

Assuming you just need page view/engagement information, and you want to streamline everything as much as possible, I think your best shot may be fine-tuning our package to process only what is needed for the derived page views table, then building a view (or, probably, a materialized view to optimize the querying) that does a “pretty good” estimate from the events since the last processed time (using our manifest table), and then unifying the two. That way you get the benefit of the rigour of the Snowplow dbt packages, but the low latency that you require.
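
Very roughly, the unified view might look something like the sketch below - the table and column names (the derived page views table, the manifest table, `engaged_time_in_s`, the 10-second heartbeat) are assumptions to check against your own project rather than exact package names:

```sql
-- Sketch of unifying the package-built derived table with a low-latency estimate of
-- anything newer than the package's last run. All table and column names below are
-- assumptions to verify against your own project and warehouse layout.
create or replace view page_views_low_latency as

with last_processed as (
  select max(last_success) as last_success
  from derived.snowplow_web_incremental_manifest
),

recent_estimate as (
  -- "pretty good" page view rows for events the package has not processed yet
  select
    contexts_com_snowplowanalytics_snowplow_web_page_1[0]:id::varchar as page_view_id,
    page_urlpath,
    min(collector_tstamp)                     as start_tstamp,
    count_if(event_name = 'page_ping') * 10   as engaged_time_in_s  -- 10s heartbeat assumed
  from atomic.events, last_processed
  where event_name in ('page_view', 'page_ping')
    and collector_tstamp > last_processed.last_success
  group by 1, 2
)

select page_view_id, page_urlpath, start_tstamp, engaged_time_in_s, false as from_package
from recent_estimate

union all

select page_view_id, page_urlpath, start_tstamp, engaged_time_in_s, true as from_package
from derived.snowplow_web_page_views;
```

You’d also want to think about page views that straddle the boundary (started before the package’s last run but still receiving pings), which a simple union like this would count on both sides.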

The other option would be to try to build what you need using dynamic tables. You’d get pretty close without needing to use any distincts, but it won’t be perfect, and you may not be able to manage duplicates as precisely (although with a low enough ping granularity, it may not matter much). I think dbt as a product isn’t well positioned to offer low-latency data, and our packages unfortunately prioritise correctness over speed in this case. The other thing to be careful of is how this scales, which dbt would struggle with.

I think it’s going to be tricky whatever path you take - a bit like a certain uncertainty principle, the quicker your data is, the less accurate your output might be - but I’m interested to see what you come up with!


If I were to take this kind of task on, I’d look at it this way:

As has already been flagged, I’d examine whether this needs to be done in real time rather than just at reasonably low latency. The incremental model can run quite frequently, so if a matter of minutes is acceptable I’d look at that as the solution.

If I did have to do it in real time, I would look at whether the old drop-and-recompute model is better suited to being adapted to work with dynamic tables. It may not be suitable, but I think it has a better chance.

That model is long deprecated, however, so you’d likely be investing yourself in technical debt headaches here; I’d strongly consider whether I could compromise elsewhere instead. :slight_smile: