Data modeling using Map Reduce

I was going through the book in which lambda architecture (Snowplow Lambda Architecture) was explained . In chapter 6 author was explaining how to compute batch views . He explained about importance of recomputation algorithms in computing batch views and as well how to compute batch views .

He was suggesting to run Map Reduce on Master data set to compute derived tables (Modeled tables) and load data to database .

In snowplow data-modeling is done by running queries on database join and compute new table and load it back in database .

How scalable is this solution and What are pro’s and con’s of this above approaches?

Hey @shashankn91,

Good question!

In snowplow data-modeling is done by running queries on database join and compute new table and load it back in database .

That’s indeed how most of our users do it. They load their Shredded Events into Redshift and run a SQL data modeling process within Redshift that outputs a set of derived tables.

The benefit of this approach is that it’s fairly easy to get a SQL data model up and running and iterate on it (the cluster is running 24/7 regardless and the model recomputes on the full dataset each time).

I usually recommend going with a SQL data modeling process if you don’t expect to quickly hit a billion total events. Once above that threshold, I usually see performance or Redshift costs become an issue. You can still use SQL if you expect to collect more data, but I’d instead use it to prototype the data model before porting it to something like Spark. An alternative option is to write an incremental model (tutorial). Another downside, however, is that SQL is not a great language for writing scalable and robust data modeling processes.

An alternative approach–and one we’re very bullish on–is to run the data modeling process on the Enriched Events, upstream of Redshift. We’re particularly excited about Spark. My colleague Anton wrote an excellent post on the topic: Replacing Amazon Redshift with Apache Spark for event data modeling [tutorial]


@christophe Thanks for detailed explanation . I read the blog post of @anton and it is very helpful .

I got to know about Factotum and how to use it with Spark . I will surely it give it a try .

This is kind of extension to my earlier question .

Currently in Lambda architecture(Snowplow uses Lambda Architecture) we are having separate pipeline for Speed Layer (real time) and Batch Layer . Jay Kreps wrote an article on questioning-the-lambda-architecture , what is a log ? and Merging batch streaming post lambda world . In this articles he explains new architecture called Kappa architecture which he developed for LinkedIn .

In Kappa architecture we could avoid 2 separate pipelines and have near realtime analytics . Now that @anton was suggesting of Spark for Data modeling I think Kappa architecture should be able to fit in easily with Snowplow. I think @anton thoughts are close to kappa architecture and as well every design has its own pro’s and Con’s.

I was just wondering if Snowplow have plans to provide platform to do similar thing in Snowplow ?

1 Like

It’s worth having a read of the recent RFC from @alex on porting the Snowplow pipeline to GCP as this makes more of a move away from Lambda in terms of moving stream processing into Beam/Dataflow rather than having to rely on something like EMR.

There are still some tricky issues around streaming (like deduplication and exactly once semantics) but it certainly looks like it’s an interesting way forward for analytics infrastructure.

1 Like

This might not be quite the right place for this question. In @anton’s blog post the processing starts from the enriched events. Why not start from the shredded and also use the shredded self describing events in the Spark analysis?

Hey @gareth - why not ask that question on the tutorial itself?

Because it’s already been asked? Sorry! I failed to scroll far enough before asking the question. Too long an afternoon clearly.