Stream vs Batch

Curious, how many are using Stream vs Batch or both?

We are using both.

Too bad Discourse doesn’t support polls! We’re (obviously) using both at Snowplow.


We’re just using batch at Metail.

What is your use case for both?

The common understanding I have is that you can use batch for the majority of analysis, and stream for real-time actioning or decisioning that feeds into other applications (mobile notifications, triggered emails, etc.)

Oh, and for real time dashboards as well.

We use batch, run it hourly, and trust it for our downstream data marts. We use stream to get real-time updates when we can't wait for the hourly batch jobs. We see small differences between batch and stream, and we're trying to figure out why stream sometimes has fewer events than batch.

I think I’d like to know what a dual set up looks like, the closest I’ve gotten to the real-time pipeline is our test Snowplow Mini instance.

My naive imagination is something like running one collector which writes to Kinesis, which writes the enriched and shredded data to S3 for batch analyses, perhaps loading the batches into a DB, certainly getting them onto a dashboard. You'd also run some custom analysis on the real-time stream which goes into your dashboards, but not persist this for very long. However, now that I've skimmed the Stream Enrich docs, it sounds like it only does the enrich step.

Where do you duplicate the events in the pipeline? Are you able to write the output of the stream enrich to S3 and start the batch from the shredding step? Got any pros and cons of your chosen solution?

Thanks

The common term for the infrastructure stack you are describing is the so-called 'Lambda architecture' (no relation to the AWS Lambda service), as described in this post:

https://discourse.snowplow.io/t/how-to-setup-a-lambda-architecture-for-snowplow/249

I think the most common use case would look like this (at least that’s what I set up recently):

  • an ASG with scala stream collectors behind a load balancer writing to a raw event kinesis stream
  • 2 consumers for the raw kinesis stream:
    - s3 loader app sinking raw lzo files to s3 bucket
    - scala-stream-enrich app (also in an ASG), sinking enriched events to enriched_good and enriched_bad kinesis streams
  • elasticsearch-loader app consuming enriched_good kinesis stream and sinking events into your elasticsearch cluster
  • elasticsearch-loader app consuming enriched_bad kinesis stream and sinking events into your elasticsearch cluster
  • emr-etl-runner for the batch pipeline, reading lzo files from the s3 bucket
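To make the enriched_good/enriched_bad fan-out in the list above concrete, here is a minimal sketch in plain Python. The real Stream Enrich app is a Scala application, and the `enrich`/`route` names and the `app_id` check are purely illustrative, not Snowplow's actual logic:

```python
# Hypothetical sketch of the enriched_good / enriched_bad split.
# Stream Enrich itself is a Scala app; these names are illustrative only.

def enrich(event: dict) -> dict:
    """Pretend enrichment step: require an 'app_id' field."""
    if "app_id" not in event:
        raise ValueError("missing app_id")
    return {**event, "enriched": True}

def route(raw_events):
    """Split raw events into (good, bad) lists, mirroring the
    enriched_good and enriched_bad Kinesis streams."""
    good, bad = [], []
    for event in raw_events:
        try:
            good.append(enrich(event))
        except ValueError as err:
            # Failed events are kept with their error, as in the bad stream
            bad.append({"event": event, "error": str(err)})
    return good, bad

good, bad = route([{"app_id": "web"}, {"platform": "mob"}])
```

The point is simply that every raw event ends up in exactly one of the two downstream streams, which is why both get their own elasticsearch-loader consumer.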

It is convenient to have all instances in ASGs, since it can be difficult to predict the performance of the real-time pipeline. You want monitoring in place so you can tune instance sizes and buffer settings and, eventually, set up autoscaling for your components, including the Kinesis shards. This is much easier than with, for example, the Clojure collector running on Elastic Beanstalk: its logs are only published to S3 every hour, so downscaling requires extra work to prevent data loss.
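As a rough illustration of the shard-scaling arithmetic involved (the helper and numbers below are mine, not part of any Snowplow tool): each Kinesis shard accepts at most 1 MB/s or 1,000 records/s of writes, so a target shard count can be derived from observed throughput plus some headroom:

```python
import math

# Illustrative helper (not from Snowplow): derive a target Kinesis
# shard count from observed throughput, given per-shard write limits
# of 1 MB/s and 1,000 records/s, with a safety headroom factor.
def target_shards(mb_per_sec: float, records_per_sec: float,
                  headroom: float = 1.25) -> int:
    by_bytes = mb_per_sec * headroom / 1.0         # 1 MB/s per shard
    by_records = records_per_sec * headroom / 1000.0
    return max(1, math.ceil(max(by_bytes, by_records)))

# e.g. a sustained 3.2 MB/s and 2,500 records/s:
print(target_shards(3.2, 2500))  # 4
```

A scheduled job could feed CloudWatch throughput metrics into something like this and resharding the stream accordingly, though in practice you would also want to rate-limit how often you reshard.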


Just a small addition to this awesome thread. Yesterday we released R102 Afontova Gora with EmrEtlRunner's new Stream Enrich mode, which makes our Lambda architecture much simpler and more efficient. Now you don't need to perform enrichment twice (in Spark Enrich and Stream Enrich) or deal with complicated Dataflow Runner configurations.


Hi Gareth. Here’s a diagram I use to illustrate the real-time pipeline.

Instead of using S3 buckets for inputs and outputs, it has Kinesis in between. The enriched output goes both to Kinesis (and then Elasticsearch) and out to S3, where it's loaded into a relational DB in batches.

This is a high-level diagram. The latest release notes have a good description of what’s happening under the hood which is a little different.


Thanks @kazgurs1, I can see where the duplication comes into the pipeline. Nice explanation.

Also thanks @Simon_Rumble that’s a nice diagram. The latest release does appear to simplify things and reduce duplication of CPU cycles. That does make it more attractive.

We've never gone for a true Lambda architecture: our batch pipeline is modelled on the batch half of one, but we've never siphoned off a real-time stream. The latest release does look like it will make this setup much easier to deploy and maintain.