Stream vs Batch

mjensen · March 13, 2018, 3:22pm

Curious, how many are using Stream vs Batch or both?

We are using both.

christophe · March 13, 2018, 3:34pm

Too bad Discourse doesn’t support polls! We’re (obviously) using both at Snowplow.

gareth · March 14, 2018, 10:02am

We’re just using batch at Metail.

What is your use case for both?

jrpeck1989 · March 14, 2018, 10:04am

The common understanding I have is that you can use batch for the majority of analysis, and stream for real-time actioning or decisioning - to go into other applications (mobile notifications, triggered emails etc.)

Oh, and for real time dashboards as well.

mjensen · March 14, 2018, 11:46am

we use batch and run it hourly and trust this for our data marts downstream. we use stream to get real-time updates if we can’t wait for hourly batch jobs. we see small differences between batch and stream and trying to figure out why stream has less sometimes than batch.

gareth · March 14, 2018, 12:15pm

I think I’d like to know what a dual set up looks like, the closest I’ve gotten to the real-time pipeline is our test Snowplow Mini instance.

My naive imagination is something like running one collector which writes to kinesis, that writes the enriched and shredded data to S3 for batch analyses, perhaps loading the batches into a DB, certainly getting onto a dashboard. You also run some custom analysis on the real-time stream which goes into your dashboards but don’t persist this for very long. However now I’ve skimmed the stream enrich docs and it sounds like it only does the enrich step.

Where do you duplicate the events in the pipeline? Are you able to write the output of the stream enrich to S3 and start the batch from the shredding step? Got any pros and cons of your chosen solution?

Thanks

kazgurs1 · March 16, 2018, 8:20am

The common term for the infrastructure stack you are describing is the so called ‘Lambda architecture’ (not in any relation to AWS Lambda service), as described in this post:

https://discourse.snowplow.io/t/how-to-setup-a-lambda-architecture-for-snowplow/249

I think the most common use case would look like this (at least that’s what I set up recently):

an ASG with scala stream collectors behind a load balancer writing to a raw event kinesis stream
2 consumers for the raw kinesis stream:
- s3 loader app sinking raw lzo files to s3 bucket
- scala-stream-enrich app (also in an ASG), sinking enriched events to enriched_good and enriched_bad kinesis streams
elasticsearch-loader app consuming enriched_good kinesis stream and sinking events into your elasticsearch cluster
elasticsearch-loader app consuming enriched_bad kinesis stream and sinking events into your elasticsearch cluster
emr-etl-runner for the batch pipeline, reading lzo files from the s3 bucket

It is convenient to have all instances in ASGs, since it can be difficult to predict the performance of the RT pipeline and so you want to have monitoring in place, tune instance sizes/buffer settings and in the end, set up autoscaling for your components, including the kinesis shards. This is much easier accomplished compared to e.g. clojure collector running on ElasticBeanstalk - the collector logs are published every hour to s3, so downscaling requires extra work to prevent data loss.

anton · April 4, 2018, 4:44am

Just a small addition to this awesome thread. Yesterday we’ve released R102 Afontova Gora with new EmrEtlRunner’s Stream Enrich mode, which makes our Lambda architecture much simpler and more efficient. Now you don’t need to perform enrichment twice (in Spark Enrich and Stream Enrich) nor deal with complicated Dataflow Runner configurations.

Simon_Rumble · April 4, 2018, 6:02am

Hi Gareth. Here’s a diagram I use to illustrate the real-time pipeline.

Instead of using S3 buckets for inputs and outputs it has Kinesis in between. The enriched output goes to both Kinesis (and then Elasticsearch) as well as out to S3 where it’s loaded into a relational DB in batches.

This is a high-level diagram. The latest release notes have a good description of what’s happening under the hood which is a little different.

gareth · April 4, 2018, 12:46pm

Thanks @kazgurs1 I can see where the duplication comes into the pipeline. Nice explanation.

Also thanks @Simon_Rumble that’s a nice diagram. The latest release does appear to simplify things and reduce duplication of CPU cycles. That does make it more attractive.

We’ve never gone for a true lambda architecture, our batch pipeline is modelled on the batch part of a lambda architecture but we’ve never siphoned off a real-time stream. The latest release does look like it will make this setup much easiest to deploy and maintain.

Topic		Replies	Views
Why is Snowplow using Kinesis/Kafka for real-time pipeline? AWS real-time pipeline	4	6038	July 12, 2016
Is my version of snowplow lambda architecture correct For engineers	3	2213	May 17, 2018
Reading data from raw stream to both batch and real time For engineers	2	923	November 27, 2017
Configuring Batch + Real-time Pipelines in Parallel For engineers	6	2047	January 17, 2023
On-premise Snowplow Realtime Pipeline with Spark Streaming Enrich RFCs	1	4137	June 25, 2017

Stream vs Batch

Related topics