My goal is to set-up the batch + real-time pipelines in parallel so that we can access the real-time event data in Elasticsearch while also loading via batches to Redshift.
From my research (across Discourse and the legacy Google Group), it appears that this is possible, by doing the following -
- Set-up the Scala Stream collector, write to a Kinesis stream
- [batch] Set-up the Kinesis LZO S3 sink to write the raw events to S3
- [batch] Configure the EMR ETL Runner / StorageLoader to enrich + load thrift records from S3
- [real-time] Set-up a parallel stream enrichment process
- [real-time] Set-up the Kinesis Elasticsearch Sink
Is this correct?
Exactly right you are!
Bear in mind that this scenario implies essentially two enrichment processes (one for each of the pipeline). We are working on “unifying” this step to be common for both of the pipelines running in parallel.
Thanks @ihor, I’m glad to hear it.
Is there an example of the configuration for this process? Essentially, I’m trying to understand how to feed each enrichment processes (batch + real-time) from the same collector output.
Hopefully, the below “workflow diagram” showing one of the typical scenarios will clarify it.
|-> Kinesis Raw Events Stream
|-> [BATCH]: Kinesis S3 -> Amazon S3 (raw) -> EMR -> Amazon S3 (shredded) -> Redshift
| \ Orchestrated by EmrEtlRunner & StorageLoder /
|-> [REAL-TIME]: Kinesis Enrich
|-> Bad Events Stream -> Kinesis Elasticsearch Sink -> Elasticsearch
|-> Enriched Events Stream -> Kinesis Elasticsearch Sink -> Elasticsearch
|-> (optional) Custom stream job (ex. AWS Lambda)
Raw events stream is common for both pipeline
Batch: standard batch pipline with
raw:in bucket (in
config.yml) pointing to
Amazon S3 (raw) in the diagram
Real-Time: You would need at least 2 streams (for enriched and bad events)
Got it – we can run both the batch and real-time processes on all of our data from a single Kinesis raw events stream.
Thanks for the clarification!
Hi all this is an old thread but we are looking to do a similar thing as in have parallel pipelines where we are batching into redshift. I read elsewhere that snowplow has moved away from this but for us this is the right architecture as it is vastly cheaper to batch into redshift.
Is this still possible? We implemented the bootstrap to try out snowplow and connected kinesis to redshift however behind the scenes it is constantly hitting redshift which is driving our costs up unnecessarily. we would be find with batching a few times a day / daily.
@yangabanga spark transformer is actively maintained, but we do not provide a terraform stack to go with it.
You could also consider increasing window size here in steam transformer to 60min or longer to get fewer batches. It is likely to be a lot cheaper than running EMR, even at minimum configuration.