Hi Snowplowers,
I presume that once we have the batch pipeline set up successfully, we move on to real-time (RT). I came across this great post from @ihor, and I also came across other posts mentioning that drip-feeding into Redshift wasn’t yet possible.
Am I correct in saying that if I have the batch pipeline set up, I can add the RT stream by setting up
Kinesis Enrich --> Kinesis Good/Bad Streams --> Kinesis ES Sink --> Kibana? The only difference would be running two parallel streams, and in future we might not require batch at all once the Redshift drip feed is ready?
If you set up the real-time pipeline you’ll get both the Elasticsearch sink (real-time) and the Redshift sink (batch) without duplicating architecture.
I’m not sure about the feasibility of drip-feeding Redshift; from the implementations I’ve seen, it never quite works that well. Redshift is an excellent data warehouse but a poor real-time analytics database.
If you already have the batch pipeline set up, there is no equivalent lambda architecture you can bolt on to bring the real-time load into Elasticsearch. This is because the Clojure Collector cannot feed our real-time pipeline - it’s a fundamentally batch-oriented collector (it rotates its logs to S3 hourly).
Your options are:
1. Set up a complete end-to-end real-time pipeline in parallel (i.e. starting from the Scala Stream Collector onwards)
2. Rebuild your setup to use the standard Snowplow lambda architecture that you reference
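With option 1, each real-time component is a separate process driven by its own config file, and the Elasticsearch sink is what ties the enriched Kinesis stream to Kibana. As a rough sketch of the shape that sink’s configuration takes (the key names and stream names below are placeholders, not the sink’s actual reference config — check the kinesis-elasticsearch-sink repository for the real one):

```
# Illustrative sketch only: key names and stream names are hypothetical.
sink {
  # Read enriched events from the "good" Kinesis stream produced by enrich
  source = kinesis
  kinesis {
    in  { stream-name = "enriched-good" }  # hypothetical stream name
    out { stream-name = "es-sink-bad" }    # failed inserts go to a bad stream
    region = "us-east-1"
  }
  # Write into the Elasticsearch cluster that Kibana queries
  elasticsearch {
    endpoint = "localhost"
    port     = 9200
    index    = "snowplow"
  }
}
```

The point of the sketch is just that the sink sits between the enriched good stream and Elasticsearch, with its own bad stream for failed inserts - the real parameter names are in each component’s example config.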