My goal is to set up the batch and real-time pipelines in parallel, so that we can access the real-time event data in Elasticsearch while also loading into Redshift in batches.
From my research (across Discourse and the legacy Google Group), it appears this is possible by doing the following:
Set up the Scala Stream Collector to write raw events to a single Kinesis stream (a minimal collector sketch follows this list)
[batch] Set up the Kinesis LZO S3 sink to write the raw events from that stream to S3
[batch] Configure the EMR ETL Runner / StorageLoader to enrich and load the Thrift records from S3
[real-time] Set up a parallel Stream Enrich process reading from the same stream
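For reference, the collector side is simple: its only job is to write every raw event to one Kinesis stream, and that stream becomes the join point for both branches. A minimal sketch of the collector config, assuming a recent Scala Stream Collector (the exact key layout varies between releases, and the port, region, and stream names below are placeholders):

```hocon
# Sketch of a Scala Stream Collector config (HOCON).
# Check the sample config.hocon shipped with your collector release;
# key names have moved around between versions.
collector {
  interface = "0.0.0.0"
  port = 8080

  streams {
    # The single raw stream that BOTH the batch branch (S3 sink)
    # and the real-time branch (Stream Enrich) will read from.
    good = "snowplow-raw-good"
    bad  = "snowplow-raw-bad"

    sink {
      enabled = kinesis
      region  = "us-east-1"
    }
  }
}
```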
Bear in mind that this scenario implies essentially two enrichment processes (one for each pipeline). We are working on “unifying” this step so that it is common to both pipelines running in parallel.
Is there an example of the configuration for this process? Essentially, I’m trying to understand how to feed each enrichment process (batch and real-time) from the same collector output.
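To make that concrete, here is roughly what I am picturing: the Kinesis LZO S3 sink and Stream Enrich each configured with the same input stream, distinguished only by their KCL application names, so that each consumer keeps its own checkpoint table in DynamoDB. Key names are approximate and version-dependent, and the stream, bucket, and app names are placeholders (the two configs would of course live in separate files):

```hocon
# Consumer 1: Kinesis LZO S3 sink (batch branch).
# Reads the raw stream and sinks LZO files to S3 for the EMR ETL Runner.
sink {
  kinesis {
    in {
      stream-name      = "snowplow-raw-good"  # same stream the collector writes to
      initial-position = "TRIM_HORIZON"
    }
    region   = "us-east-1"
    app-name = "snowplow-s3-sink"             # unique KCL app name => own DynamoDB lease table
  }
  s3 {
    region = "us-east-1"
    bucket = "my-raw-events"
    format = "lzo"
  }
}

# Consumer 2: Stream Enrich (real-time branch).
# Reads the SAME raw stream independently; Kinesis fans out to
# multiple consumer applications without any extra setup.
enrich {
  streams {
    in {
      raw = "snowplow-raw-good"               # same stream again
    }
    out {
      enriched = "snowplow-enriched-good"
      bad      = "snowplow-enriched-bad"
    }
    app-name = "snowplow-stream-enrich"       # different app name from the S3 sink
    region   = "us-east-1"
  }
}
```

If that is right, then nothing special is needed at the collector at all: both branches simply subscribe to the same raw stream.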
Hi all, I know this is an old thread, but we are looking to do a similar thing: run parallel pipelines where we batch into Redshift. I read elsewhere that Snowplow has moved away from this, but for us it is the right architecture, as it is vastly cheaper to batch into Redshift.
Is this still possible? We implemented the bootstrap to try out Snowplow and connected Kinesis to Redshift; however, behind the scenes it is constantly hitting Redshift, which is driving our costs up unnecessarily. We would be fine with batching a few times a day, or daily.
@yangabanga the Spark transformer is actively maintained, but we do not provide a Terraform stack to go with it.
You could also consider increasing the window size in the stream transformer to 60 minutes or longer to get fewer batches. It is likely to be a lot cheaper than running EMR, even at the minimum configuration.
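Roughly, the relevant part of the streaming transformer's config looks like the sketch below; the `windowing` setting controls how often a window is finalised and handed to the loader, so fewer windows means fewer loads into Redshift. Field names follow the transformer-kinesis sample config as I recall it, so double-check against the release you deploy; the stream name and bucket are placeholders:

```hocon
{
  # Enriched-events stream the transformer reads from (placeholder name)
  "input": {
    "streamName": "snowplow-enriched-good"
  }

  # Finalise a window every 60 minutes instead of the default few
  # minutes: one batch (and one Redshift load) per hour.
  "windowing": "60 minutes"

  # Where transformed output is staged for the loader (placeholder bucket)
  "output": {
    "path": "s3://my-transformed-events/transformed/"
  }
}
```

With an hourly window the loader should stop hitting Redshift continuously, and you avoid the fixed cost of keeping an EMR cluster around for the batch pipeline.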