Migration from batch processing to (near) real-time


We are currently running EmrEtlRunner (ver. 104) and use CloudFront as our collector.
We would like to move to near-real-time event processing step by step, starting by replacing the CloudFront collector with the Scala Stream Collector.

I’ve set up the collector, which writes to a Kinesis stream.
The data is consumed by Kinesis Firehose, which saves it to S3.

So far, everything is working.

Then I found out that the record format differs between the collectors, and that I need to use the Kinesis LZO S3 Sink to consume the data from Kinesis Firehose and save it to S3 in the right format so that EmrEtlRunner can process it.
I looked for its documentation, but it seems that the repository no longer exists.
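
To illustrate why the on-disk format matters and not just the destination: Firehose writes the records it receives back to back, so opaque binary payloads end up in one blob with no record boundaries unless the writer adds framing. The sketch below is a generic illustration only. The 4-byte length prefix and the sample payload bytes are hypothetical, chosen for the example, and are not the format that the S3 sink or EmrEtlRunner actually uses.

```python
import struct


def frame(records):
    """Prefix each record with its length as a 4-byte big-endian integer."""
    return b"".join(struct.pack(">I", len(r)) + r for r in records)


def unframe(blob):
    """Split a length-prefixed blob back into the original records."""
    records, offset = [], 0
    while offset < len(blob):
        (size,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        records.append(blob[offset:offset + size])
        offset += size
    return records


# Hypothetical binary payloads standing in for collector records.
payloads = [b"\x0b\x00\x01event-a", b"\x0b\x00\x01event-b\x00"]

# Back-to-back concatenation: the record boundaries are lost.
concatenated = b"".join(payloads)

# Framed output: a reader can recover each record exactly.
framed = frame(payloads)
recovered = unframe(framed)
```

Here `recovered` matches `payloads`, while `concatenated` carries no information about where the first payload ends.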

Does anyone know whether there is a new project for this, or any other solution?

  • We are trying to set things up in Docker containers, so a solution built for that would be highly appreciated :slight_smile:


Sorry, I found the repository.

Hi @moshesh - could you let us know where the documentation references that old path so we can update it?

For those who run into a similar issue, the project has been moved here:

The docker image is available from here:

Hi @josh,

Sorry for the delayed reply - it is the S3 Loader link under the “thrift” section:

And if you are working on documentation fixes (:wink:), the configuration example for the Elasticsearch Loader does not work with the latest version. I had to download the source code of the latest version and use the example from there.


Hi Moshesh, when you migrated to real time, did you compare your batch data with your real-time data? If so, how did you handle cases where the batch data and the near-real-time data did not match?