The documentation implies that the tool that mirrors the contents of a Kinesis stream to S3 files was developed to read raw events from a collector. My question is whether that tool can read enriched events from a Kinesis stream instead.
In other words, assuming my goal is only to load enriched events into Redshift, can I use a pipeline like:
Absolutely. We use the kinesis-s3 sink to store both raw and enriched events. It's actually a good idea to keep the raw events around in case you need to re-enrich them later on. Even better: you can use gzip compression for the enriched events, which is easier to handle for Spark clients, for example.
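For instance (a minimal sketch, not official Snowplow code; the bucket path is a placeholder for wherever your kinesis-s3 sink writes gzipped enriched events), Spark decompresses the .gz files transparently when you read them:

```scala
// Sketch: reading gzipped Snowplow enriched events with Spark.
// The S3 path below is hypothetical -- substitute your own.
import org.apache.spark.sql.SparkSession

object ReadEnrichedEvents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-enriched-events")
      .getOrCreate()

    // Spark handles the .gz codec transparently; each line is one
    // tab-separated enriched event.
    val events = spark.read.textFile("s3a://my-bucket/enriched/good/")
    val fieldCount = events.rdd
      .map(_.split("\t", -1).length)
      .first()

    println(s"First event has $fieldCount tab-separated fields")
    spark.stop()
  }
}
```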
This specific topology won't work, because the StorageLoader doesn't load enriched events directly into Redshift, but rather loads "shredded" enriched events into Redshift. Currently the only place this shredding logic lives is in our Hadoop Shred job, so you can't avoid running EMR if you want to load Redshift.
In that linked Lambda Architecture article, I don't see any way to get data into Redshift without processing the raw events with the older S3-based tools. Am I missing something?
Indeed, you would need to run a batch pipeline. That is the idea of the Lambda Architecture: being able to run real-time and batch pipelines off the same source.
As @alex mentioned, we will improve the architecture in the future by allowing you to sink to S3 off the Kinesis Enrich stream, thus avoiding running the enrichment process twice. As of now, however, the diagram depicts the currently supported way to load streamed (and enriched) data into Redshift.
Any update on this? I would like to be able to run Stream Enrich, store to S3, and copy to Redshift on demand (Redshift Spectrum); as cnamejj mentioned, this is much cheaper than maintaining all storage in Redshift. I already have data in S3, and otherwise everything is working the way I wanted.
FWIW, we ended up with a hybrid setup that runs Scala-based collectors, which have been rock solid. We then run jobs on instances to pull the data out of Kinesis streams into S3, where the older-model EMR jobs process it and load it into Redshift. It's working well, but a pipeline that avoids S3 and EMR would be preferable, and I expect it would get the data into Redshift much faster.
R102 introduced Stream Enrich support for EmrEtlRunner, which avoids the "double enrich" problem that the Lambda architecture has, but this still requires spinning up an EMR cluster at the moment to load into Redshift.
One of the hurdles at the moment to querying data that is stored outside of Redshift (e.g., via Redshift Spectrum) is that the data on S3 isn't represented in a columnar fashion. Although it's possible to query this data in the row-based format, it involves a significant amount of scanning, which translates to poorer performance and increased cost. Fortunately we can avoid this by converting to a columnar format.
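As a rough sketch of that conversion (paths are placeholders, and this skips attaching the canonical enriched-event column names), you can read the gzipped TSV with Spark and write it back out as Parquet, which Spectrum can then scan column by column:

```scala
// Sketch: converting row-based enriched events on S3 into Parquet so
// Redshift Spectrum only scans the columns a query touches.
// Paths are illustrative, not a canonical layout.
import org.apache.spark.sql.SparkSession

object EnrichedToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("enriched-to-parquet")
      .getOrCreate()

    // Read the gzipped, headerless TSV enriched events; columns come
    // out as _c0, _c1, ... since we don't attach a schema here.
    val rows = spark.read
      .option("sep", "\t")
      .csv("s3a://my-bucket/enriched/good/")

    // Write the same data back out in columnar Parquet form, which
    // lets Spectrum scan only the columns a query references.
    rows.write
      .mode("overwrite")
      .parquet("s3a://my-bucket/enriched-parquet/")

    spark.stop()
  }
}
```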