The documentation implies that the tool that mirrors the contents of a Kinesis stream to S3 files was developed to read raw events from a collector. My question is whether that tool can read enriched events from a Kinesis stream instead.
In other words, assuming my goal is only to load enriched events into Redshift, can I use a pipeline like:
Absolutely. We use the kinesis-s3 sink to store both raw and enriched events. It's actually a good idea to keep the raw events around in case you need to re-enrich them later on. Even better: you can use gzip compression for the enriched events, which is easier to handle for Spark clients, for example.
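For instance (a minimal sketch, not official Snowplow code; the bucket path is a placeholder for wherever your kinesis-s3 sink writes gzipped enriched events), Spark decompresses the .gz files transparently when you read them:

```scala
// Sketch: reading gzipped Snowplow enriched events with Spark.
// The S3 path below is hypothetical -- substitute your own.
import org.apache.spark.sql.SparkSession

object ReadEnrichedEvents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-enriched-events")
      .getOrCreate()

    // Spark handles the .gz codec transparently; each line is one
    // tab-separated enriched event.
    val events = spark.read.textFile("s3a://my-bucket/enriched/good/")
    val fieldCount = events.rdd
      .map(_.split("\t", -1).length)
      .first()

    println(s"First event has $fieldCount tab-separated fields")
    spark.stop()
  }
}
```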
This specific topology won't work, because the StorageLoader doesn't load enriched events directly into Redshift, but rather loads "shredded" enriched events into Redshift. Currently the only place this shredding logic lives is in our Hadoop Shred job, so you can't avoid running EMR if you want to load Redshift.
In that linked Lambda Architecture article, I don't see any way to get data into Redshift without processing the raw events with the older S3-based tools. Am I missing something?
Indeed, you would need to run a batch pipeline. That is the idea of the Lambda Architecture: being able to run real-time and batch pipelines off the same source.
As @alex mentioned, we will improve the architecture in the future by allowing you to sink to S3 off the Kinesis Enrich stream, thus avoiding running the enrichment process twice. As of now, however, the diagram depicts the currently supported way to load streamed (and enriched) data into Redshift.
Any update on this? I would like to be able to run Stream Enrich, store to S3, and copy to Redshift on demand (Redshift Spectrum); as cnamejj mentioned, this is much cheaper than maintaining all storage in Redshift. I already have data in S3, and otherwise everything is working the way I wanted.
FWIW, we ended up with a hybrid setup that runs Scala-based collectors, which have been rock solid. We then run jobs on instances to pull the data out of Kinesis streams into S3, where the older-model EMR jobs process it and load it into Redshift. It's working well, but a pipeline that avoids S3 and EMR would be preferable, and I expect it would get the data into Redshift much faster.
R102 introduced Stream Enrich support for EmrEtlRunner, which avoids the "double enrich" problem that the Lambda architecture has, but this still requires spinning up an EMR cluster at the moment to load into Redshift.
One of the hurdles at the moment to querying data that is stored outside of Redshift (e.g., via Redshift Spectrum) is that the data on S3 isn't represented in a columnar fashion. Although it's possible to query this data in the row-based format, it involves a significant amount of scanning, which translates to poorer performance and increased cost. Fortunately we can avoid this by converting to a columnar format.
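As a rough sketch of that conversion (paths are placeholders, and this skips attaching the canonical enriched-event column names), you can read the gzipped TSV with Spark and write it back out as Parquet, which Spectrum can then scan column by column:

```scala
// Sketch: converting row-based enriched events on S3 into Parquet so
// Redshift Spectrum only scans the columns a query touches.
// Paths are illustrative, not a canonical layout.
import org.apache.spark.sql.SparkSession

object EnrichedToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("enriched-to-parquet")
      .getOrCreate()

    // Read the gzipped, headerless TSV enriched events; columns come
    // out as _c0, _c1, ... since we don't attach a schema here.
    val rows = spark.read
      .option("sep", "\t")
      .csv("s3a://my-bucket/enriched/good/")

    // Write the same data back out in columnar Parquet form, which
    // lets Spectrum scan only the columns a query references.
    rows.write
      .mode("overwrite")
      .parquet("s3a://my-bucket/enriched-parquet/")

    spark.stop()
  }
}
```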