Replay collector data from s3 firehose files to enrich

Colm · June 2, 2021, 7:42pm

This might be a tricky one, depending on what Firehose is doing with the data.

I know others have run into issues with enriched data using Firehose, where the issue was essentially down to Firehose not adding a newline between events (I think this was one such relevant thread).

It does seem weird that Firehose wouldn’t be consistent with the aws-kinesis-agent you linked, but it’s an area I’m not terribly familiar with, so I’m not certain.

Assuming it is a similar issue, I reckon one way of going about it is to download one of the files and try to isolate a single collector payload. See if you can send just that one payload into the stream successfully, then work out from there how to write a script that does the same for the rest of the data. Collector payloads are Thrift-encoded according to this Schema, in case that helps.
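To make the "isolate a single payload" step a bit more concrete, here is a rough sketch of how one might cut a downloaded file into individual payloads. It rests on assumptions I haven't verified against your data: that the collector wrote each payload with Thrift's binary protocol, and that Firehose simply concatenated the records with nothing in between. The function names (`skip_value`, `skip_struct`, `split_payloads`) are made up for illustration. The code doesn't decode any fields; it only walks the Thrift binary layout to find where each struct ends.

```python
import struct

# Thrift binary protocol type ids.
STOP, BOOL, BYTE, DOUBLE, I16, I32, I64 = 0, 2, 3, 4, 6, 8, 10
STRING, STRUCT, MAP, SET, LIST = 11, 12, 13, 14, 15
FIXED_SIZE = {BOOL: 1, BYTE: 1, DOUBLE: 8, I16: 2, I32: 4, I64: 8}


def skip_value(buf, pos, ttype):
    """Return the offset just past one value of type `ttype` starting at `pos`."""
    if ttype in FIXED_SIZE:
        end = pos + FIXED_SIZE[ttype]
    elif ttype == STRING:
        (length,) = struct.unpack_from(">i", buf, pos)  # i32 length prefix
        if length < 0:
            raise ValueError(f"negative string length at offset {pos}")
        end = pos + 4 + length
    elif ttype == STRUCT:
        end = skip_struct(buf, pos)
    elif ttype in (LIST, SET):
        elem_type, count = struct.unpack_from(">bi", buf, pos)
        end = pos + 5
        for _ in range(count):
            end = skip_value(buf, end, elem_type)
    elif ttype == MAP:
        key_type, val_type, count = struct.unpack_from(">bbi", buf, pos)
        end = pos + 6
        for _ in range(count):
            end = skip_value(buf, end, key_type)
            end = skip_value(buf, end, val_type)
    else:
        raise ValueError(f"unexpected Thrift type {ttype} at offset {pos}")
    if end > len(buf):
        raise ValueError("value runs past the end of the buffer")
    return end


def skip_struct(buf, pos):
    """Return the offset just past the struct that starts at `pos`."""
    while True:
        ttype = buf[pos]
        pos += 1
        if ttype == STOP:  # a single 0x00 byte terminates a struct
            return pos
        pos += 2  # skip the i16 field id
        pos = skip_value(buf, pos, ttype)


def split_payloads(buf):
    """Split a buffer of back-to-back Thrift structs into one bytes object each."""
    payloads, pos = [], 0
    while pos < len(buf):
        end = skip_struct(buf, pos)
        payloads.append(buf[pos:end])
        pos = end
    return payloads
```

If this raises on a real file, that is useful information in itself: it suggests the records are not plain concatenated binary-protocol structs (for example, something was inserted between them, or a different encoding was used). If it does succeed, each element of `split_payloads(open(path, "rb").read())` is a candidate single payload that you could try writing to the raw stream as one record, for example with the Kinesis PutRecord API.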

I know this does sound like a lot of work… unfortunately Firehose isn’t part of the supported stack, so I’m not sure of a less manual way. Ultimately the task is to figure out what’s going wrong with parsing. My bet is that it’s either to do with the Thrift encoding or the newline problem I mentioned.

For future reference, if you use our S3 loader to dump the data into a bucket with LZO compression, then in cases like this we could run the old (now deprecated) batch pipeline over the data to get it into enriched-good format in S3, ready for the shredder/loader.

I hope that’s helpful! Let us know if we can be of use in figuring things out.
