Replay collector data from S3 Firehose files to enrich

So this might be a tricky one, depending on what Firehose is doing with the data.

I know others have run into problems with enriched data using Firehose, which essentially came down to Firehose not adding a newline between events (I think this was one such relevant thread).

It does seem weird that Firehose wouldn’t be internally consistent with the aws-kinesis-agent you linked, but it’s an area that I’m not terribly familiar with, so I’m not certain.

Assuming it is a similar issue, I reckon one way of going about it is to download one of the files and try to identify a single collector payload. See if you can send just that one payload into the stream successfully, then work out from there how to write a script that does the same for the rest of the data. Collector payloads are Thrift-encoded according to this Schema, in case that helps.
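For the "send just one into the stream" step, something like this rough sketch might do. It assumes the target is a Kinesis stream, that you have already isolated the raw bytes of one payload, and that boto3 is installed with credentials configured. The function name `send_single_payload` is just made up for illustration.

```python
import uuid


def send_single_payload(payload: bytes, stream_name: str, client=None) -> dict:
    """Put one raw collector payload onto a Kinesis stream and return the response."""
    if client is None:
        # Imported lazily so the function can be tried out with a stub client.
        import boto3

        client = boto3.client("kinesis")
    return client.put_record(
        StreamName=stream_name,
        Data=payload,
        # Random partition key: ordering should not matter for a one-off test.
        PartitionKey=str(uuid.uuid4()),
    )
```

You would call it with the bytes of one candidate payload and the name of your raw stream, then check whether that event turns up in the enriched good stream or the bad stream.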

I know this does sound like a lot of work… unfortunately Firehose isn’t part of the supported stack, so I’m not sure of a less manual way. Ultimately the task is to figure out what’s going wrong with parsing. My bet is that it’s either to do with the Thrift encoding or the newline problem I mentioned.
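To get a first read on which of the two it might be, a quick look at one downloaded file could help. This is only a sketch using the Python standard library: `describe_firehose_object` is a hypothetical helper, and it assumes the object is either uncompressed or gzip-compressed, which may not match your Firehose settings.

```python
import gzip


def describe_firehose_object(raw: bytes) -> dict:
    """Summarise a downloaded object so a framing problem is easier to spot."""
    gzipped = raw[:2] == b"\x1f\x8b"  # gzip magic number
    data = gzip.decompress(raw) if gzipped else raw
    return {
        "gzipped": gzipped,
        "size_bytes": len(data),
        # Rough hint only: binary Thrift data can contain 0x0a bytes by chance.
        "newline_count": data.count(b"\n"),
        "first_bytes_hex": data[:16].hex(),
    }
```

A newline count of zero on a file that should hold many events would point towards the missing-delimiter problem. A non-zero count proves less, since the payloads are binary, so treat it as a hint rather than an answer.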

For future reference, if you use our S3 loader to dump the data into a bucket with LZO compression, we could run the (now deprecated) old batch pipeline over the data to get it into enriched-good format in S3, ready for the shredder/loader.

I hope that’s helpful! Let us know if we can be of use in figuring things out.