I am trying to understand your experiences and approaches for handling the bad side output from the collector and enrichment stages.
- What is the data format for these bad events, and how are you processing and loading them into object or blob storage?
I am thinking of loading them into Elasticsearch for text analytics. Currently I am using Kafka for streaming, and I am not sure whether an out-of-the-box loader exists for it.
Please feel free to share your thoughts and opinions. Thank you.
The format is JSON, and most pipelines use the S3 loader (https://docs.snowplow.io/docs/pipeline-components-and-applications/loaders-storage-targets/s3-loader/) in AWS to load from Kinesis to S3, or the GCS loader on GCP.
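Since the bad rows are JSON, one quick way to get a feel for the failures before settling on a storage target is to tally them with a short script. This is only a sketch: the exact bad-row field layout (e.g. a top-level `schema` key on a self-describing JSON) depends on your pipeline version, so treat those names as assumptions and check them against your own bad stream.

```python
import json
from collections import Counter

def tally_failures(lines):
    """Count bad events per failure type from newline-delimited JSON.

    Assumes each bad row is a self-describing JSON with a top-level
    "schema" key; rows that are not valid JSON are counted separately.
    """
    counts = Counter()
    for line in lines:
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            counts["unparseable"] += 1
            continue
        counts[row.get("schema", "unknown")] += 1
    return counts

# Illustrative sample input only (the schema URI is a made-up example)
sample = [
    '{"schema": "iglu:com.example/schema_violations/jsonschema/1-0-0", "data": {}}',
    'not json at all',
]
print(tally_failures(sample))
```

For low volumes this kind of ad-hoc analysis over files in object storage may already cover most debugging needs.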
Depending on what analysis you want to do and on your data volume, I wouldn't recommend Elasticsearch for this purpose, as it doesn't tend to scale particularly well.
If you are using Kafka, I'd recommend using the S3 Sink Connector.
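For reference, a Kafka Connect configuration for the Confluent S3 sink might look like the sketch below. The bucket name, region, topic name, and flush size are placeholders you would need to adapt to your own setup:

```json
{
  "name": "snowplow-bad-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "1",
    "topics": "bad-events",
    "s3.bucket.name": "my-snowplow-bad-rows",
    "s3.region": "eu-west-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000"
  }
}
```

Writing out as JSON keeps the bad rows queryable later with tools like Athena without any further transformation.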
Thank you @mike, I think I am clear on the bad stream format and the loading now. Thanks for pointing out the Kafka Connector, which is really helpful.
Are there any other alternatives you prefer in place of Elasticsearch? The bad event volume may not be very high, but for our use case we want the simplest way to analyse and visualize the bad events for debugging.