Loading And Processing Bad Events [Apache Kafka]

Jayant_Kumar · October 26, 2023, 3:45am

Hi Folks,

I'd like to hear about your experiences and approaches for handling the bad side output from the collector and enrichment stages.

What is the data format for these bad events, and how are you processing and loading them into object or blob storage?

I am thinking of loading it into Elasticsearch for text analytics. Currently, I am using Kafka for streaming, and I am not sure if we have an existing loader for it out of the box.

Please feel free to share your thoughts and opinions. Thank you.

mike · October 26, 2023, 6:17am

The format is JSON, and most pipelines use the S3 loader (https://docs.snowplow.io/docs/pipeline-components-and-applications/loaders-storage-targets/s3-loader/) on AWS to load from Kinesis to S3, or the GCS loader on GCP.
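
As a rough illustration of working with that JSON, here is a minimal Python sketch that tallies bad events by failure type. It assumes each bad event is one self-describing JSON object per line, with a `schema` field holding an Iglu URI such as `iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/2-0-0`. The function names `failure_type` and `tally_bad_rows` are hypothetical, and you should check the exact layout against your own bad stream.

```python
import json
from collections import Counter


def failure_type(schema_uri):
    """Return the bad-row type segment of an Iglu schema URI.

    Assumes the URI looks like 'iglu:<vendor>/<name>/<format>/<version>';
    anything else is reported as 'unknown'.
    """
    if not isinstance(schema_uri, str):
        return "unknown"
    parts = schema_uri.split("/")
    return parts[1] if len(parts) >= 2 and parts[1] else "unknown"


def tally_bad_rows(lines):
    """Count bad events per failure type from newline-delimited JSON."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            counts["unparseable"] += 1
            continue
        schema = row.get("schema") if isinstance(row, dict) else None
        counts[failure_type(schema)] += 1
    return counts
```

You could feed it the lines of a file pulled from the bucket, for example `tally_bad_rows(open("bad-rows.json"))`, and print `counts.most_common()` to see which failure types dominate.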

Depending on what analysis you want to do and your data volume, I wouldn’t recommend Elasticsearch for this purpose, as it doesn’t tend to scale particularly well.

If you are using Kafka, I’d recommend using the S3 Sink Connector.
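
For reference, a connector definition might look roughly like the sketch below. It assumes the Confluent S3 sink connector (other S3 sink connectors use different class names and keys) and builds the JSON body you would submit to the Kafka Connect REST API. The connector name, topic, bucket and region are placeholders, and `build_s3_sink_config` is a hypothetical helper.

```python
import json


def build_s3_sink_config(name, topic, bucket, region, flush_size=1000):
    """Build a Kafka Connect request body for an S3 sink.

    Assumes the Confluent S3 sink connector and schemaless JSON values
    on the bad topic; adjust the converters if your messages differ.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "tasks.max": "1",
            "topics": topic,
            "s3.bucket.name": bucket,
            "s3.region": region,
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            # Write each record as a JSON line in the S3 object.
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            # Number of records written per S3 object.
            "flush.size": str(flush_size),
            "key.converter": "org.apache.kafka.connect.storage.StringConverter",
            "value.converter": "org.apache.kafka.connect.json.JsonConverter",
            "value.converter.schemas.enable": "false",
        },
    }


if __name__ == "__main__":
    # Placeholder values; replace with your own topic and bucket.
    body = build_s3_sink_config("bad-events-s3", "bad", "my-bad-events", "eu-west-1")
    print(json.dumps(body, indent=2))
```

The printed JSON is what you would POST to the `/connectors` endpoint of your Kafka Connect worker.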

Jayant_Kumar · October 26, 2023, 6:47am

Thank you @mike, I think I am clear on the bad stream format and the loading now. Thanks for pointing out the Kafka Connector part, which is really helpful.

Are there any other alternatives that you prefer in place of Elasticsearch? The bad event volume may not be very high. For our use case, though, we want the simplest way to analyse and visualize the bad events for debugging.
