I need to track data coming from Kafka and, after processing it, dump the data to HDFS.
Can someone please help me configure this in Snowplow, with a Kafka tracker and HDFS as the sink?
Hi @vishwas - we don’t have a component to do this yet I’m afraid.
Are you talking about storing the enriched events to HDFS, or the raw collector payloads?
Storing the enriched events to HDFS.
Collecting the data from Kafka, enriching it, and then storing the enriched data to HDFS… is this possible with Snowplow?
Hi @vishwas - we don’t currently have a component for this, but it’s something we’ll look at building in the New Year. In the meantime, I’d suggest doing a Google search for “Kafka to HDFS” and exploring the results which come up.
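As a rough illustration of what such a Kafka-to-HDFS step usually boils down to, here is a minimal sketch of the batching logic only. It is not a Snowplow component. The function name `drain_in_batches`, the `enriched/part` prefix and the `write_file` callback are all hypothetical: in a real setup `records` might be an iterator over a Kafka consumer and `write_file` might wrap an HDFS client, but neither is assumed here.

```python
from typing import Callable, Iterable, List


def drain_in_batches(
    records: Iterable[str],
    write_file: Callable[[str, str], None],
    batch_size: int = 1000,
    prefix: str = "enriched/part",  # hypothetical output path prefix
) -> List[str]:
    """Group records into newline-delimited files of at most batch_size lines.

    Returns the list of paths handed to write_file, in order.
    """
    written: List[str] = []
    batch: List[str] = []

    def flush() -> None:
        # Write the current batch (if any) as one file and reset it.
        if not batch:
            return
        path = f"{prefix}-{len(written):05d}.tsv"
        write_file(path, "\n".join(batch) + "\n")
        written.append(path)
        batch.clear()

    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            flush()
    flush()  # write any trailing partial batch
    return written


if __name__ == "__main__":
    # Stand-ins for the real source and sink: a list and a dict.
    files = {}
    print(drain_in_batches(["a\tb", "c\td", "e\tf"], files.__setitem__, batch_size=2))
```

Keeping the source and sink as plain parameters means the batching can be tested without a Kafka broker or an HDFS cluster; offset commits, retries and file rotation by time are deliberately left out.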
Hi @alex - Is it possible to collect the data from a Kafka topic, enrich it, and push it back to Kafka as another topic? If so, could you please help with configuring this? Thanks
Yes indeed it is possible - we don’t have documentation on the wiki yet but the blog post should help you:
Hi @alex - Thanks for the information. I went through the documentation, but one more clarification is required: how do I configure the collector to use Kafka as a source (to collect the data from a Kafka topic)?
The overall flow is Kafka topic --> Enrichment --> Kafka topic. I need to set up this pipeline in Snowplow…
Thanks…
Hi @vishwas - I think you’re a bit confused on the terminology: an event collector receives events over HTTP; there’s no concept of Kafka (or Kinesis or S3 or …) as a collector’s source. Of course Stream Enrich can take Kinesis or Kafka as a source.
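To make "receives events over HTTP" concrete, here is a minimal sketch that only builds the kind of GET request a tracker would send to a collector. The host `collector.example.com` and the helper name `page_view_url` are hypothetical, and the `/i` path and the `e`, `p`, `aid` and `url` parameters reflect my understanding of the Snowplow tracker protocol, so please check them against the protocol documentation before relying on them.

```python
from urllib.parse import urlencode


def page_view_url(collector_host: str, app_id: str, page_url: str) -> str:
    """Build a GET URL for a page-view event (assumed tracker protocol fields)."""
    params = {
        "e": "pv",        # event type: page view
        "p": "srv",       # platform: server-side application
        "aid": app_id,    # application id
        "url": page_url,  # URL of the page being tracked
    }
    # urlencode percent-escapes the values, including the nested page URL.
    return f"http://{collector_host}/i?{urlencode(params)}"


if __name__ == "__main__":
    print(page_view_url("collector.example.com", "my-app", "http://example.com/a?b=1"))
```

The point is that anything able to issue such an HTTP request can feed a collector; Kafka only enters the picture downstream, as the stream the collector writes to and Stream Enrich reads from.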
Hi @alex, currently we are running a Snowplow pipeline with AWS Kinesis as the message queue. One of the steps of our pipeline is persisting enriched events from Kinesis to S3 with the Snowplow S3 Loader. We want to migrate from Kinesis to Kafka and cannot find a replacement for that step, since the S3 Loader doesn’t support a Kafka source.
One possible option is to use Kafka S3 Connector, but then we break the subsequent steps of the pipeline.
Is support for a Kafka source in the Snowplow S3 Loader on your roadmap, and is there a timeline for it?