How to configure a Kafka collector and an HDFS sink

vishwas · December 1, 2016, 12:44pm

I need to track data coming from Kafka and, after processing it, dump the data to HDFS.
Can someone please help me configure this in Snowplow, with a Kafka tracker and HDFS as the sink?

alex · December 1, 2016, 1:21pm

Hi @vishwas - we don’t have a component to do this yet I’m afraid.

Are you talking about storing the enriched events to HDFS, or the raw collector payloads?

vishwas · December 1, 2016, 5:47pm

Storing the enriched events to HDFS.
Collecting the data from Kafka, enriching it, and then storing the enriched data to HDFS… is this possible with Snowplow?

alex · December 1, 2016, 6:33pm

Hi @vishwas - we don’t currently have a component for this, but it’s something we’ll look at building in the New Year. In the meantime, I’d suggest doing a Google search for “Kafka to HDFS” and exploring the results which come up.

vishwas · January 4, 2017, 11:53am

Hi @alex - Is it possible to collect the data from a Kafka topic, enrich it, and push it back to Kafka as another topic? If so, could you please help with configuring this? Thanks

alex · January 4, 2017, 12:39pm

Yes indeed it is possible - we don’t have documentation on the wiki yet but the blog post should help you:

http://snowplowanalytics.com/blog/2016/11/15/snowplow-r85-metamorphosis-released-with-beta-apache-kafka-support/

vishwas · January 4, 2017, 12:48pm

Hi @alex - Thanks for the information. I went through the documentation, and one more clarification is required: how do I configure the collector to use Kafka as a source (to collect the data from a Kafka topic)?
The overall flow is Kafka topic --> Enrichment --> Kafka topic. I need to set up this pipeline in Snowplow.
Thanks

alex · January 4, 2017, 1:02pm

Hi @vishwas - I think you’re a bit confused on the terminology: an event collector receives events over HTTP; there’s no concept of Kafka (or Kinesis or S3 or …) as a collector’s source. Of course Stream Enrich can take Kinesis or Kafka as a source.
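To make the "collector receives events over HTTP" point concrete, here is a minimal Python sketch of how a record read from an upstream Kafka topic could be turned into a tracker-protocol GET request for a collector. It only builds the URL; the Kafka consumer and the HTTP call are left out. The record field names (`app_id`, `page_url`), the host `collector.example.com:8080`, and the tracker version label are assumptions made up for illustration, not part of any Snowplow component.

```python
from urllib.parse import urlencode


def build_collector_url(collector_host: str, record: dict) -> str:
    """Map one upstream record to a tracker-protocol GET URL (page view).

    `record` is a hypothetical dict with `app_id` and `page_url` keys,
    standing in for whatever is stored in the source Kafka topic.
    """
    params = {
        "e": "pv",                   # event type: page view
        "p": "srv",                  # platform: server-side application
        "tv": "py-sketch-0.1",       # tracker version label (hypothetical)
        "aid": record["app_id"],     # application ID
        "url": record["page_url"],   # page URL
    }
    # The collector's pixel endpoint is /i; parameters go in the querystring.
    return f"http://{collector_host}/i?{urlencode(params)}"


if __name__ == "__main__":
    record = {"app_id": "my-app", "page_url": "http://example.com/home"}
    print(build_collector_url("collector.example.com:8080", record))
```

In practice one of the existing Snowplow trackers would normally do this job; the sketch is only meant to show that the collector's input is an HTTP request, while reading from Kafka happens downstream in Stream Enrich.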

jahstreet · February 18, 2020, 12:05pm

Hi @alex, we are currently running a Snowplow pipeline with AWS Kinesis as the message queue. One of the steps of our pipeline is persisting enriched events from Kinesis to S3 with the Snowplow S3 Loader. We want to migrate from Kinesis to Kafka and cannot find a replacement for that step, since the S3 Loader doesn’t support a Kafka source.

One possible option is to use Kafka S3 Connector, but then we break the subsequent steps of the pipeline.

Is support for a Kafka source in the Snowplow S3 Loader on your roadmap, and is there a timeline for it?
