I’m totally new to Snowplow and I’m just digging through the extensive documentation. I managed to set up a proof of concept using Apache Kafka for the collector and enrichment phases instead of Kinesis, but then I hit a wall and have no idea how to move my data to something suitable for further analysis, both batch and real-time - PostgreSQL for example. I grepped the 4-storage folder and there isn’t a single mention of Kafka; after some research I found out that the whole Kafka support is quite new. But if there is this support, there must be some way to use the data collected via Kafka?
My question is - once I’m past the enrichment phase and the data is back in a Kafka topic (the post-enrichment one), how do I store it in PostgreSQL or Elasticsearch? Or should I not pick Kafka as the final post-enrich store, and instead move the data to something for which there is an existing loader/sink?
Sorry for such a basic question, but I couldn’t find any remark on this, and I found out the hard way by trying first storage-loader and then kinesis-elasticsearch-sink; neither even mentions Kafka support.
As you’ve spotted, our Kafka support is very new and incomplete. In terms of the specific storage targets you mention:
We plan on adding Kafka support to the kinesis-elasticsearch-sink (which we’ll rename stream-elasticsearch-sink), but this will require some heavy refactoring, and we will only start on this in the New Year
We have no plans to support loading from a Kafka cluster into single-node Postgres
If there are any other distributed analytical databases you would be interested in us supporting from Kafka (EventQL? Greenplum? ClickHouse? Vertica? Redshift?) then let us know
At the moment, as a workaround, if you wanted to load the data into Elasticsearch you could write a simple Kafka-to-Kinesis bridge and then connect that into kinesis-elasticsearch-sink.
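For illustration, here is a minimal sketch of what such a bridge could look like, using the plain Java Kafka consumer and the AWS SDK for Kinesis. The topic and stream names are placeholders; use whatever your enrichment process writes to and whatever your kinesis-elasticsearch-sink is configured to read from.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToKinesisBridge {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "kafka-to-kinesis-bridge");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Placeholder names: the enriched Kafka topic to read from and the
        // Kinesis stream the kinesis-elasticsearch-sink reads from.
        String kafkaTopic = "snowplow-enriched-good";
        String kinesisStream = "snowplow-enriched-good";

        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList(kafkaTopic));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Forward each enriched event unchanged; reuse the Kafka key
                    // (if any) as the Kinesis partition key.
                    PutRecordRequest request = new PutRecordRequest()
                            .withStreamName(kinesisStream)
                            .withPartitionKey(record.key() != null
                                    ? record.key()
                                    : String.valueOf(record.offset()))
                            .withData(ByteBuffer.wrap(
                                    record.value().getBytes(StandardCharsets.UTF_8)));
                    kinesis.putRecord(request);
                }
                consumer.commitSync();
            }
        }
    }
}
```

A production bridge would batch records with PutRecords and handle retries/throttling, but the shape of the problem is just this: consume, re-partition, put.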
Thanks for explaining your plans regarding Kafka. We wanted to use Kafka with Elasticsearch (real-time) and PostgreSQL (batch) backends. The first seems achievable, as you pointed out, by writing a Kafka->Kinesis bridge or just a straight Kafka->Elasticsearch converter. The other one I have no idea how to approach.
About Kafka to Postgres - do you mean straight Kafka->Postgres? Couldn’t it work in a similar way to the Kinesis LZO S3 sink + EmrEtlRunner + StorageLoader? Or is it a completely different thing?
Since we’re doing this as a proof of concept now and don’t have many resources we can put into writing our own converters, we’ll probably try the Amazon approach for now. We’ll get Kinesis + S3 and see how the whole pipeline works for us.
Hi @noizex - sorry, you’re right: if you could get the raw events from Kafka to S3 (using a hypothetical equivalent of our kinesis-s3 component), then you could totally run the batch pipeline on EMR and load into Postgres.
My point was more that, Postgres being a non-clustered database, it doesn’t make sense for us to spend time supporting loading Postgres directly from a Kafka cluster - the Postgres load would bottleneck badly.
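To make the “hypothetical equivalent” a bit more concrete, a sketch of the pattern is below: buffer records from the raw Kafka topic into files and periodically upload them to S3, where the batch pipeline can pick them up. Topic, bucket, key layout and flush threshold are all placeholders, and note the real kinesis-s3 sink also handles LZO/Thrift framing, which is omitted here, so this only shows the buffering-and-upload shape, not a drop-in replacement.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.UUID;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToS3Sink {

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "kafka-to-s3-sink");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        String topic = "snowplow-raw-good";       // placeholder topic name
        String bucket = "my-snowplow-raw-events"; // placeholder bucket name
        int recordsPerFile = 10000;               // flush threshold, tune to taste

        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList(topic));
            File buffer = File.createTempFile("raw-events-", ".txt");
            FileWriter writer = new FileWriter(buffer);
            int buffered = 0;

            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    writer.write(record.value());
                    writer.write('\n');
                    buffered++;
                }
                if (buffered >= recordsPerFile) {
                    writer.close();
                    // The key layout is only illustrative; EmrEtlRunner expects
                    // its own in-bucket directory structure and file format.
                    String key = "raw/" + UUID.randomUUID() + ".txt";
                    s3.putObject(bucket, key, buffer);
                    consumer.commitSync();
                    buffer = File.createTempFile("raw-events-", ".txt");
                    writer = new FileWriter(buffer);
                    buffered = 0;
                }
            }
        }
    }
}
```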
Where are you with the support for Kafka to Kinesis? I am doing a quick proof of concept for our group and have managed to have the collector write to Kafka and enrichment read from and write to Kafka. I now need to get the data back into a Kinesis stream to support our existing Elasticsearch and other downstream clients of the data. I am investigating using Amazon’s KCL and rolling my own, but if there is something already implemented I would be very happy to try that.
Hey @charlie - we didn’t get round to this, but it shouldn’t be too difficult to write a generic Kafka-to-Kinesis bridge, probably using something like Kafka Connect. It doesn’t even need to be Snowplow-specific (it can simply mirror the Kafka stream)…
Has there been any update on an enricher for the Kafka stream?
Right now we are trying to store data in PostgreSQL; to do so we are pulling from our Kafka collector topic and separating out the TSV ourselves. We got it to sort of work with https://github.com/snowplow/snowplow/wiki/setting-up-stream-enrich, but we are still pulling and manipulating the TSV by hand.
We want to be able to stream straight into a PostgreSQL DB, but we need to keep all data internal, i.e. no cloud services.
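There is no official Kafka-to-Postgres loader, but for an all-internal setup one option is a small consumer that reads the enriched TSV off Kafka and batch-inserts it into PostgreSQL over JDBC. This is only a sketch under assumptions: the topic name, table definition, credentials and column mapping below are made up, and the field indices are illustrative (check the canonical enriched event field list for the real ordering).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToPostgresLoader {

    public static void main(String[] args) throws SQLException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "kafka-to-postgres-loader");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        String topic = "snowplow-enriched-good"; // placeholder topic name

        // Assumed table:
        //   CREATE TABLE enriched_events
        //     (app_id text, collector_tstamp text, event_tsv text);
        String jdbcUrl = "jdbc:postgresql://localhost:5432/snowplow";

        try (Connection conn = DriverManager.getConnection(jdbcUrl, "snowplow", "secret");
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {

            consumer.subscribe(Collections.singletonList(topic));
            PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO enriched_events (app_id, collector_tstamp, event_tsv) "
                            + "VALUES (?, ?, ?)");

            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each enriched event is one tab-separated line; the field
                    // positions used here (0 = app_id, 3 = collector_tstamp) are
                    // illustrative only.
                    String[] fields = record.value().split("\t", -1);
                    insert.setString(1, fields[0]);
                    insert.setString(2, fields[3]);
                    insert.setString(3, record.value());
                    insert.addBatch();
                }
                if (!records.isEmpty()) {
                    insert.executeBatch();
                    consumer.commitSync();
                }
            }
        }
    }
}
```

Storing the whole TSV line alongside a few promoted columns keeps the loader simple while you decide which fields you actually want modelled as proper columns; Postgres COPY would be faster than row inserts if volumes grow.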