I’m totally new to Snowplow and I’m just digging through the extensive documentation. I managed to set up a proof of concept using Apache Kafka for the collector and enrichment phases instead of Kinesis, but then I hit a wall and have no idea how to move my data to something suitable for further analysis, both batch and real-time - PostgreSQL for example. I grepped the 4-storage folder and there isn’t a single mention of Kafka; after some research I found out that the whole Kafka support is quite new. But if there is this support, there must be some way to use the data collected via Kafka?
My question is - once I’m past the enrichment phase and the data is back in a Kafka topic (the post-enrichment one), how do I store it in PostgreSQL or Elasticsearch? Or should I not pick Kafka as the final post-enrich store, and instead move the data to something for which there is an existing loader/sink?
Sorry for such a basic question, but I couldn’t find any remark on this, and I found out the hard way by trying first storage-loader and then kinesis-elasticsearch-sink; neither even mentions Kafka support.
As you’ve spotted, our Kafka support is very new and incomplete. In terms of the specific storage targets you mention:
We plan on adding Kafka support to the kinesis-elasticsearch-sink (which we’ll rename stream-elasticsearch-sink), but this will require some heavy refactoring, and we will only start on this in the New Year
We have no plans to support loading from a Kafka cluster into single-node Postgres
If there are any other distributed analytical databases you would be interested in us supporting from Kafka (EventQL? Greenplum? ClickHouse? Vertica? Redshift?) then let us know
At the moment, as a workaround, if you wanted to load the data into Elasticsearch you could write a simple Kafka-to-Kinesis bridge and then connect that into kinesis-elasticsearch-sink.
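For illustration, here is a minimal sketch of what such a bridge could look like, using the plain Java Kafka consumer and the AWS SDK for Kinesis. The topic and stream names are placeholders; use whatever your enrichment process writes to and whatever your kinesis-elasticsearch-sink is configured to read from.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToKinesisBridge {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "kafka-to-kinesis-bridge");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Placeholder names: the enriched Kafka topic to read from and the
        // Kinesis stream the kinesis-elasticsearch-sink reads from.
        String kafkaTopic = "snowplow-enriched-good";
        String kinesisStream = "snowplow-enriched-good";

        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList(kafkaTopic));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Forward each enriched event unchanged; reuse the Kafka key
                    // (if any) as the Kinesis partition key.
                    PutRecordRequest request = new PutRecordRequest()
                            .withStreamName(kinesisStream)
                            .withPartitionKey(record.key() != null
                                    ? record.key()
                                    : String.valueOf(record.offset()))
                            .withData(ByteBuffer.wrap(
                                    record.value().getBytes(StandardCharsets.UTF_8)));
                    kinesis.putRecord(request);
                }
                consumer.commitSync();
            }
        }
    }
}
```

A production bridge would batch records with PutRecords and handle retries/throttling, but the shape of the problem is just this: consume, re-partition, put.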
Thanks for explaining your plans regarding Kafka. We wanted to use Kafka with Elasticsearch (real-time) and PostgreSQL (batch) backends. The first seems achievable, as you pointed out, by writing a Kafka->Kinesis bridge or just a straight Kafka->Elasticsearch converter. The other one I have no idea how to approach.
About Kafka to Postgres - do you mean straight Kafka->Postgres? Couldn’t it work in a similar way to the Kinesis LZO S3 sink + EmrEtlRunner + StorageLoader? Or is it a completely different thing?
Since we’re doing this as a proof of concept now and don’t have many resources we can put into writing our own converters, we’ll probably try the Amazon approach for now. We’ll get Kinesis + S3 and see how the whole pipeline works for us.
Hi @noizex - sorry, you’re right: if you could get the raw events from Kafka to S3 (using a hypothetical equivalent of our kinesis-s3 component), then you could totally run the batch pipeline on EMR and load into Postgres.
My point was more that, Postgres being a non-clustered database, it doesn’t make sense for us to spend time supporting loading Postgres directly from a Kafka cluster - the Postgres load would bottleneck badly.
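To make the “hypothetical equivalent” a bit more concrete, a sketch of the pattern is below: buffer records from the raw Kafka topic into files and periodically upload them to S3, where the batch pipeline can pick them up. Topic, bucket, key layout and flush threshold are all placeholders, and note the real kinesis-s3 sink also handles LZO/Thrift framing, which is omitted here, so this only shows the buffering-and-upload shape, not a drop-in replacement.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.UUID;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToS3Sink {

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "kafka-to-s3-sink");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        String topic = "snowplow-raw-good";       // placeholder topic name
        String bucket = "my-snowplow-raw-events"; // placeholder bucket name
        int recordsPerFile = 10000;               // flush threshold, tune to taste

        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList(topic));
            File buffer = File.createTempFile("raw-events-", ".txt");
            FileWriter writer = new FileWriter(buffer);
            int buffered = 0;

            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    writer.write(record.value());
                    writer.write('\n');
                    buffered++;
                }
                if (buffered >= recordsPerFile) {
                    writer.close();
                    // The key layout is only illustrative; EmrEtlRunner expects
                    // its own in-bucket directory structure and file format.
                    String key = "raw/" + UUID.randomUUID() + ".txt";
                    s3.putObject(bucket, key, buffer);
                    consumer.commitSync();
                    buffer = File.createTempFile("raw-events-", ".txt");
                    writer = new FileWriter(buffer);
                    buffered = 0;
                }
            }
        }
    }
}
```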
Where are you with the support for Kafka to Kinesis? I am doing a quick proof of concept for our group and have managed to have the collector write to Kafka and enrichment read from and write to Kafka. I now need to get the data back into a Kinesis stream to support our existing Elasticsearch and other downstream clients of the data. I am investigating using Amazon’s KCL and rolling my own, but if there is something already implemented I would be very happy to try that.
Hey @charlie - we didn’t get round to this, but it shouldn’t be too difficult to write a generic Kafka-to-Kinesis bridge, probably using something like Kafka Connect. It doesn’t even need to be Snowplow-specific (it can simply mirror the Kafka stream)…
Has there been any update on an enricher for the Kafka stream?
Right now we are trying to store data in PostgreSQL; to do so we are pulling from our Kafka collector topic and separating out the TSV ourselves. We got it to sort of work with https://github.com/snowplow/snowplow/wiki/setting-up-stream-enrich, but we are still pulling and manipulating the TSV by hand.
We want to be able to stream straight into a PostgreSQL DB, but we need to keep all data internal, i.e. no cloud services.
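There is no official Kafka-to-Postgres loader, but for an all-internal setup one option is a small consumer that reads the enriched TSV off Kafka and batch-inserts it into PostgreSQL over JDBC. This is only a sketch under assumptions: the topic name, table definition, credentials and column mapping below are made up, and the field indices are illustrative (check the canonical enriched event field list for the real ordering).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToPostgresLoader {

    public static void main(String[] args) throws SQLException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "kafka-to-postgres-loader");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        String topic = "snowplow-enriched-good"; // placeholder topic name

        // Assumed table:
        //   CREATE TABLE enriched_events
        //     (app_id text, collector_tstamp text, event_tsv text);
        String jdbcUrl = "jdbc:postgresql://localhost:5432/snowplow";

        try (Connection conn = DriverManager.getConnection(jdbcUrl, "snowplow", "secret");
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {

            consumer.subscribe(Collections.singletonList(topic));
            PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO enriched_events (app_id, collector_tstamp, event_tsv) "
                            + "VALUES (?, ?, ?)");

            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each enriched event is one tab-separated line; the field
                    // positions used here (0 = app_id, 3 = collector_tstamp) are
                    // illustrative only.
                    String[] fields = record.value().split("\t", -1);
                    insert.setString(1, fields[0]);
                    insert.setString(2, fields[3]);
                    insert.setString(3, record.value());
                    insert.addBatch();
                }
                if (!records.isEmpty()) {
                    insert.executeBatch();
                    consumer.commitSync();
                }
            }
        }
    }
}
```

Storing the whole TSV line alongside a few promoted columns keeps the loader simple while you decide which fields you actually want modelled as proper columns; Postgres COPY would be faster than row inserts if volumes grow.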