Enrich with Kafka

Hi @magaton,

It’s definitely a thought-provoking proposal. My thoughts:

> especially if you want to keep kinesis as an option

Certainly we have no plans to remove support for Kinesis - it is working great for the Snowplow community as a minimal-ops Kafka alternative on AWS.

Building on our Kinesis success, we are now also actively exploring Azure Event Hubs (see our recent RFC) and Google Cloud Pub/Sub (see our recent example project).

These hosted unified log technologies are great for those who can’t afford a 24/7 ops team with deep Kafka experience.

> kafka cluster with enabled kafka rest proxy … - there is no need now for a separate app that serves as collector

I have a few concerns with this approach:

  1. REST is a poor paradigm for event collection - CQRS is the methodology we use
  2. The Snowplow collectors perform some important analytics functions, such as setting cookies and handling redirects
  3. I’d be nervous about exposing a read/write API to your company’s Kafka cluster on the open internet

> I can see … avro schema registry as a kind of equivalent of … iglu … I read your arguments about Iglu + Thrift vs Avro in another thread.

Confluent schema registry is a strict subset of Iglu, as covered in the last sub-section in this post:

We believe that we can support Confluent schema registry as a downstream “lossy mirror” of an Iglu schema registry, but we cannot use it as our primary schema registry without losing most of the capabilities that make Snowplow, Snowplow.
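To make the “lossy mirror” idea a little more concrete: an Iglu schema key carries vendor, name, format and a full SchemaVer (MODEL-REVISION-ADDITION), whereas the Confluent registry keys schemas by a flat subject plus a registry-assigned integer version. Here is a rough Scala sketch of one possible projection - the names and mapping are purely illustrative, not an actual Snowplow component:

```scala
// Illustrative sketch only: projecting an Iglu schema coordinate onto a
// Confluent schema registry subject. Names and mapping are hypothetical.

final case class SchemaVer(model: Int, revision: Int, addition: Int)

final case class IgluSchemaKey(
  vendor: String,    // e.g. "com.acme"
  name: String,      // e.g. "checkout"
  format: String,    // e.g. "jsonschema"
  version: SchemaVer // e.g. SchemaVer(1, 0, 0)
)

// Confluent keys schemas by a flat subject string plus a registry-assigned
// integer version, so the SchemaVer granularity is flattened in the mirror.
final case class ConfluentSubject(subject: String)

def toConfluentSubject(key: IgluSchemaKey): ConfluentSubject =
  ConfluentSubject(s"${key.vendor}.${key.name}")

// toConfluentSubject(IgluSchemaKey("com.acme", "checkout", "jsonschema", SchemaVer(1, 0, 0)))
// => ConfluentSubject("com.acme.checkout")
```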

> app that runs kafka streams so that the collected data get transformed / aggregated / enriched in different ways and end up in different topics.

Yes, definitely - there is nothing stopping a Kafka user of Snowplow from writing their own awesome Kafka Streams (or Flink or Beam or Spark Streaming) jobs that work on the event stream and do all sorts of cool data modeling and aggregation.

That’s the nice thing about Snowplow’s async micro-service-based architecture, running on Kafka or Kinesis or Event Hubs - you can write any app which plugs into our enriched event topic.
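To give a flavour, here’s a minimal sketch (using the kafka-streams-scala module) of a Kafka Streams topology that plugs into the enriched event stream and counts events per application. The topic names are assumptions, and a real job would parse the enriched TSV properly (e.g. with our Scala Analytics SDK) rather than splitting on tabs:

```scala
import java.util.Properties

import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object EnrichedEventsPerApp extends App {

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enriched-events-per-app")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  // Enriched events arrive as tab-separated strings; the topic name is an assumption
  val enriched = builder.stream[String, String]("snowplow-enriched-good")

  enriched
    .map((_, tsv) => (tsv.split("\t", -1)(0), tsv)) // re-key by app_id (first TSV field)
    .groupByKey
    .count()
    .toStream
    .to("snowplow-events-per-app") // output topic name is also an assumption

  new KafkaStreams(builder.build(), props).start()
}
```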

> From there the data can be sent to ES, Spark, Neo4j, FS, RDBMS using available kafka connect sinks, depends what kind of data analytics / storage you need

We can’t use the generic Kafka Connect sinks because our enriched event model is too rich, and we want to support some pretty advanced behaviors (e.g. hot mutation of RDBMS tables to accommodate evolving schemas), but yes, there’s nothing stopping us from writing our own sinks on top of Kafka Connect.
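To give a flavour of what such a sink could look like, here is a bare-bones SinkTask sketch - the class name and behaviour are purely illustrative. A real Snowplow sink would parse the enriched TSV, resolve self-describing JSONs against Iglu, and for an RDBMS target mutate tables as schemas evolve:

```scala
// Hypothetical, heavily simplified Kafka Connect sink for Snowplow enriched
// events. A real implementation would parse the enriched TSV, resolve schemas
// against Iglu and write to (and mutate) the downstream store.

import java.util.{Collection => JCollection, Map => JMap}
import org.apache.kafka.connect.sink.{SinkRecord, SinkTask}
import scala.collection.JavaConverters._

class SnowplowEnrichedSinkTask extends SinkTask {

  override def version(): String = "0.1.0-sketch"

  override def start(props: JMap[String, String]): Unit = {
    // Open connections to the downstream store (e.g. an RDBMS) here
  }

  override def put(records: JCollection[SinkRecord]): Unit =
    records.asScala.foreach { record =>
      val tsv = record.value().toString
      // Parse the enriched event and write it to the target store;
      // shown here as a placeholder
      println(s"Would sink an enriched event of ${tsv.split("\t", -1).length} fields")
    }

  override def stop(): Unit = {
    // Close connections here
  }
}
```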

I think that given the very similar semantics between Kafka, Kinesis and Event Hubs (albeit not Google Cloud Pub/Sub), there are some interesting opportunities for us around generalizing Kafka Connect further, so that it (or at least its patterns) can be reused across all these unified log technologies.

> This app is either meant to be a custom implementation, or there can be some kind of generic code with Groovy-based on-the-fly transformation, like in the Divolte collector, for instance.

Yes, we already support JavaScript-based enrichments:

We would love to add JRuby, Jython and Groovy to this. If you were interested in contributing any of these, please let us know!
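For anyone wondering how such a contribution might slot in: the JVM’s standard scripting API (JSR-223) is one plausible route, since Groovy, JRuby and Jython all ship JSR-223 engines. A rough sketch follows - the engine choice, function name and event shape are all invented for illustration and are not the actual Snowplow enrichment contract:

```scala
// Rough sketch of hosting a script-based enrichment via JSR-223.
// The engine name, function name and event representation are illustrative only.

import javax.script.{Invocable, ScriptEngineManager}

object ScriptEnrichmentSketch {

  // For Groovy this would require the groovy-jsr223 engine on the classpath;
  // "nashorn" (JavaScript) ships with Java 8 and needs no extra dependency.
  private val engine = new ScriptEngineManager().getEngineByName("nashorn")

  // A user-supplied enrichment: takes a field from the event, returns a derived value
  private val script =
    """
      |function process(appId) {
      |  return "enriched:" + appId;
      |}
    """.stripMargin

  def main(args: Array[String]): Unit = {
    engine.eval(script)
    val result = engine.asInstanceOf[Invocable].invokeFunction("process", "my-app")
    println(result) // enriched:my-app
  }
}
```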

Phew! Hopefully I’ve covered everything. I would sum up by saying: we are excited and impressed by what is happening in the Kafka/Confluent ecosystem, and we want to leverage as much of it as possible, but without reducing Snowplow’s capability-set, and without making things overly specific to Kafka in a multi-unified-log world.
