I am new to Snowplow and I'm having a problem where I want to use a value in the enriched data, a user id, as the key for a key-value message when Snowplow sinks to Kafka. Right now, for each message Snowplow sinks to Kafka, the key is a uniquely generated string and the value is a tab-separated string. However, I want the key to be the user id, the (nth) column in the value, so that each user's data goes to the same partition every time. Is there any way I can choose the value of the key? If not, are there any other ways I could solve this problem? Any help would be great.
I'm not experienced with Kafka, but I believe what you want can be achieved with the Snowplow Analytics SDKs, particularly the EventTransformer from the Scala SDK. Using EventTransformer, you can parse the enriched event TSV into a JSON object whose keys correspond to fields in our canonical event model, so you can group (or re-key, in Kafka terms) enriched events by user_id.
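For example, a minimal sketch of that idea using the Scala Analytics SDK. Note that the package path and the exact return type of EventTransformer.transform vary between SDK versions, and the json4s parsing step is just one way to read the resulting JSON, so treat this as illustrative rather than copy-paste ready:

```scala
import com.snowplowanalytics.snowplow.analytics.scalasdk.json.EventTransformer
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Parse one enriched event TSV line and pull out the user_id field, if present.
def extractUserId(tsvLine: String): Option[String] =
  EventTransformer.transform(tsvLine) match {   // Right(jsonString) on success in recent SDK versions
    case Right(json) =>
      (parse(json) \ "user_id") match {
        case JString(uid) => Some(uid)
        case _            => None               // user_id can be null or absent
      }
    case Left(_) => None                        // the line could not be parsed as an enriched event
  }
```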
At the moment I don’t think this is possible inside Snowplow.
However, as Anton suggested, you can always reprocess the enriched data in any way you see fit; applying a different partitioning scheme is one such option.
This could easily be done with a streaming framework such as Kafka Streams or Spark Streaming, for example, as in the sketch below.
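Here is a minimal Kafka Streams sketch in Scala that re-keys the enriched stream by the user_id column and writes it to a new topic, so Kafka's default partitioner sends each user's events to the same partition. The topic names and the user_id column index are placeholders you would adjust to your own deployment and enriched event format version:

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.{Consumed, KeyValueMapper, Produced}

object RekeyByUserId extends App {
  // Placeholder topic names and column index -- adjust to your setup.
  val inputTopic  = "enriched-good"
  val outputTopic = "enriched-good-by-user"
  val userIdIndex = 12   // position of user_id in the enriched TSV; verify against your format version

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "rekey-by-user-id")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  builder
    .stream(inputTopic, Consumed.`with`(Serdes.String(), Serdes.String()))
    // Replace the generated key with the user_id column from the TSV value;
    // messages with the same key always land on the same partition.
    .selectKey[String](new KeyValueMapper[String, String, String] {
      override def apply(key: String, value: String): String =
        value.split("\t", -1)(userIdIndex)
    })
    .to(outputTopic, Produced.`with`(Serdes.String(), Serdes.String()))

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```

Downstream consumers would then read from the re-keyed topic instead of the original enriched topic, getting per-user ordering within each partition.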