Processing Collector's Raw Good Events

Hi Folks,

I am not sure about the data format the collector uses for the good stream. I am using Kafka for streaming, and I see messages like the one below that I cannot quite make sense of.

Is it thrift binary payload?

Another question: if I want to load these raw events into storage or use them for stream processing, any idea how I can achieve that?

Payload:

d	127.0.0.1
��^?��UTF-8�ssc-2.9.2-kafka,
curl/8.1.2@#/com.snowplowanalytics.snowplow/tp2T�{
  "schema": "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4",
  "data": [
    {
      "e": "pv",
      "url": "https://snowplow.io/",
      "page": "Snowplow: behavioral data creation leader",
      "eid": "2ec13853-ef9e-414a-8976-e64a4693b134",
      "tv": "js-3.5.0",
      "tna": "biz1",
      "aid": "website",
      "p": "web",
      "cookie": "1",
      "cs": "UTF-8",
      "lang": "en-GB",
      "res": "3440x1440",
      "cd": "30",
      "tz": "Europe/London",
      "dtm": "1665398554084",
      "vp": "693x1302",
      "ds": "678x8188",
      "vid": "29",
      "sid": "be9520e7-16a5-4d4e-afa1-8e269f99a1cf",
      "duid": "85094061-f702-4b62-a46d-20f7226b4741",
      "stm": "1665398554084"
    }
  ]
}^eTimeout-Access: <function1>Host: localhost:8080User-Agent: curl/8.1.2Accept: */*application/jsonhapplication/json�	localhost�$1a4a9cec-a1af-4d6a-a0ae-5c93b79be74cziAiglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0

Yes - it’s a Thrift binary record. You can decode it using any Thrift library and the corresponding schema. More information in the docs here.
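
To make that concrete, here is a minimal sketch in Scala of decoding one of those records, assuming the generated CollectorPayload class from Snowplow's collector-payload-1 Thrift artifact is on the classpath (the package/class path and getters are my assumption from that generated model, so adjust them to whatever your generated code produces):

```scala
import java.nio.file.{Files, Paths}

import org.apache.thrift.TDeserializer
import org.apache.thrift.protocol.TBinaryProtocol

// Generated Thrift model published by Snowplow (the collector-payload-1 artifact);
// the package/class path below is my assumption - adjust to your generated code.
import com.snowplowanalytics.snowplow.CollectorPayload.thrift.model1.CollectorPayload

object RawEventDecoder {

  /** Decode one raw record, e.g. the value bytes of a Kafka ConsumerRecord. */
  def decode(bytes: Array[Byte]): CollectorPayload = {
    val payload = new CollectorPayload()
    // The collector serialises each record with Thrift's binary protocol
    val deserializer = new TDeserializer(new TBinaryProtocol.Factory())
    deserializer.deserialize(payload, bytes)
    payload
  }

  def main(args: Array[String]): Unit = {
    // For a quick test: dump one Kafka message to a file and pass its path here
    val bytes   = Files.readAllBytes(Paths.get(args(0)))
    val payload = decode(bytes)
    println(payload.getPath) // e.g. /com.snowplowanalytics.snowplow/tp2
    println(payload.getBody) // the payload_data JSON visible in the dump above
  }
}
```

The body field is the self-describing payload_data JSON you can see in your dump, so from there on it is ordinary JSON handling.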

Hi @mike, thank you. This is exactly what I was looking for. So the same CollectorPayload schema can be used to deserialize the raw events. That is pretty straightforward in that case.

How about the bad stream? I think the format for that would be JSON, but I am not sure about the schema.

@Jayant_Kumar Failed events are described here: Understanding failed events | Snowplow Documentation. Each has its own schema: Iglu Central.
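
If it helps, each failed event is a self-describing JSON, so the top-level "schema" field tells you which bad-row schema on Iglu Central to validate against. A small sketch with circe (the example bad-row URI is only illustrative):

```scala
import io.circe.parser.parse

object BadRowInspector {

  /** Extract the Iglu schema URI of a failed event from the bad stream. */
  def schemaOf(badRow: String): Either[String, String] =
    for {
      json   <- parse(badRow).left.map(_.message)
      schema <- json.hcursor.get[String]("schema").left.map(_.message)
    } yield schema

  def main(args: Array[String]): Unit = {
    // Illustrative failed-event envelope; real ones carry the failure details in "data"
    val example =
      """{"schema":"iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/2-0-0","data":{}}"""
    println(schemaOf(example))
  }
}
```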

I am curious about how you are planning to use these — ideally we would find a way for you to just run one of our standard loaders :slight_smile:

Thank you @stanch, that helps. I am planning to have at least two sinks for each of the good and bad streams, at both the collector (raw) and enrich layers.

  1. Cold Storage: Archive these events from Kafka to GCS. If needed, we can backfill enrichment later on. Since none of the existing loaders work for us, we are planning to leverage Apache Spark for it (a rough sketch follows below the list).

  2. Hot Storage: Store the events in a TSDB for analysis and reporting. We have an OpenTelemetry and Prometheus/Grafana setup.
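
For the cold-storage sink in point 1, here is a rough sketch of what the Spark side could look like with Structured Streaming, keeping the value bytes untouched so the archive can be re-enriched later. The broker, topic and bucket names are placeholders, and it assumes the spark-sql-kafka package and the GCS connector are on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object RawArchiver {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("snowplow-raw-archive")
      .getOrCreate()

    // Kafka source: each record value is a Thrift-serialised CollectorPayload
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
      .option("subscribe", "snowplow-raw-good")         // placeholder topic
      .load()

    // Keep the binary value as-is and add a date column for partitioning
    val query = raw
      .selectExpr("value", "timestamp", "date_format(timestamp, 'yyyy-MM-dd') AS dt")
      .writeStream
      .format("parquet") // Parquet preserves the binary column for later re-enrichment
      .option("path", "gs://my-bucket/raw/good/")                         // placeholder bucket
      .option("checkpointLocation", "gs://my-bucket/checkpoints/raw-good/")
      .partitionBy("dt")
      .start()

    query.awaitTermination()
  }
}
```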

It would be great if you could share your thoughts on this.

The only complexity I see here is fetching schemas from Iglu and using them in Spark/Flink, and partitioning the data by schema version and date so that it can evolve.
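
On the Iglu part: a static registry just serves schemas over HTTP at /schemas/{vendor}/{name}/{format}/{version}, so resolving an iglu: URI before handing the JSON Schema to Spark/Flink can be as small as the sketch below (Iglu Central is used as the example registry; a private registry would typically also need an API key header):

```scala
import scala.io.Source

object IgluFetcher {

  /** Resolve an iglu: URI against a registry, e.g.
    * iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4 */
  def fetch(registry: String, igluUri: String): String = {
    val path   = igluUri.stripPrefix("iglu:")
    val source = Source.fromURL(s"$registry/schemas/$path")
    try source.mkString finally source.close()
  }

  def main(args: Array[String]): Unit = {
    val schema = fetch(
      "http://iglucentral.com",
      "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4"
    )
    println(schema)
  }
}
```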