Processing Collector's Raw Good Events

Jayant_Kumar · October 26, 2023, 3:54am

Hi Folks,

I am not sure which data format the collector uses for the good stream. I am using Kafka for streaming, and I see messages like the one below that I cannot interpret.

Is it a Thrift binary payload?

My other question: if I need to load the raw events into storage or use them for stream processing, how can I achieve that?

Payload:

d	127.0.0.1
��^?��UTF-8�ssc-2.9.2-kafka,
curl/8.1.2@#/com.snowplowanalytics.snowplow/tp2T�{
  "schema": "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4",
  "data": [
    {
      "e": "pv",
      "url": "https://snowplow.io/",
      "page": "Snowplow: behavioral data creation leader",
      "eid": "2ec13853-ef9e-414a-8976-e64a4693b134",
      "tv": "js-3.5.0",
      "tna": "biz1",
      "aid": "website",
      "p": "web",
      "cookie": "1",
      "cs": "UTF-8",
      "lang": "en-GB",
      "res": "3440x1440",
      "cd": "30",
      "tz": "Europe/London",
      "dtm": "1665398554084",
      "vp": "693x1302",
      "ds": "678x8188",
      "vid": "29",
      "sid": "be9520e7-16a5-4d4e-afa1-8e269f99a1cf",
      "duid": "85094061-f702-4b62-a46d-20f7226b4741",
      "stm": "1665398554084"
    }
  ]
}^eTimeout-Access: <function1>Host: localhost:8080User-Agent: curl/8.1.2Accept: */*application/jsonhapplication/json�	localhost�$1a4a9cec-a1af-4d6a-a0ae-5c93b79be74cziAiglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0

mike · October 26, 2023, 6:07am

Yes - it’s a Thrift binary record. You can decode it using any Thrift library and the corresponding schema. More information in the docs here.
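For readers who want to see what that decoding involves, here is a minimal sketch of a reader for the Thrift binary protocol using only the Python standard library. The field-id to name mapping is an assumption based on the CollectorPayload Thrift definition (the ids line up with the bytes visible in the sample above, e.g. `d` is 100), so check it against the IDL before relying on it. The function names are made up for this example, and the official Thrift library with generated classes is likely the more robust choice.

```python
import struct

# Thrift binary-protocol type codes.
T_STOP, T_BOOL, T_BYTE, T_DOUBLE = 0, 2, 3, 4
T_I16, T_I32, T_I64, T_STRING = 6, 8, 10, 11
T_STRUCT, T_MAP, T_SET, T_LIST = 12, 13, 14, 15

# Fixed-width scalars: type code -> big-endian struct format.
SCALARS = {T_BYTE: ">b", T_DOUBLE: ">d", T_I16: ">h", T_I32: ">i", T_I64: ">q"}

# ASSUMPTION: field ids of CollectorPayload; verify against the Thrift IDL.
FIELD_NAMES = {
    100: "ipAddress", 200: "timestamp", 210: "encoding", 220: "collector",
    300: "userAgent", 310: "refererUri", 320: "path", 330: "querystring",
    340: "body", 350: "headers", 360: "contentType", 400: "hostname",
    410: "networkUserId", 31337: "schema",
}


def _unpack(buf, pos, fmt):
    """Read one fixed-width value at pos; return (value, new position)."""
    (value,) = struct.unpack_from(fmt, buf, pos)
    return value, pos + struct.calcsize(fmt)


def _read_value(buf, pos, ttype):
    """Read one value of the given Thrift type; return (value, new position)."""
    if ttype in SCALARS:
        return _unpack(buf, pos, SCALARS[ttype])
    if ttype == T_BOOL:
        value, pos = _unpack(buf, pos, ">b")
        return value != 0, pos
    if ttype == T_STRING:
        length, pos = _unpack(buf, pos, ">i")
        text = bytes(buf[pos:pos + length]).decode("utf-8", errors="replace")
        return text, pos + length
    if ttype == T_STRUCT:
        return _read_struct(buf, pos)
    if ttype in (T_LIST, T_SET):
        elem_type, pos = _unpack(buf, pos, ">b")
        count, pos = _unpack(buf, pos, ">i")
        items = []
        for _ in range(count):
            item, pos = _read_value(buf, pos, elem_type)
            items.append(item)
        return items, pos
    if ttype == T_MAP:
        key_type, pos = _unpack(buf, pos, ">b")
        val_type, pos = _unpack(buf, pos, ">b")
        count, pos = _unpack(buf, pos, ">i")
        result = {}
        for _ in range(count):
            key, pos = _read_value(buf, pos, key_type)
            result[key], pos = _read_value(buf, pos, val_type)
        return result, pos
    raise ValueError(f"unsupported Thrift type code {ttype}")


def _read_struct(buf, pos=0):
    """Read fields until the STOP marker; return ({field id: value}, new position)."""
    fields = {}
    while True:
        ttype, pos = _unpack(buf, pos, ">b")
        if ttype == T_STOP:
            return fields, pos
        field_id, pos = _unpack(buf, pos, ">h")
        fields[field_id], pos = _read_value(buf, pos, ttype)


def decode_collector_payload(raw):
    """Decode one raw good-stream record into a dict keyed by field name."""
    fields, _ = _read_struct(raw, 0)
    return {FIELD_NAMES.get(fid, fid): value for fid, value in fields.items()}
```

With a Kafka consumer, the bytes of each message value would be passed to `decode_collector_payload`; the `body` field should then hold the `payload_data` JSON shown in the sample.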

Jayant_Kumar · October 26, 2023, 6:36am

Hi @mike, thank you. This is exactly what I was looking for. So I think the same collector payload schema can be used to deserialize the raw events. That makes it pretty straightforward.

How about the bad stream? I think the format for that is JSON, but I am not sure about the schema.

stanch · October 26, 2023, 9:34am

@Jayant_Kumar Failed events are described here: Understanding failed events | Snowplow Documentation. Each has its own schema: Iglu Central.
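As a rough illustration of "each has its own schema", the sketch below assumes each bad-stream message is a self-describing JSON document with a top-level `schema` key, and tallies messages by that key. The function names and the sample schema keys in the usage note are illustrative only; the real keys are the ones listed in the failed events documentation.

```python
import json
from collections import Counter


def failed_event_schema(message):
    """Return the Iglu schema key of one bad-stream message (bytes or str)."""
    envelope = json.loads(message)
    return envelope["schema"]


def count_by_schema(messages):
    """Count failed events per schema key, e.g. to route each type to its own sink."""
    return Counter(failed_event_schema(m) for m in messages)
```

For example, feeding it two messages whose `schema` is a `schema_violations` key and one whose `schema` is an `adapter_failures` key would give counts of 2 and 1 respectively.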

I am curious about how you are planning to use these. Ideally, we would find a way for you to just run one of our standard loaders.

Jayant_Kumar · October 26, 2023, 10:13am

Thank you @stanch, it helps. I am planning to have at least two sinks for each of the good and bad streams, at both the collector (raw) and the enrich layers.

Cold Storage: Archive these events from Kafka to GCS. If needed, we can backfill enrichment later on. Since none of the existing loaders work for us, I am planning to use Apache Spark for this.
Hot Storage: Store the events in a TSDB for analysis and reporting. We have an OpenTelemetry and Prometheus-Grafana setup.

It would be great if you could share your thoughts on this.

Jayant_Kumar · October 26, 2023, 10:16am

The only complexity I see here is fetching schemas from Iglu and using them in Spark/Flink, and partitioning the data by schema version and date so that the layout can evolve with the schemas.
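To make the partitioning idea concrete, here is a small sketch that derives a partition path from an Iglu schema key and an event timestamp in epoch milliseconds. The `vendor=/name=/version=/date=` layout and the function name are hypothetical choices for illustration, not something an existing loader prescribes.

```python
import re
from datetime import datetime, timezone

# iglu:<vendor>/<name>/<format>/<model>-<revision>-<addition>
IGLU_URI = re.compile(
    r"^iglu:(?P<vendor>[^/]+)/(?P<name>[^/]+)/(?P<format>[^/]+)/(?P<version>\d+-\d+-\d+)$"
)


def partition_path(schema_uri, epoch_millis):
    """Build a partition path from a schema key and a timestamp (UTC date)."""
    match = IGLU_URI.match(schema_uri)
    if match is None:
        raise ValueError(f"not an Iglu schema URI: {schema_uri!r}")
    day = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc).date().isoformat()
    return (
        f"vendor={match['vendor']}/name={match['name']}"
        f"/version={match['version']}/date={day}"
    )
```

Applied to the `payload_data` schema key and the `stm` value from the sample payload, this yields `vendor=com.snowplowanalytics.snowplow/name=payload_data/version=1-0-4/date=2022-10-10`. In Spark or Flink the same function could be wrapped in a UDF or a key selector, so that each schema version lands in its own directory.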
