Count events in GCS Loader created file

Hey,

I have the Snowplow pipeline set up on GCP with a few components beyond those mentioned in the setup guide.

I have the following GCS Loaders set up, each with its own PubSub subscription:

  • Collector Good GCS Loader
  • Collector Bad GCS Loader
  • Enricher Good GCS Loader
  • Enricher Bad GCS Loader
  • BQ Bad GCS Loader

These loaders write at 5-minute intervals.

I was wondering if there was a way to get the count of events in each of these files. These numbers would help with the GCP dashboard that will be created soon.

Thank You!

Hey @siv,

GCS Loader doesn’t have direct support for this as far as I am aware. However, since every event corresponds to a line in the file, the number of lines is equal to the count of events.

You can use the approach described here to find the number of lines in every uploaded file. Basically, you create a Google Cloud Function that is triggered whenever a new file is uploaded to the bucket, counts the lines, and sends the result wherever you want.
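
For anyone wiring this up, here is a minimal Python sketch of the line-counting step only. It assumes the file is uncompressed and newline-delimited. `count_events` and `chunk_size` are names made up for this example, and the Cloud Function trigger and the download of the object are left out.

```python
import io


def count_events(stream, chunk_size=1 << 20):
    """Count newline-delimited records in a binary stream.

    Reads in chunks so a large file never has to fit in memory.
    """
    count = 0
    last = b"\n"  # pretend an empty stream ended on a newline
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        count += chunk.count(b"\n")
        last = chunk[-1:]
    # A final record without a trailing newline still counts as one record.
    if last != b"\n":
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in for a downloaded file: two records, one per line.
    sample = io.BytesIO(b'{"e":"pv"}\n{"e":"ue"}\n')
    print(count_events(sample))
```

If the loader is configured to write gzip output, wrapping the stream with `gzip.open(stream)` before counting should work the same way.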

Hey @enes_aldemir

Thank you for the response!

The method you suggested works for all GCS Loader created files except those from the loader that writes the collector good records to the bucket. In those files each record spans 2, 3 or more lines, so the line count does not match the record count.

The collector good records are Thrift serialised, so in this case line breaks alone won’t give you an event count. These payloads are pre-enrichment, so you are technically counting payloads rather than events, and you may see fewer payloads than events.

Hi @mike,

So essentially a single payload can contain multiple events, right?

Also, where can I find the definitions for the payload_data properties?

Thank you

Yes, a single payload (sent to the collector) can have multiple events associated with it.

You can find these definitions in the tracker protocol reference here.
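
To make the payload vs. event distinction concrete: the POST body of a payload is a `payload_data` JSON envelope whose `data` array holds one entry per event (the sample record posted later in this thread has a single entry). Below is a rough sketch of counting events from an already-extracted body string. `events_in_payload` is a hypothetical helper, and treating a payload with no body as one event is an assumption about GET-style requests, not something confirmed in this thread.

```python
import json


def events_in_payload(body):
    """Return the number of events carried by one collector payload body."""
    if not body:
        # Assumption: no body means a GET-style request carrying one event.
        return 1
    envelope = json.loads(body)
    # Each element of "data" is one tracked event.
    return len(envelope.get("data", []))


if __name__ == "__main__":
    body = json.dumps({
        "schema": "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4",
        "data": [{"e": "ue", "p": "mob"}, {"e": "pv", "p": "mob"}],
    })
    print(events_in_payload(body))
```

Summing this over all payloads in a file would give the event count rather than the payload count.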

Thanks a ton @mike!

Also, can collector_tstamp be found in the Thrift record?

Yes - collector timestamp is the timestamp field in the Thrift record (as an epoch timestamp).

I tried finding it in the following record. Could you help me locate it?

d   
     125.16.91.5
 �  |��xO
          �   UTF-8
                    �   ssc-2.3.0-googlepubsub
                                              ,   snowplow/andr-2.2.1 android/11
                                                                                @   #/com.snowplowanalytics.snowplow/tp2
                                                                                                                        T
�{"schema":"iglu:com.snowplowanalytics.snowplow\/payload_data\/jsonschema\/1-0-4","data":[{"eid":"a1e92e28-a4e8-425c-8817-670f7861d8cb","res":"1080x1794","tv":"andr-2.2.1","e":"ue","tna":"AndroidApp","tz":"Asia\/Calcutta","stm":"1635843273771","p":"mob","cx":"eyz.....","ue_px":"eyJ....","ue_px":"eyJ....","dtm":"1635843273580","lang":"English","aid":"android"}]}^fQ.BpsuH7BZsaVuhXLvJduJR-pJa9Q9BjPsphgCzSjDoVDg0GeZuL2RpPdxcByjqy__cV_LKMzoxTm4lbUoQDi61uP63XDNxwvgr0bypfWcwU4KT9DsFDIiP_74hCk0_bQTythshTBFb__fjwjYMj-4SqdDGwef_6w6uTOfkqloV-eqbwdOY_AELXfKgZsS_Lj4lbonE46DHZbJMMvB_GHWjBEVDLZFNfNSU4YYW0otMAx24_yzAGhyjFYTE3Fiyniy5ZEsWowGFmZlrXTHLxjvkFsR6K3sTtidZGYPc_w6_JLUcCeLxGkQh      imeout-Access: <function1>   4Host: snowploinal-pw-collector-uat-zxbopxaa2q-an.a.run.app   *User-Agent: snowplow/andr-2.2.1 android/11   2x-api-key: AIza.......   Nx-cloud-trace-context: 268b7233379d32922a6cb42bfd2c3f8d/291564721745030813;o=1   x-client-data: CgSM6ZsV   Dtraceparent: 00-268b7233379d32922a6cb42bfd2c3f8d-040bd891d4a
5ee9d-01   ,X-Forwarded-For: 125.16.91.5, 107.178.234.94   X-Forwarded-Proto: https   Iforwarded: for="125.16.91.5";proto=https,for="107.178.234.94";proto=https   2x-request-id: 11cd7c31-de5b-4b86-9d5d-9cb872f616f6   'x-apigateway-api-consumer-type: PROJECT   .x-apigateway-api-consumer-number: 151289950298  �
Authorization: Bearer eyJ............   :x-envoy-original-path: /com.snowplowanalytics.snowplow/tp2   Accept-Encoding: gzip   application/json
                                                                                   h   application/json
                                                                                                       �   .snowplow-collector-uat-zxbopxaa2q-an.a.run.app
                                                                                                                                                          �   $08068585-a28e-4547-9ade-eaf0cfa4a29b
                                                                                                                                                                                                   zi   Aiglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0

Thank you!

It’ll be part of these bytes here (as an int64). The record is still serialised at this point, so you’ll want to deserialise the bytes into an object before you can easily read them out. I wrote an example of how to do this here - it was for bad rows specifically, but the same principle applies to these Thrift records.
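
As an illustration of what that deserialisation involves, here is a stdlib-only Python sketch that walks the top-level fields of one record and pulls out the int64. It rests on several assumptions: the record uses the Thrift binary protocol, you have already isolated the bytes of a single record, and the timestamp is field id 200 as in the CollectorPayload Thrift schema. The function names are made up for this example. In practice a proper Thrift library with the generated CollectorPayload class is the safer route.

```python
import struct

# Thrift binary-protocol type codes
T_STOP, T_BOOL, T_BYTE, T_DOUBLE = 0, 2, 3, 4
T_I16, T_I32, T_I64, T_STRING = 6, 8, 10, 11
T_STRUCT, T_MAP, T_SET, T_LIST = 12, 13, 14, 15

FIXED_SIZES = {T_BOOL: 1, T_BYTE: 1, T_DOUBLE: 8, T_I16: 2, T_I32: 4, T_I64: 8}

TIMESTAMP_FIELD_ID = 200  # assumption: taken from the CollectorPayload schema


def skip_value(buf, pos, ttype):
    """Return the offset just past a value of the given Thrift type."""
    if ttype in FIXED_SIZES:
        return pos + FIXED_SIZES[ttype]
    if ttype == T_STRING:
        (length,) = struct.unpack_from(">i", buf, pos)
        return pos + 4 + length
    if ttype in (T_LIST, T_SET):
        elem_type, count = struct.unpack_from(">bi", buf, pos)
        pos += 5
        for _ in range(count):
            pos = skip_value(buf, pos, elem_type)
        return pos
    if ttype == T_MAP:
        key_type, val_type, count = struct.unpack_from(">bbi", buf, pos)
        pos += 6
        for _ in range(count):
            pos = skip_value(buf, pos, key_type)
            pos = skip_value(buf, pos, val_type)
        return pos
    if ttype == T_STRUCT:
        while True:
            ftype = buf[pos]
            pos += 1
            if ftype == T_STOP:
                return pos
            pos = skip_value(buf, pos + 2, ftype)  # +2 skips the field id
    raise ValueError(f"unknown Thrift type {ttype}")


def collector_timestamp(record):
    """Return the raw int64 timestamp of one serialised record, or None."""
    pos = 0
    while pos < len(record):
        ftype = record[pos]
        if ftype == T_STOP:
            break
        (field_id,) = struct.unpack_from(">h", record, pos + 1)
        pos += 3  # one type byte plus a two-byte field id
        if field_id == TIMESTAMP_FIELD_ID and ftype == T_I64:
            (value,) = struct.unpack_from(">q", record, pos)
            return value
        pos = skip_value(record, pos, ftype)
    return None
```

If the value is in milliseconds (an assumption worth checking against your own data), dividing it by 1000 and passing it to `datetime.fromtimestamp(..., tz=timezone.utc)` gives a readable time.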
