I have the Snowplow pipeline set up on GCP, with a few additional components beyond those mentioned in the setup guide.
I have the following GCS Loaders set up, each with its own PubSub subscription:
Collector Good GCS Loader
Collector Bad GCS Loader
Enricher Good GCS Loader
Enricher Bad GCS Loader
BQ Bad GCS Loader
The above-mentioned loaders write files at 5-minute intervals.
I was wondering if there was a way to get the count of events in each of these files. These numbers would help with the GCP dashboard that will be created soon.
GCS Loader doesn’t have direct support for this as far as I am aware. However, since every event corresponds to a line in the file, the number of lines is equal to the count of events.
You can use the approach described here to count the number of lines in every uploaded file. Basically, you create a new Google Cloud Function that is triggered whenever a new file is uploaded to the bucket, count the lines, and send the result wherever you want.
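As a rough sketch of what that Cloud Function could look like (assuming a 1st-gen Python function with a GCS finalise trigger and newline-delimited records in the loader's output; the function name `count_lines` and the logging destination are just placeholders, not part of the linked example):

```python
# requirements.txt: google-cloud-storage
from google.cloud import storage

storage_client = storage.Client()


def count_lines(event, context):
    """Triggered by google.storage.object.finalize on the loader's bucket.

    Assumes each record in the file ends with a newline, so the newline
    count approximates the event count.
    """
    bucket = storage_client.bucket(event["bucket"])
    blob = bucket.blob(event["name"])

    # Download the whole object; fine for small 5-minute files.
    data = blob.download_as_bytes()
    line_count = data.count(b"\n")

    # Writing to stdout lands in Cloud Logging, where a log-based metric
    # can feed the dashboard.
    print(f"{event['name']}: {line_count} events")
```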
The method you suggested works for all files created by the GCS Loaders, except for the ones created by the loader that writes the collector good records into the bucket. In those files a single record can span 2, 3, or more lines, so the line count is inconsistent.
The collector good records are Thrift serialised, so in this case line breaks alone won't give you an event count. These payloads are from before enrichment, so you are technically counting payloads rather than events - a single payload (e.g. one POST from a tracker) can contain multiple events, so you may have fewer payloads than events.
It’ll be part of these bytes here (as an int64). This is serialised at the moment, so you’ll want to deserialise these bytes into an object before you can easily read them out. I wrote an example of how to do this here - that was for bad rows specifically, but the same principle applies to these Thrift records.
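As an illustration only (not the linked example), here is a minimal sketch of deserialising one of these records in Python with thriftpy2, assuming the record is a CollectorPayload serialised with the standard Thrift binary protocol and that you have the collector-payload.thrift schema file available locally; the `timestamp` field read at the end is the i64 from that schema:

```python
# requirements.txt: thriftpy2
import thriftpy2
from thriftpy2.utils import deserialize

# Load the Snowplow collector payload schema (collector-payload.thrift,
# downloaded from the Snowplow repositories and kept next to this script).
collector_thrift = thriftpy2.load(
    "collector-payload.thrift", module_name="collector_payload_thrift"
)


def parse_collector_payload(raw_bytes: bytes):
    """Deserialise one raw collector record into a CollectorPayload object."""
    return deserialize(collector_thrift.CollectorPayload(), raw_bytes)


# Example usage: read the collector timestamp (an i64 in the schema).
# payload = parse_collector_payload(record_bytes)
# print(payload.timestamp)
```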