I have the Snowplow pipeline set up on GCP, with a few additional components beyond those mentioned in the setup guide.
I have the following GCS Loaders set up, each with its own PubSub subscription:
Collector Good GCS Loader
Collector Bad GCS Loader
Enricher Good GCS Loader
Enricher Bad GCS Loader
BQ Bad GCS Loader
The above-mentioned loaders write files at 5-minute intervals.
I was wondering if there was a way to get the count of events in each of these files. These numbers would help with the GCP dashboard that will be created soon.
GCS Loader doesn’t have direct support for this as far as I am aware. However, since every event corresponds to a line in the file, the number of lines is equal to the count of events.
You can use the approach described here to count the number of lines in every uploaded file. Basically, you create a new Google Cloud Function that is triggered whenever a new file is uploaded to the bucket, count the lines, and send the result wherever you want.
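As a rough sketch of what that Cloud Function could look like (assuming a 1st-gen Python function with a GCS finalise trigger and newline-delimited records in the loader's output; the function name `count_lines` and the logging destination are just placeholders, not part of the linked example):

```python
# requirements.txt: google-cloud-storage
from google.cloud import storage

storage_client = storage.Client()


def count_lines(event, context):
    """Triggered by google.storage.object.finalize on the loader's bucket.

    Assumes each record in the file ends with a newline, so the newline
    count approximates the event count.
    """
    bucket = storage_client.bucket(event["bucket"])
    blob = bucket.blob(event["name"])

    # Download the whole object; fine for small 5-minute files.
    data = blob.download_as_bytes()
    line_count = data.count(b"\n")

    # Writing to stdout lands in Cloud Logging, where a log-based metric
    # can feed the dashboard.
    print(f"{event['name']}: {line_count} events")
```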
The method you suggested works for all files created by the GCS Loaders, except for the ones created by the loader that writes the collector good records into the bucket. In those files a single record can span 2, 3, or more lines, so the line count is inconsistent.
The collector good records are Thrift serialised, so in this case line breaks alone won't give you an event count. These payloads are from before enrichment, so you are technically counting payloads rather than events - a single payload (e.g. one POST from a tracker) can contain multiple events, so you may have fewer payloads than events.
It’ll be part of these bytes here (as an int64). This is serialised at the moment, so you’ll want to deserialise these bytes into an object before you can easily read them out. I wrote an example of how to do this here - that was for bad rows specifically, but the same principle applies to these Thrift records.
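As an illustration only (not the linked example), here is a minimal sketch of deserialising one of these records in Python with thriftpy2, assuming the record is a CollectorPayload serialised with the standard Thrift binary protocol and that you have the collector-payload.thrift schema file available locally; the `timestamp` field read at the end is the i64 from that schema:

```python
# requirements.txt: thriftpy2
import thriftpy2
from thriftpy2.utils import deserialize

# Load the Snowplow collector payload schema (collector-payload.thrift,
# downloaded from the Snowplow repositories and kept next to this script).
collector_thrift = thriftpy2.load(
    "collector-payload.thrift", module_name="collector_payload_thrift"
)


def parse_collector_payload(raw_bytes: bytes):
    """Deserialise one raw collector record into a CollectorPayload object."""
    return deserialize(collector_thrift.CollectorPayload(), raw_bytes)


# Example usage: read the collector timestamp (an i64 in the schema).
# payload = parse_collector_payload(record_bytes)
# print(payload.timestamp)
```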