Count number of messages produced by GCSloader for collector

Hi,

I am building an automation framework for testing, so I am writing Java code to count the number of messages produced by the GCS loader for the collector, to make sure nothing is missing from the tracker.

Question: what is the simplest way to count the number of messages? I ask because I see the collector output format is Thrift.

Thanks,
Hanumanth

Hi @Hanumanth ,

What is the input of your test? Is it data on GCS?

If what you want to test is the collector, you don’t need the GCS loader; you can read the output of the collector directly from PubSub.

One Thrift payload == 1 tracker event, so all you need to do is count the payloads. If you need to do some parsing, you need to use something like this.

Hi @BenB ,

Basically, I need to test the messages flowing from the collector to BigQuery (the whole pipeline, by count). We are storing both good and bad events in GCS for future reference, so we are running a GCS loader for every component in order to keep a copy of the messages at each stage.

Attaching an example payload file from the collector: uat_gcsloader_collector_2021_11_03_01_output-2021-11-03T01_30_00.000Z-2021-11-03T01_35_00.000Z-pane-0-last-00000-of-00001.txt


Could you suggest the best way to count the payloads, looking at this?

Thanks,
Hanumanth

If you want to count the payloads, you can just count the number of lines in that file (assuming the payloads are line-delimited) or count the number of messages in the PubSub raw queue.
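Counting the lines of one output file, as described above, could be sketched like this in Java. This is a minimal illustration, not part of the GCS loader: the `PayloadCounter` class name is made up, and it assumes one line-delimited payload per line, as mentioned.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper: counts the line-delimited payloads in one
// GCS loader output file (one payload per line).
public class PayloadCounter {

    public static long countLines(Path file) {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a downloaded GCS loader output file.
        Path file = Files.createTempFile("collector-output", ".txt");
        Files.write(file, List.of("payload1", "payload2", "payload3"));
        System.out.println(countLines(file)); // prints 3
    }
}
```

For real runs you would point `countLines` at each file downloaded from the bucket and sum the results.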

If you want to count the number of events then as Ben mentioned you’ll need to decode the serialised Thrift record using the collector-payload schema and count the events that way (as one payload may contain 1 or more events).


Can we convert the Thrift output format into CSV using the below option? Is it available?

No, you can’t. If you want to parse Thrift, you need dedicated code for it (the code that I shared).

Why do you want to convert to CSV if all you need is the count? As Mike said, for that you only need to count the number of lines in your files.

Thank you @BenB for the clarification. So, similarly, we can count the enricher’s output as well and cross-validate the counts.

I mean it should be: number of lines in the collector’s good records = number of lines in the enricher’s good records + number of lines in the enricher’s bad records, since the enricher’s input is the collector’s processed events.
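That sanity check could be sketched in Java like this. The class and method names are hypothetical, and note the caveat that this equality only holds if each collector payload carries exactly one event:

```java
// Hypothetical count-validation helper for the pipeline test,
// assuming one event per collector payload.
public class CountCheck {

    // True when every collector payload is accounted for downstream:
    // collector good == enricher good + enricher bad.
    public static boolean isBalanced(long collectorGood,
                                     long enricherGood,
                                     long enricherBad) {
        return collectorGood == enricherGood + enricherBad;
    }

    public static void main(String[] args) {
        // Example counts, purely illustrative.
        System.out.println(isBalanced(1000, 990, 10)); // prints true
        System.out.println(isBalanced(1000, 990, 5));  // prints false
    }
}
```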

It depends on your tracking. 1 tracking request == 1 collector payload, but a collector payload can contain multiple actual events, depending on your tracking implementation. If that’s the case, then several enriched events and bad rows can correspond to one collector payload.

Hi @BenB ,

Thank you for the information. Just one more question: how does the GCS loader create the following directory? Is it based on current_timestamp or something else?

It’s based on the current date/time of the partition. If you’d like to change it, you can do so by modifying the --dateFormat option of the GCS loader.

Thanks, @mike. We thought it was based on collector_timestamp or some other timestamp captured in Snowplow events.

So, it is the current date at which the GCS loader is loading the data.

Yes - technically it is around the current timestamp: the filename is based on the UTC end time of the window, which is why you’ll see the minutes ending in 0 or 5 if your windows are 5 minutes long.
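Based on the file name shown earlier in this thread, the UTC window end time could be recovered like this. This is a sketch, not GCS loader code: `WindowEndParser` is a hypothetical helper, and it assumes the colons in the timestamps are replaced by underscores, as in the example name.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the window end time from a GCS loader
// output file name of the form seen in this thread.
public class WindowEndParser {

    // Timestamps in the file name use underscores in place of colons,
    // e.g. 2021-11-03T01_35_00.000Z
    private static final Pattern TS =
        Pattern.compile("\\d{4}-\\d{2}-\\d{2}T\\d{2}_\\d{2}_\\d{2}\\.\\d{3}Z");

    // Returns the second of the two timestamps, i.e. the UTC end of the window.
    public static Instant windowEnd(String fileName) {
        List<String> stamps = new ArrayList<>();
        Matcher m = TS.matcher(fileName);
        while (m.find()) {
            stamps.add(m.group());
        }
        if (stamps.size() < 2) {
            throw new IllegalArgumentException("No window timestamps in: " + fileName);
        }
        // Restore the colons before parsing as an ISO-8601 instant.
        return Instant.parse(stamps.get(1).replace('_', ':'));
    }

    public static void main(String[] args) {
        String name = "uat_gcsloader_collector_2021_11_03_01_output-"
            + "2021-11-03T01_30_00.000Z-2021-11-03T01_35_00.000Z-"
            + "pane-0-last-00000-of-00001.txt";
        System.out.println(windowEnd(name)); // prints 2021-11-03T01:35:00Z
    }
}
```

With the window end in hand, the test framework can group files by window and compare counts across components for the same window.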