Basically, I need to test the messages flowing from the collector to Bq(whole pipeline by count). So, we are storing both good and bad events into GCS for future reference. Hence, we are maintaining gcs loader for every component so that we can have a copy of the messages for each component.
If you want to count the payloads, you can just count the number of lines in that file (assuming you are line delimiting them) or count the number of messages in the PubSub raw queue.
If you want to count the number of events then as Ben mentioned you’ll need to decode the serialised Thrift record using the collector-payload schema and count the events that way (as one payload may contain 1 or more events).
Thank you @BenB for the clarification. So, similarly, we can count enrichers’ output as well for count validation with one another.
I mean it should be the number of lines in collector’s good record = the number of lines in enricher’s good record + the number of lines in enricher’s bad record as input for enricher is processed events of the collector.
It depends on your tracking. 1 tracking event == 1 collector payload. But tracking event/collector payload can contain multiple actual events, depending on your tracking implementation, and if that’s the case then several enriched events and bad rows can correspond to one tracking event/collector payload.
Yes - technically it is around the current timestamp, namely the filename is based on the UTC end time of the window - which is why you’ll see the minutes ending in 0 or 5 if your windows are 5 minutes long.