We have set up our SP on GCP GKE and are trying to compare the number of events hitting the collector with the number of events inserted into the BQ good_events table, for analytics purposes.
I am trying to understand what the best approach would be to achieve this.
Hi @ashish_george, you should be able to find the relevant telemetry metrics in GCP Monitoring to compare these.
For the Collector you are looking for Load Balancer metrics - specifically the rate of 2xx + 3xx response codes from the Load Balancer → this is the number of requests the Collector processed. The loadbalancing.googleapis.com/https/request_count with a filter on response codes and for your specific load balancer should give you what you need.
You can then look at ingest rates on the PubSub topics where the “raw” topic is what the Collector has published → again number of requests received. The pubsub.googleapis.com/topic/send_message_operation_count as a SUM should give you what you want.
In BigQuery itself (this metric is often slow to update so be careful) you can monitor the bigquery.googleapis.com/storage/uploaded_row_count.
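As a rough illustration of the three metrics above, here is a minimal Python sketch that only builds Monitoring filter strings. The metric types are the ones named in this post; the resource and label names (url_map_name, topic_id, response_code_class) and the placeholder values "collector-lb" and "raw" are assumptions, so please confirm them in Metrics Explorer for your own project.

```python
# Metric types are the ones named above; the resource and label names are
# assumptions to confirm in Metrics Explorer before relying on them.
LB_REQUESTS = "loadbalancing.googleapis.com/https/request_count"
PUBSUB_SENDS = "pubsub.googleapis.com/topic/send_message_operation_count"
BQ_ROWS = "bigquery.googleapis.com/storage/uploaded_row_count"


def lb_filter(url_map_name: str) -> str:
    """Filter for 2xx and 3xx responses served by one load balancer."""
    return (
        f'metric.type = "{LB_REQUESTS}" '
        f'AND resource.labels.url_map_name = "{url_map_name}" '
        "AND (metric.labels.response_code_class = 200 "
        "OR metric.labels.response_code_class = 300)"
    )


def topic_filter(topic_id: str) -> str:
    """Filter for publish operations on one Pub/Sub topic."""
    return (
        f'metric.type = "{PUBSUB_SENDS}" '
        f'AND resource.labels.topic_id = "{topic_id}"'
    )


def bq_filter() -> str:
    """Filter for rows uploaded to BigQuery (no table restriction here)."""
    return f'metric.type = "{BQ_ROWS}"'


if __name__ == "__main__":
    # "collector-lb" and "raw" are placeholder names, not real resources.
    for f in (lb_filter("collector-lb"), topic_filter("raw"), bq_filter()):
        print(f)
```

Each string is meant to be pasted into the filter of a Monitoring query, with the same time window and a SUM aggregation used for all three so that the counts are comparable.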
Thank you @josh for pointing me in the right direction.
I set up the monitoring as you suggested, but I found that some values do not match.
I have set up the monitoring below.
Similarly, I set up the metrics below and got these values.
Load Balancer request_count (Count 3.03k/day)
Collector raw good topic (Count 1.072M/day)
Collector raw bad topic (Count 0)
Enriched good topic (Count 913.18K/day)
Enriched bad topic (Count 3.29k/day)
Types topic (Count 3.16M/day)
BQ good_events (Count 129.32M/day)
I have a couple of doubts:
The LB request_count shows 3.03k/day while the collector raw topic shows 1.072M/day. Any idea why there is such a huge difference? I am not sure if I am missing something here.
I was trying to tally the collector raw topic count (1.072M/day) against what is in BQ (129.32M/day). There is a large difference; is this because of duplicate entries in BQ caused by insert retries?
Any idea why the types topic has a count of 3.16M/day?
I am finding it a bit hard to tally the values, and I am not sure if I am understanding the data correctly.
Some discrepancy is expected, as these metrics are somewhat sampled, so I would not expect exact counts. However, they should generally be within a 10% tolerance, so something is definitely not right here!
For the LB request_count can you check that your time window is set correctly? A POST to the Collector should be an event pushed to the raw topic so these should be very close.
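To make the 10% tolerance concrete, here is a minimal Python sketch using the two daily counts reported earlier in the thread. The helper name within_tolerance and the choice of the raw topic count as the baseline are assumptions made for illustration only.

```python
def within_tolerance(expected: float, observed: float, tolerance: float = 0.10) -> bool:
    """Return True if observed is within `tolerance` (a fraction) of expected."""
    if expected == 0:
        return observed == 0
    return abs(observed - expected) / expected <= tolerance


# Daily counts reported earlier in the thread.
lb_requests = 3_030        # Load Balancer request_count
raw_good = 1_072_000       # Collector raw good topic

print(within_tolerance(raw_good, lb_requests))  # False: far outside 10%
```

The same check can be run on any other pair of counts, as long as both were read over the same time window.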
For the others I am not sure. Before we dig too deeply into it, could you share the exact queries you are issuing against GCP Monitoring, just to check that there is no bug in the query?
Edit for posterity: While the below is true, it’s not likely relevant in this thread - after reading the context I realised that Josh’s explanations above are a better fit (as I clarify below with specifics). Leaving it here in case it does explain someone’s issue in future, but the specific issue here seems not to be this.
Again, how can the Enriched Good event count be greater than the Collector RAW Good event count?
The collector raw stream contains payloads, which may or may not be multiple events batched together.
The enriched good stream contains individual events. If you send a POST request containing 20 events to the collector, the raw stream will have a count of 1 and the enriched stream will have a count of 20.
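A small Python sketch of that counting difference follows. The payload structure here (a dict with a "data" list) is a hypothetical stand-in rather than the real collector format, and it assumes every event passes enrichment.

```python
# Hypothetical stand-in for collector payloads: each payload is one POST
# body whose "data" list holds the batched events.
payloads = [
    {"data": [{"e": "pv"}] * 20},  # one POST carrying 20 events
    {"data": [{"e": "pv"}]},       # one POST carrying a single event
]

raw_count = len(payloads)                               # messages on the raw topic
enriched_count = sum(len(p["data"]) for p in payloads)  # individual events after enrichment

print(raw_count, enriched_count)  # 2 21
```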
Looking at the context in this thread, I’m not sure my previous answer is actually helpful.
It is true that if you’re batching requests, you’d expect the raw count to be lower than the enriched count. But your numbers here are close enough that this isn’t likely the explanation (it depends on what batching settings you’ve set in the tracker). Apologies for that, I answered with a theory without looking at the context.
I actually think both of these things are explained well above in the thread:
Josh knows a lot more than me on this topic, but the numbers you have here seem to fit the interpretation that the under-the-hood sampling would be responsible for variance in the numbers.
I would assume that the difference between LB count and the rest might be down to this part:
Everything else is close enough that sampling or metric lag reasonably explains the variance to me.
Thank you @Colm and @josh for helping out with this. I really appreciate it; great help.
Any idea why the collector raw bad event count could be 0? Since millions of events are hitting the collector, it is hard to believe there is not even a single raw bad event.
It’s generally pretty hard to get bad rows emitted from the collector unless you are doing so deliberately. You either need a seriously malformed querystring (one that Akka cannot parse, giving a generic_error) or a really large event (10 MB I think for PubSub, giving a size_violation) that can’t be broken up into smaller individual events and sent to PubSub. In most pipelines both of these are reasonably rare occurrences.
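To illustrate why only an unsplittable event ends up as a size violation, here is a minimal Python sketch. It is not the collector's actual implementation: the function name split_for_publish, the greedy packing strategy and the 10 MB constant (taken from the "I think" above) are all assumptions.

```python
MAX_MESSAGE_BYTES = 10 * 1024 * 1024  # assumed limit, per the "10 MB I think" above


def split_for_publish(events: list[bytes], limit: int = MAX_MESSAGE_BYTES):
    """Greedily pack events into batches under `limit`.

    Returns (batches, oversized): any single event larger than `limit`
    cannot be split further and is set aside as a size violation.
    """
    batches, oversized = [], []
    current, current_size = [], 0
    for event in events:
        if len(event) > limit:
            oversized.append(event)
            continue
        if current and current_size + len(event) > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(event)
        current_size += len(event)
    if current:
        batches.append(current)
    return batches, oversized


# Tiny limit so the behaviour is easy to see.
batches, oversized = split_for_publish([b"a" * 40, b"b" * 70, b"c" * 150], limit=100)
print(len(batches), len(oversized))  # 2 1
```

A large batch of normal-sized events is simply split and published, so a bad row would only appear when one individual event exceeds the limit on its own.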
Thanks @mike, as always.
So it’s safe to assume the collector raw bad event count could be 0.
We have it all sorted out then.
Thanks @Colm, @josh and @mike for helping me out. I really appreciate it.