Aggregate Snowplow event metrics for analytics

Hello Experts,

We have set up our Snowplow pipeline on GCP GKE, and we are trying to compare the events hitting the Collector with the events inserted into the BigQuery good_events table for analytics purposes.

I am trying to understand the best approach to achieve this.

Any help would be much appreciated.

@mike

Hi @ashish_george, you should be able to find the relevant telemetry metrics in GCP Monitoring to compare these.

For the Collector you are looking for Load Balancer metrics - specifically the rate of 2xx + 3xx response codes from the Load Balancer → this is the number of requests the Collector processed. The loadbalancing.googleapis.com/https/request_count metric, with a filter on response codes and on your specific load balancer, should give you what you need.

You can then look at ingest rates on the PubSub topics, where the “raw” topic is what the Collector has published to → again, the number of requests received. The pubsub.googleapis.com/topic/send_message_operation_count metric as a SUM should give you what you want.

In BigQuery itself you can monitor the bigquery.googleapis.com/storage/uploaded_row_count metric (this metric is often slow to update, so be careful).
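
As a rough illustration, a daily sum over that metric in MQL could look something like the sketch below - the dataset name is just a placeholder, and if I remember correctly the metric sits on the bigquery_dataset resource, so adjust the fetch and filter to match your project:

fetch bigquery_dataset
| metric 'bigquery.googleapis.com/storage/uploaded_row_count'
| filter resource.dataset_id == 'snowplow'
| group_by 1d, [uploaded_rows: sum(value.uploaded_row_count)]
| every 1d
| group_by [], [uploaded_rows_total: sum(uploaded_rows)]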

Hope this helps!

1 Like

Thank you @josh for pointing me in the right direction.
I did set up the monitoring like you suggested, but I found some values are not matching.
I have set up the below monitoring.

For the Collector load balancer: loadbalancing.googleapis.com/https/request_count filtered by the load balancer name and status code >= 200 and < 400, and I get a count of 3.03k/day.
For the Collector raw good PubSub topic: I used pubsub.googleapis.com/topic/send_request_count, as pubsub.googleapis.com/topic/send_message_operation_count was showing as deprecated, and the event count was nearly 1M/day.

Similarly, I set up the metrics below and got these values.
Load Balancer request_count (count 3.03k/day)
Collector raw good topic (count 1.072M/day)
Collector raw bad topic (count 0)
Enriched good topic (count 913.18K/day)
Enriched bad topic (count 3.29k/day)
Types topic (count 3.16M/day)
BQ good_events (count 129.32M/day)

I had a couple of doubts:

  1. The LB request_count shows 3.03k/day and the Collector raw topic shows 1.072M/day. Any idea why there is such a huge difference here? Not sure if I am missing something.
  2. I was trying to tally the Collector raw topic count (1.072M/day) with what is in BQ (129.32M/day). There seems to be a big difference; is this because of duplicate entries in BQ due to insert retries?
  3. Any idea why the types topic has a count of 3.16M/day?

I am finding it a bit hard to tally the values; not sure if I am understanding the data correctly.

Thanks Josh.

So some discrepancy is expected, as these metrics are somewhat sampled, so I would not expect exact counts - however, they should generally be within a 10% tolerance, so something is definitely not right here!

For the LB request_count, can you check that your time window is set correctly? A POST to the Collector should be an event pushed to the raw topic, so these should be very close.

For the others I am not sure - before we dig too deeply into it, could you share the exact queries you are issuing against GCP Monitoring, just to check there is no bug in the query?

Yes, the LB request_count time window is set for a day, and below is the query I set.

fetch https_lb_rule
| metric 'loadbalancing.googleapis.com/https/request_count'
| filter
    (resource.url_map_name
     == 'k8s2-um-xx-snowplow-collector-ingress-xxx')
    && (metric.response_code_class >= 200 && metric.response_code_class < 400)
| group_by 1d, [row_count: row_count()]
| every 1d
| group_by [metric.response_code], [row_count: row_count()]

For all the other topics, below is the query used.

fetch pubsub_topic
| metric 'pubsub.googleapis.com/topic/send_request_count'
| filter (resource.topic_id == 'collector-prod-raw-good')
| group_by 1d,
    [value_send_request_count_aggregate: aggregate(value.send_request_count)]
| every 1d
| group_by [resource.topic_id],
    [value_send_request_count_aggregate_aggregate:
       aggregate(value_send_request_count_aggregate)]

Hi @ashish_george, I am wondering if you are not using the correct aligner and reducer in your query to GCP Monitoring.

The query I have for the Collector LB:

{
  "dataSets": [
    {
      "timeSeriesFilter": {
        "filter": "metric.type=\"loadbalancing.googleapis.com/https/request_count\" resource.type=\"https_lb_rule\" metric.label.\"response_code_class\">=\"200\" metric.label.\"response_code_class\"<\"400\" resource.label.\"url_map_name\"=\"k8s-um-REDACTED\" resource.label.\"project_id\"=\"REDACTED\"",
        "minAlignmentPeriod": "86400s",
        "aggregations": [
          {
            "perSeriesAligner": "ALIGN_SUM",
            "crossSeriesReducer": "REDUCE_SUM",
            "alignmentPeriod": "86400s",
            "groupByFields": []
          },
          {
            "crossSeriesReducer": "REDUCE_NONE",
            "alignmentPeriod": "60s",
            "groupByFields": []
          }
        ]
      },
      "targetAxis": "Y1",
      "plotType": "LINE"
    }
  ],
  "options": {
    "mode": "COLOR"
  },
  "constantLines": [],
  "timeshiftDuration": "0s",
  "y1Axis": {
    "label": "y1Axis",
    "scale": "LINEAR"
  }
}

By default this metric is a “rate”, so you need to align as a SUM and reduce as a SUM to get an actual count of requests out of it.
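
If you are working in the MQL editor rather than the dashboard JSON, a rough MQL sketch of the same ALIGN_SUM / REDUCE_SUM idea would be something along these lines (the URL map name here is just a placeholder):

fetch https_lb_rule
| metric 'loadbalancing.googleapis.com/https/request_count'
| filter
    resource.url_map_name == 'YOUR_URL_MAP_NAME'
    && (metric.response_code_class >= 200 && metric.response_code_class < 400)
| align delta(1d)
| every 1d
| group_by [], [request_count_sum: sum(value.request_count)]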

1 Like

I tried using your query, but it is throwing an error at line 2 (“dataSets”: [), so I referenced your query and modified mine as below.

fetch https_lb_rule
| metric 'loadbalancing.googleapis.com/https/request_count'
| filter
    resource.project_id == '12345678'
    &&
    (resource.url_map_name
     == 'k8s2-um-snowplow-newsid-prod-collector-ingres-hhyat')
    && (metric.response_code_class >= 200 && metric.response_code_class < 400)
| group_by 1d, [value_request_count_aggregate: aggregate(value.request_count)]
| every 1d
| group_by [],
    [value_request_count_aggregate_aggregate:
       aggregate(value_request_count_aggregate)]

And I changed my Pub/Sub queries for all the topics as well, as shown below.

fetch pubsub_topic
| metric 'pubsub.googleapis.com/topic/send_message_operation_count'
| filter (resource.topic_id == 'sp-prod-raw-good')
| group_by 1d,
    [value_send_message_operation_count_aggregate:
       aggregate(value.send_message_operation_count)]
| every 1d
| group_by [],
    [value_send_message_operation_count_aggregate_aggregate:
       aggregate(value_send_message_operation_count_aggregate)]

Below are the counts I got for all the topics after updating the queries.

Metric (GCP resource): count
Collector LB request_count (Load Balancer): 158.54M
Collector raw good (PubSub topic): 133.62M
Collector raw bad (PubSub topic): 0
Enriched good (PubSub topic): 138.01M
Enriched bad (PubSub topic): 4.03K
Stream Loader types (PubSub topic): 13.8M
good_events (BigQuery table): 136.83M

Again, how can the enriched good event count be greater than the Collector raw good event count?

Edit for posterity: While the below is true, it’s not likely relevant in this thread - after reading the context I realised that Josh’s explanations above are a better fit (as I clarify below with specifics). Leaving it here in case it does explain someone’s issue in future, but the specific issue here seems not to be this.

Again, how can the enriched good event count be greater than the Collector raw good event count?

The Collector raw stream contains payloads, each of which may or may not contain multiple events batched together.

The enriched good stream contains individual events. If you send a POST request containing 20 events to the Collector, the raw stream will have a count of 1 and the enriched stream will have a count of 20.

Thanks @Colm for your response. So these metrics make sense?

Looking at the context in this thread, I’m not sure my previous answer is actually helpful.

It is true that if you’re batching requests, you’d expect the raw count to be lower than the enriched count. But your numbers here are close enough that this isn’t likely the explanation (it depends on what batching settings you’ve set in the tracker). Apologies for that, I answered with a theory without looking at the context.

I actually think both of these things are explained well above in the thread.

Josh knows a lot more than me on this topic, but the numbers you have here seem to fit the interpretation that the under-the-hood sampling is responsible for the variance.

I would assume that the difference between the LB count and the rest might come down to what Josh explained earlier in the thread.

Everything else is close enough that sampling or metric lag reasonably explains the variance to me.

2 Likes

Thank you @Colm and @josh for helping out with this, really appreciate it, great help.

Any idea why the Collector raw bad event count could be 0? Since millions of events are hitting the Collector, it is hard to believe there is not even a single raw bad event.

2 Likes

It’s generally pretty hard to get bad rows emitted from the Collector unless you are doing so deliberately. You either need a seriously malformed querystring (one that Akka complains about parsing, which produces a generic_error) or a really large event (10 MB I think for PubSub, which produces a size_violation) that can’t be broken up into smaller individual events and sent to PubSub. In most pipelines both of these are reasonably rare occurrences.

2 Likes

Thanks @mike, as always.
So it’s safe to assume the Collector raw bad event count can be 0.
We have it all sorted out then.
Thanks @Colm @josh @mike for helping me out. Really appreciate it.

2 Likes