Aggregate Snowplow event metrics for analytics

Hello Experts,

We have set up our Snowplow pipeline on GCP GKE, and we are trying to compare the events hitting the Collector with the events inserted into the BigQuery good_events table for analytics purposes.

I am trying to understand the best approach to achieve this.

Any help would be much appreciated.

@mike

Hi @ashish_george, you should be able to find the relevant telemetry metrics in GCP Monitoring to compare these.

For the Collector you are looking for Load Balancer metrics - specifically the rate of 2xx + 3xx response codes from the Load Balancer → this is the number of requests the Collector processed. The loadbalancing.googleapis.com/https/request_count metric, with a filter on response codes and on your specific load balancer, should give you what you need.

You can then look at ingest rates on the PubSub topics, where the “raw” topic is what the Collector has published to → again, the number of requests received. The pubsub.googleapis.com/topic/send_message_operation_count metric as a SUM should give you what you want.

In BigQuery itself you can monitor the bigquery.googleapis.com/storage/uploaded_row_count metric (this metric is often slow to update, so be careful).
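
As a rough illustration, a daily sum over that metric in MQL could look something like the sketch below - the dataset name is just a placeholder, and if I remember correctly the metric sits on the bigquery_dataset resource, so adjust the fetch and filter to match your project:

fetch bigquery_dataset
| metric 'bigquery.googleapis.com/storage/uploaded_row_count'
| filter resource.dataset_id == 'snowplow'
| group_by 1d, [uploaded_rows: sum(value.uploaded_row_count)]
| every 1d
| group_by [], [uploaded_rows_total: sum(uploaded_rows)]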

Hope this helps!

1 Like

Thank you @josh for pointing me in the right direction.
I did set up the monitoring like you suggested, but I found some values are not matching.
I have set up the below monitoring.

For the Collector load balancer: loadbalancing.googleapis.com/https/request_count filtered by the load balancer name and status code >= 200 and < 400, and I get a count of 3.03k/day.
For the Collector raw good PubSub topic: I used pubsub.googleapis.com/topic/send_request_count, as pubsub.googleapis.com/topic/send_message_operation_count was showing as deprecated, and the event count was nearly 1M/day.

Similarly, I set up the metrics below and got these values.
Load Balancer request_count (count 3.03k/day)
Collector raw good topic (count 1.072M/day)
Collector raw bad topic (count 0)
Enriched good topic (count 913.18K/day)
Enriched bad topic (count 3.29k/day)
Types topic (count 3.16M/day)
BQ good_events (count 129.32M/day)

I had a couple of doubts:

  1. The LB request_count shows 3.03k/day and the Collector raw topic shows 1.072M/day. Any idea why there is such a huge difference here? Not sure if I am missing something.
  2. I was trying to tally the Collector raw topic count (1.072M/day) with what is in BQ (129.32M/day). There seems to be a big difference; is this because of duplicate entries in BQ due to insert retries?
  3. Any idea why the types topic has a count of 3.16M/day?

I am finding it a bit hard to tally the values; not sure if I am understanding the data correctly.

Thanks Josh.

So some discrepancy is expected, as these metrics are somewhat sampled, so I would not expect exact counts - however, they should generally be within a 10% tolerance, so something is definitely not right here!

For the LB request_count, can you check that your time window is set correctly? A POST to the Collector should be an event pushed to the raw topic, so these should be very close.

For the others I am not sure - before we dig too deeply into it, could you share the exact queries you are issuing against GCP Monitoring, just to check there is no bug in the query?

Yes, the LB request_count time window is set for a day, and below is the query I set.

fetch https_lb_rule
| metric 'loadbalancing.googleapis.com/https/request_count'
| filter
    (resource.url_map_name
     == 'k8s2-um-xx-snowplow-collector-ingress-xxx')
    && (metric.response_code_class >= 200 && metric.response_code_class < 400)
| group_by 1d, [row_count: row_count()]
| every 1d
| group_by [metric.response_code], [row_count: row_count()]

For all the other topics, below is the query used.

fetch pubsub_topic
| metric 'pubsub.googleapis.com/topic/send_request_count'
| filter (resource.topic_id == 'collector-prod-raw-good')
| group_by 1d,
    [value_send_request_count_aggregate: aggregate(value.send_request_count)]
| every 1d
| group_by [resource.topic_id],
    [value_send_request_count_aggregate_aggregate:
       aggregate(value_send_request_count_aggregate)]

Hi @ashish_george, I am wondering if you are not using the correct aligner and reducer in your query to GCP Monitoring.

The query I have for the Collector LB:

{
  "dataSets": [
    {
      "timeSeriesFilter": {
        "filter": "metric.type=\"loadbalancing.googleapis.com/https/request_count\" resource.type=\"https_lb_rule\" metric.label.\"response_code_class\">=\"200\" metric.label.\"response_code_class\"<\"400\" resource.label.\"url_map_name\"=\"k8s-um-REDACTED\" resource.label.\"project_id\"=\"REDACTED\"",
        "minAlignmentPeriod": "86400s",
        "aggregations": [
          {
            "perSeriesAligner": "ALIGN_SUM",
            "crossSeriesReducer": "REDUCE_SUM",
            "alignmentPeriod": "86400s",
            "groupByFields": []
          },
          {
            "crossSeriesReducer": "REDUCE_NONE",
            "alignmentPeriod": "60s",
            "groupByFields": []
          }
        ]
      },
      "targetAxis": "Y1",
      "plotType": "LINE"
    }
  ],
  "options": {
    "mode": "COLOR"
  },
  "constantLines": [],
  "timeshiftDuration": "0s",
  "y1Axis": {
    "label": "y1Axis",
    "scale": "LINEAR"
  }
}

By default this metric is a “rate”, so you need to align as a SUM and reduce as a SUM to get an actual count of requests out of it.
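
If you are working in the MQL editor rather than the dashboard JSON, a rough MQL sketch of the same ALIGN_SUM / REDUCE_SUM idea would be something along these lines (the URL map name here is just a placeholder):

fetch https_lb_rule
| metric 'loadbalancing.googleapis.com/https/request_count'
| filter
    resource.url_map_name == 'YOUR_URL_MAP_NAME'
    && (metric.response_code_class >= 200 && metric.response_code_class < 400)
| align delta(1d)
| every 1d
| group_by [], [request_count_sum: sum(value.request_count)]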

1 Like

I tried using your query, but it is throwing an error at line 2 (“dataSets”: [), so I referenced your query and modified mine as below.

fetch https_lb_rule
| metric 'loadbalancing.googleapis.com/https/request_count'
| filter
    resource.project_id == '12345678'
    &&
    (resource.url_map_name
     == 'k8s2-um-snowplow-newsid-prod-collector-ingres-hhyat')
    && (metric.response_code_class >= 200 && metric.response_code_class < 400)
| group_by 1d, [value_request_count_aggregate: aggregate(value.request_count)]
| every 1d
| group_by [],
    [value_request_count_aggregate_aggregate:
       aggregate(value_request_count_aggregate)]

And I changed my Pub/Sub queries for all the topics as well, as shown below.

fetch pubsub_topic
| metric 'pubsub.googleapis.com/topic/send_message_operation_count'
| filter (resource.topic_id == 'sp-prod-raw-good')
| group_by 1d,
    [value_send_message_operation_count_aggregate:
       aggregate(value.send_message_operation_count)]
| every 1d
| group_by [],
    [value_send_message_operation_count_aggregate_aggregate:
       aggregate(value_send_message_operation_count_aggregate)]

Below are the counts I got for all the topics after updating the queries.

Metric (GCP resource): count
Collector LB request_count (Load Balancer): 158.54M
Collector raw good (PubSub topic): 133.62M
Collector raw bad (PubSub topic): 0
Enriched good (PubSub topic): 138.01M
Enriched bad (PubSub topic): 4.03K
Stream Loader types (PubSub topic): 13.8M
good_events (BigQuery table): 136.83M

Again, how can the enriched good event count be greater than the Collector raw good event count?

Edit for posterity: While the below is true, it’s not likely relevant in this thread - after reading the context I realised that Josh’s explanations above are a better fit (as I clarify below with specifics). Leaving it here in case it does explain someone’s issue in future, but the specific issue here seems not to be this.

Again, how can the enriched good event count be greater than the Collector raw good event count?

The Collector raw stream contains payloads, each of which may or may not contain multiple events batched together.

The enriched good stream contains individual events. If you send a POST request containing 20 events to the Collector, the raw stream will have a count of 1 and the enriched stream will have a count of 20.

Thanks @Colm for your response. So these metrics make sense?

Looking at the context in this thread, I’m not sure my previous answer is actually helpful.

It is true that if you’re batching requests, you’d expect the raw count to be lower than the enriched count. But your numbers here are close enough that this isn’t likely the explanation (it depends on what batching settings you’ve set in the tracker). Apologies for that, I answered with a theory without looking at the context.

I actually think both of these things are explained well above in the thread.

Josh knows a lot more than me on this topic, but the numbers you have here seem to fit the interpretation that the under-the-hood sampling is responsible for the variance.

I would assume that the difference between the LB count and the rest might come down to what Josh explained earlier in the thread.

Everything else is close enough that sampling or metric lag reasonably explains the variance to me.

2 Likes

Thank you @Colm and @josh for helping out with this, really appreciate it, great help.

Any idea why the Collector raw bad event count could be 0? Since millions of events are hitting the Collector, it is hard to believe there is not even a single raw bad event.

2 Likes

It’s generally pretty hard to get bad rows emitted from the Collector unless you are doing so deliberately. You either need a seriously malformed querystring (one that Akka complains about parsing, which produces a generic_error) or a really large event (10 MB I think for PubSub, which produces a size_violation) that can’t be broken up into smaller individual events and sent to PubSub. In most pipelines both of these are reasonably rare occurrences.

2 Likes

Thanks @mike, as always.
So it’s safe to assume the Collector raw bad event count can be 0.
We have it all sorted out then.
Thanks @Colm @josh @mike for helping me out. Really appreciate it.

2 Likes