Stream Collector is passing empty message body to PubSub Enrich

Hi everyone! I am experiencing a strange error in my GCP pipeline, which consists of the Scala Stream Collector, PubSub Enrich, and the GCS Loader. Messages that pass validation in Snowplow Micro (with the same Iglu resolver config) are failing in PubSub Enrich with an “enriched bad” message stating that the collector is passing messages to Enrich with an empty body/querystring. Here is the JSON of the bad row in the “enriched bad” stream:

{
  "schema":"iglu:com.snowplowanalytics.snowplow.badrows/tracker_protocol_violations/jsonschema/1-0-0",
  "data":{
    "processor":{
      "artifact":"snowplow-enrich-pubsub",
      "version":"2.0.1"
    },
    "failure":{
      "timestamp":"2021-09-17T01:06:50.549928Z",
      "vendor":"com.snowplowanalytics.snowplow",
      "version":"tp2",
      "messages":[
        {
          "field":"body",
          "value":null,
          "expectation":"empty body: not a valid tracker protocol event"
        },
        {
          "field":"querystring",
          "value":null,
          "expectation":"empty querystring: not a valid tracker protocol event"
        }
      ]
    },
    "payload":{
      "vendor":"com.snowplowanalytics.snowplow",
      "version":"tp2",
      "querystring":[
        
      ],
      "contentType":null,
      "body":null,
      "collector":"ssc-2.3.0-googlepubsub",
      "encoding":"UTF-8",
      "hostname":"sp.palmetto.com",
      "timestamp":"2021-09-17T01:06:38.957Z",
      "ipAddress":"**.***.**.**",
      "useragent":"Go-http-client/2.0",
      "refererUri":"http://sp.palmetto.com/com.snowplowanalytics.snowplow/tp2",
      "headers":[
        "Timeout-Access: <function1>",
        "Host: sp.palmetto.com",
        "Referer: http://sp.palmetto.com/com.snowplowanalytics.snowplow/tp2",
        "User-Agent: Go-http-client/2.0",
        "x-cloud-trace-context: cc302d6a5a5dd8776a07e46cc56b42b3/8833614303760384032",
        "traceparent: 00-cc302d6a5a5dd8776a07e46cc56b42b3-7a974d6822728020-00",
        "X-Forwarded-For: 72.219.70.50",
        "X-Forwarded-Proto: https",
        "forwarded: for=\"72.219.70.50\";proto=https",
        "Accept-Encoding: gzip"
      ],
      "networkUserId":"01eab90b-eb02-45bb-8e84-e40ee9ecd451"
    }
  }
}

I previously saw this error when we first deployed the pipeline, and it seemed to resolve itself after we expanded the max-uri-length setting back to 32768 in the collector config. We recently re-deployed in a new GCP region without any configuration changes, so I was surprised to see this error again.
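For reference, max-uri-length is the standard akka-http server parsing option, which the collector's HOCON config can override. A sketch with the value mentioned above (the surrounding structure is the usual akka-http layout, not copied from our actual config):

```
# Assumed shape of the override; only max-uri-length and its
# value (32768) come from this thread.
akka {
  http {
    server {
      parsing {
        max-uri-length = 32768
      }
    }
  }
}
```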

Does anyone have any idea why this issue might be occurring?

Are you able to share the request from the tracker that seems to be generating this, so we can try to replicate it?

The querystring (or body) of the request needs to be present for Snowplow to process the event, which is why this error is being raised. So either the collector is failing to process the event for some reason, or the data isn’t being sent to the collector in the first place.

Hi mike, yes this is the full tracking call, sent with the snowplow-tracking-cli:

$ snowplow-tracking-cli --collector="sp.palmetto.com" --appid='testing' --method=POST --protocol=http --sdjson='{"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.palmetto/test_schema/jsonschema/1-0-0","data":{}}}'
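For what it's worth, that CLI call should translate into roughly the following raw tp2 POST. This is a sketch based on the tracker protocol (the payload_data wrapper and the e=ue / aid / ue_pr fields are my assumptions about what the CLI emits, not a capture of the actual request); building the body locally lets you verify it is well-formed, and then you can send it by hand with curl to see whether the collector still produces an empty-body bad row:

```shell
# The self-describing event JSON passed to --sdjson above.
SDJSON='{"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.palmetto/test_schema/jsonschema/1-0-0","data":{}}}'

# Sketch of the tp2 POST body: a payload_data self-describing JSON
# whose data array holds tracker-protocol maps; e=ue marks a
# self-describing (unstructured) event, ue_pr carries its JSON.
BODY=$(python3 - "$SDJSON" <<'PY'
import json, sys
print(json.dumps({
    "schema": "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4",
    "data": [{"e": "ue", "aid": "testing", "ue_pr": sys.argv[1]}],
}))
PY
)
echo "$BODY"

# To send it by hand (collector host and tp2 path taken from the
# bad row above; uncomment to actually hit the collector):
# curl -v "https://sp.palmetto.com/com.snowplowanalytics.snowplow/tp2" \
#   -H 'Content-Type: application/json; charset=utf-8' \
#   --data "$BODY"
```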

The schema test_schema is registered in our schema registry, which has been added to the Iglu resolver configuration file used by the Stream Collector. The collector is deployed on Google Cloud Run with the Snowplow-provided Docker image, scala-stream-collector-pubsub:2.3.0.

I don’t quite understand your statement. The data could only be processed by PubSub Enrich if an event was successfully collected and written by the Stream Collector to our collected-good stream, which PubSub Enrich reads from.

I am aware that the body needs to be present. My question is: does anyone know why the collector might fail to pass along the body or querystring in the first place?

The data is making it to the collector, and it will still be passed on to the enricher even if the querystring / body is missing. There doesn’t look to be anything wrong with that tracking-cli call, though I’d be tempted to test it over HTTPS, since that’s how the collector has been set up.

As this is a POST request we expect the query string to be null, which is fine, but the body shouldn’t be getting dropped, and the body isn’t subject to max-uri-length constraints.

Is there anything in front of Cloud Run doing any load balancing or forwarding? The only thing I can think of is that something is not forwarding the full payload of the request on to Cloud Run.
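One way to narrow this down (a sketch, not something from the thread): run a throwaway local listener that echoes back whatever POST body reaches it, and point the tracking-cli at that instead of the collector. If the body arrives intact locally, the client is fine and the loss is somewhere in the Cloud Run path; if not, the CLI call itself is at fault. The demo request below stands in for the CLI, and port 8765 is arbitrary:

```shell
# One-shot local listener: reads the POST body and echoes it back,
# so the client can verify nothing was dropped in transit.
python3 - <<'PY' &
from http.server import BaseHTTPRequestHandler, HTTPServer

class Echo(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)  # echo the body straight back
    def log_message(self, *args):
        pass  # keep request logging quiet

HTTPServer(("127.0.0.1", 8765), Echo).handle_request()  # serve one request, then exit
PY
sleep 1  # give the listener a moment to bind

# Demo request; in a real test, point the tracking-cli here instead:
#   snowplow-tracking-cli --collector="127.0.0.1:8765" --protocol=http ...
ECHOED=$(python3 -c 'import urllib.request; print(urllib.request.urlopen("http://127.0.0.1:8765", data=b"{\"hello\":\"collector\"}").read().decode())')
echo "listener saw: $ECHOED"
wait
```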


There is nothing in front of Cloud Run. We are using Cloud Run as our compute service rather than Compute Engine instances because it manages load balancing automatically, and since we have a very low-volume pipeline, an instance only spins up when a message comes in on the PubSub topic that the Cloud Run service is listening on.

To emphasize what I said in my original post: we had the full pipeline running perfectly until we redeployed, with the only change being the new GCP region (originally us-east1, now us-central1). It seems our original fix of expanding max-uri-length was a fluke, and now we have absolutely no clue why an old bug has come back to haunt a previously perfectly functioning pipeline.