BQ Mutator not adding column from custom schema

Running into some issues trying to test a custom unstruct event that we are building into our pipeline. I am pretty sure the schema is built OK, and I pushed it to my Iglu Server with the igluctl tool, but when I try to run the Mutator it always errors out with the same unclear message.

Any thoughts from someone more successful in adding custom events?

Here is my workflow:


Check iglu for my schema

$ curl <GCP.iglu.server.URL>:8080/api/schemas/com.acme_company/viewed_product/jsonschema/2-0-0 -X GET -H "apikey: ########-####-####-####-############" | json_pp
{
   "description" : "Schema for Custom Dimension Test",
   "type" : "object",
   "$schema" : "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
   "properties" : {
      "contentParagraphCount" : {
         "type" : "string"
      },
      "premiumStatus" : {
         "type" : "boolean"
      },
      "videoTitle" : {
         "type" : "string"
      },
      "contentType" : {
         "type" : "string"
      },
      "domain" : {
         "type" : "string"
      },
      "contentWordCount" : {
         "type" : "integer"
      },
      "contentAuthor" : {
         "type" : "string"
      },
      "premiumEndDate" : {
         "type" : "string"
      },
      "contentPubHour" : {
         "type" : "string"
      },
      "contentId" : {
         "type" : "string"
      },
      "contentPubDate" : {
         "type" : "string"
      },
      "contentLastModifiedDate" : {
         "type" : "string"
      }
   },
   "additionalProperties" : false,
   "self" : {
      "version" : "2-0-0",
      "format" : "jsonschema",
      "name" : "viewed_product",
      "vendor" : "com.acme_company"
   }
}

Looks good (AFAIK), but could it have a problem with the version being 2-0-0? I have an identical schema at 1-0-0, but I am using 2-0-0 to match the tracker the web dev built on the test website. I wouldn’t expect the version to make it crap out, though.
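
For what it’s worth, fetching each version directly is a quick way to rule that out (same endpoint as above, apikey redacted):

$ curl <GCP.iglu.server.URL>:8080/api/schemas/com.acme_company/viewed_product/jsonschema/1-0-0 -H "apikey: ########-####-####-####-############"
$ curl <GCP.iglu.server.URL>:8080/api/schemas/com.acme_company/viewed_product/jsonschema/2-0-0 -H "apikey: ########-####-####-####-############"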

Let’s try adding these fields to BQ from my local machine (Mac) with Docker Desktop running:

$ docker run \
    -v /snowplow/config:/snowplow/config \
    -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/config/key.json \
    snowplow/snowplow-bigquery-mutator:1.3.0-distroless \
    add-column \
    --config $(cat /snowplow/config/base64/sp-bigquery-streamloader-config_b64) \
    --resolver $(cat /snowplow/config/base64/iglu_resolver_b64) \
    --shred-property=CONTEXTS \
    --schema="iglu:com.acme_company/viewed_product/jsonschema/2-0-0"                                                                                 
[ioapp-compute-0] INFO com.snowplowanalytics.snowplow.storage.bigquery.mutator.Main - load_tstamp already exists
com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator$MutatorError
	at com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator$.com$snowplowanalytics$snowplow$storage$bigquery$mutator$Mutator$$fetchError(Mutator.scala:167)
	at com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator.$anonfun$getSchema$1(Mutator.scala:105)
	at cats.data.EitherT.$anonfun$bimap$1(EitherT.scala:383)
	at timeout @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator.getSchema(Mutator.scala:105)
	at map @ com.snowplowanalytics.iglu.client.Client$.parseDefault(Client.scala:58)
	at flatMap @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator$.initialize(Mutator.scala:157)
	at flatMap @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator.getSchema(Mutator.scala:110)
	at flatMap @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator.addField(Mutator.scala:73)
	at map @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.Main$.$anonfun$run$24(Main.scala:66)
	at delay @ org.typelevel.log4cats.slf4j.internal.Slf4jLoggerInternal$Slf4jLogger.$anonfun$info$4(Slf4jLoggerInternal.scala:91)
	at delay @ org.typelevel.log4cats.slf4j.internal.Slf4jLoggerInternal$Slf4jLogger.isInfoEnabled(Slf4jLoggerInternal.scala:66)
	at ifM$extension @ org.typelevel.log4cats.slf4j.internal.Slf4jLoggerInternal$Slf4jLogger.info(Slf4jLoggerInternal.scala:91)
	at apply @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.TableReference$BigQueryTable.getTable(TableReference.scala:39)
	at flatMap @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.TableReference$BigQueryTable.getTable(TableReference.scala:39)
	at map @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.TableReference$BigQueryTable.getFields(TableReference.scala:45)
	at flatMap @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator.addField(Mutator.scala:84)
	at liftF @ com.snowplowanalytics.iglu.client.resolver.Resolver$.$anonfun$parse$3(Resolver.scala:269)
	at map @ com.snowplowanalytics.iglu.client.Client$.parseDefault(Client.scala:58)
	at map @ com.snowplowanalytics.iglu.client.Client$.parseDefault(Client.scala:58)

Hmm, no good.
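
One sanity check worth noting, in case it helps anyone else debugging the same thing: fetch the schema through the exact uri from my resolver config (placeholder address below), rather than the known-good :8080 address above, since that is the connection the mutator actually uses:

$ curl http://<GCP.load.balancer.IP>/api/schemas/com.acme_company/viewed_product/jsonschema/2-0-0 \
    -H "apikey: ########-####-####-####-############"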

I got a similar error when I typed the schema name wrong for a Snowplow-provided schema that I added for testing’s sake (ad_click, if you are wondering).

Let’s take a look at my iglu_resolver just in case (not sure it’s relevant, but…)

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },
      {
        "name": "Custom Iglu Server",
        "priority": 1,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "GCP.load.balancer.IP",
            "apikey": "APIKeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeey"
          }
        }
      }
    ]
  }
}

I have tried the uri with /api appended, and it always gives the same error.
I have not been able to try :8080/api because of the illegal-character error when the config containing the colon gets decoded.
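
For reference, this is roughly how I am producing the base64 configs that get passed to the mutator (macOS base64 syntax; the paths are just my local layout):

$ base64 -i /snowplow/config/iglu_resolver.json | tr -d '\n' > /snowplow/config/base64/iglu_resolver_b64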

Your schema itself looks fine - this feels like an error from the Mutator indicating that it isn’t able to fetch the schema from your Iglu Server (which kind of makes sense if the uri is just the GCP load balancer address). I believe that Iglu configuration files should be able to take colons without any issues, e.g., http://load-balancer-ip:8080/api - it sounds like you’ve tried this, but it would be good to see the exact error message to determine whether this is the issue.

As it is, I think just the load balancer IP is going to default to port 80, and since your Iglu Server is running on 8080, it’s probably going to error (though that error could be much clearer). If the colon doesn’t decode, that should probably get fixed anyway, but I’d be tempted to just remap your load balancer frontend so that port 80 / 443 maps to port 8080 on your Iglu Server, which saves the need for the colon altogether.
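
A quick way to test the port theory before remapping anything, assuming the quickstart defaults: Iglu Server exposes a health endpoint (at /api/meta/health, if memory serves), so the first of these should hang or be refused while the second should return OK:

$ curl http://<load-balancer-ip>/api/meta/health
$ curl http://<load-balancer-ip>:8080/api/meta/health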


Hi @cole, just to confirm what Mike said, this error looks like an issue with fetching the schema. If you look closely at the error being thrown, it comes from this method call:

com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator$.com$snowplowanalytics$snowplow$storage$bigquery$mutator$Mutator$$fetchError(Mutator.scala:167)

i.e. fetchError, as opposed to invalidSchema.

It’s a very unhelpful error message, but hopefully understanding what it says can help you get closer to a solution.

As you’ve discovered, one reason the schema might fail to be fetched from the server is a typo, though that doesn’t look like the case here. More likely it’s an issue with connecting to, or authenticating with, the Iglu Server. Could it be something silly, like not using https in the server uri? Or a missing port mapping, along the lines of what Mike has suggested?
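
One way to test the authentication angle, if I remember the behaviour correctly: private schemas are hidden rather than rejected, so fetching yours without the apikey header should return a 404 if auth is working, and the full schema if it has been made public:

$ curl <GCP.iglu.server.URL>:8080/api/schemas/com.acme_company/viewed_product/jsonschema/2-0-0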


The colon decoding issue somehow fixed itself, but I still couldn’t get it to fetch the schema from my Iglu Server. I ended up using a static repo in GCS for my custom schemas instead. But this makes me wonder about the whole http vs https possibility.
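
For anyone else going the static route: the repo is just the standard Iglu folder layout served from a bucket, so you can verify it with a plain GET (bucket name is a placeholder; no apikey needed since mine is public):

$ curl https://storage.googleapis.com/<my-schemas-bucket>/schemas/com.acme_company/viewed_product/jsonschema/2-0-0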

@dilyan I didn’t think I had the option to set up the servers, or more specifically the load balancers, with the https protocol in the GCP quickstart Terraform.

And @mike, you are making me think I did, given your reference to mapping port 443 of the load balancer to port 8080 of the Iglu Server.

The Terraform script set up both load balancers (collector and Iglu) with just the http protocol, and this has been a thorn in my side with the collector, since the sites I am trying to track are served over https and the browsers all refuse to send requests to http://<collector URL>.

Regarding the Iglu Server config, the Iglu load balancer seems to have the correct backend service selected, with the correct port, so I would think that they are mapped accordingly.

Caveat: I am a data guy, not a web guy, and as always, I am probably missing something obvious.

Ahh, I think the custom domain part is what I missed in the quickstart setup: a custom domain with the https protocol applied to the load balancer, plus a Google-managed certificate.
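
Once the managed certificate is active, the health check from earlier should work over https with no port juggling at all (domain is a placeholder):

$ curl https://iglu.<my-domain>/api/meta/health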
