Mutator Exception Adding Custom Schema via GKE Mutator pod

Hi everyone, I have hosted all Snowplow components on GCP GKE and I've been trying to add a new schema via the mutator.

I created a schema and submitted it. Then I ran:

./bin/snowplow-bigquery-mutator add-column \
--config $LOADER_CONFIG \
--resolver $RESOLVER \
--shred-property CONTEXTS \
--schema iglu:com.iab.snowplow/spiders_and_robots/jsonschema/1-0-0

Exception:

[ioapp-compute-0] INFO com.snowplowanalytics.snowplow.storage.bigquery.mutator.Main - load_tstamp already exists
com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator$MutatorError
        at com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator$.com$snowplowanalytics$snowplow$storage$bigquery$mutator$Mutator$$fetchError(Mutator.scala:167)
        at com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator.$anonfun$getSchema$1(Mutator.scala:105)
        at cats.data.EitherT.$anonfun$bimap$1(EitherT.scala:383)
        at timeout @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator.getSchema(Mutator.scala:105)
        at map @ com.snowplowanalytics.iglu.client.Client$.parseDefault(Client.scala:58)
        at flatMap @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator$.initialize(Mutator.scala:157)
        at flatMap @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator.getSchema(Mutator.scala:110)
        at flatMap @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator.addField(Mutator.scala:73)
        at map @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.Main$.$anonfun$run$24(Main.scala:66)
        at delay @ org.typelevel.log4cats.slf4j.internal.Slf4jLoggerInternal$Slf4jLogger.$anonfun$info$4(Slf4jLoggerInternal.scala:91)
        at delay @ org.typelevel.log4cats.slf4j.internal.Slf4jLoggerInternal$Slf4jLogger.isInfoEnabled(Slf4jLoggerInternal.scala:66)
        at ifM$extension @ org.typelevel.log4cats.slf4j.internal.Slf4jLoggerInternal$Slf4jLogger.info(Slf4jLoggerInternal.scala:91)
        at apply @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.TableReference$BigQueryTable.getTable(TableReference.scala:39)
        at flatMap @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.TableReference$BigQueryTable.getTable(TableReference.scala:39)
        at map @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.TableReference$BigQueryTable.getFields(TableReference.scala:45)
        at flatMap @ com.snowplowanalytics.snowplow.storage.bigquery.mutator.Mutator.addField(Mutator.scala:84)
        at liftF @ com.snowplowanalytics.iglu.client.resolver.Resolver$.$anonfun$parse$3(Resolver.scala:269)
        at map @ com.snowplowanalytics.iglu.client.Client$.parseDefault(Client.scala:58)
        at map @ com.snowplowanalytics.iglu.client.Client$.parseDefault(Client.scala:58)

My resolver config is:

{
    "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
    "data": {
      "cacheSize": 500,
      "repositories": [
        {
          "name": "ncg prod Iglu Repository",
          "priority": 10,
          "vendorPrefixes": [ "com.xyz" ],
          "connection": {
            "http": {
              "uri": "https://storage.googleapis.com/bucket-name"
            }
          }
        }
      ]
    }
  }
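
As a quick sanity check on that repository, the schema should be reachable at the path the resolver builds from that uri (static Iglu repositories serve schemas under schemas/<vendor>/<name>/<format>/<version>; the bucket name below is the placeholder from the config above):

curl https://storage.googleapis.com/bucket-name/schemas/com.iab.snowplow/spiders_and_robots/jsonschema/1-0-0

(If the schema only lives on Iglu Central rather than in this bucket, that repository would also need to appear in the resolver list.)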

I also tried running:

./bin/snowplow-bigquery-mutator listen --config $LOADER_CONFIG --resolver $RESOLVER --verbose

which gave this output:

[ioapp-compute-0] INFO com.snowplowanalytics.snowplow.storage.bigquery.mutator.Main - load_tstamp already exists
[ioapp-compute-0] INFO com.snowplowanalytics.snowplow.storage.bigquery.mutator.Main - Mutator is listening loader-types-sub PubSub subscription

Apologies, the schema URL that I provided to the mutator was wrong. After updating the schema URL, I was able to add the columns to the database successfully.

But now, even though the additional columns are in place, no new events are flowing into the good_events table in Google BigQuery.

I checked the enricher pod logs and could see that the new spiders and robots enrichment was downloaded and the enrich environment was initialized successfully.
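
For reference, the enricher logs can be tailed along these lines (the deployment and namespace names are placeholders for whatever the GKE setup uses):

kubectl logs deployment/snowplow-enrich -n snowplow --tail=200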

I'm not sure if I am missing any other configuration updates to incorporate this. @mike

If you are seeing events being successfully enriched (into the good PubSub topic) but not appearing in BigQuery, the next thing to check is the failed inserts topic (from the BQ loader) to determine whether it is failing to insert the records for some reason.
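
A quick way to inspect that topic is to pull a few messages from a subscription attached to it (the subscription name below is a placeholder for whatever is attached to the loader's failed inserts topic; pulling without acking leaves the messages in place):

gcloud pubsub subscriptions pull bq-failed-inserts-sub --limit=5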

@mike I pulled the message below from the enriched good Pub/Sub subscription.

"data":[{"schema":"iglu:com.iab.snowplow/spiders_and_robots/jsonschema/1-0-0","data":{"spiderOrRobot":false,"category":"BROWSER","reason":"PASSED_ALL","primaryImpact":"NONE"}}]}	18328200-550f-466a-85d5-d7b527085092	2022-07-04 12:43:07.690	com.snowplowanalytics.snowplow	page_view	jsonschema	1-0-0	b6b6d362d4902e7652acdbbc687691ab

I have only pasted a partial message above, just to show that the events seem to be enriched with the spiders_and_robots schema. The event also arrives in the BigQuery good_events table, but it is missing the values for the spiders_and_robots schema; all the other event-related values can be seen in the good_events table.

I had manually created the additional column via the mutator pod, using the spiders_and_robots JSON schema, with the command below. The new columns were reflected in BigQuery as well, but there is no data for the spiders_and_robots schema.

./bin/snowplow-bigquery-mutator add-column --config $LOADER_CONFIG --resolver $RESOLVER --shred-property CONTEXTS --schema iglu:com.iab.snowplow/spiders_and_robots/jsonschema/1-0-0
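
To confirm whether that column is actually getting populated, a query along these lines can be run (project and dataset names are placeholders; the column name assumes the loader's usual convention of snake-casing the vendor and schema name into a contexts_ column):

bq query --use_legacy_sql=false \
'SELECT contexts_com_iab_snowplow_spiders_and_robots_1_0_0
 FROM `my-project.my_dataset.good_events`
 WHERE collector_tstamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
 LIMIT 10'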

I hope using the mutator to add-column via a pod is the right way? I'm not sure if I missed anything here.

This should be fine. Are you seeing the rows in your BigQuery table without the column populated, or are you not seeing the rows at all? The BigQuery loader shouldn't partially load rows, so it's worth checking the failed inserts PubSub topic in your loader configuration.

@mike After changing the spiders and robots JSON schema
from

"additionalProperties": false

to

"additionalProperties": true

I can see events flowing in with values for the spiders and robots columns, which is great.
But I can also see a rise in the enriched-bad-subscription event count, possibly related to some other schema restrictions.
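
For reference, a rough sketch of where that flag sits in the self-describing schema (the property names are taken from the enriched event above; the rest is illustrative and not the exact Iglu Central definition):

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "self": {
    "vendor": "com.iab.snowplow",
    "name": "spiders_and_robots",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "spiderOrRobot": { "type": "boolean" },
    "category": { "type": "string" },
    "reason": { "type": "string" },
    "primaryImpact": { "type": "string" }
  },
  "additionalProperties": true
}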