Configuring RowDecodingError behaviour for RDB Loader

Hello everyone,

We have recently switched from the Spark-based Snowflake Loader to the RDB Loader for loading events into Snowflake. I have a question regarding the RDB Stream Transformer’s behaviour when processing rows which contain a value that is longer than the defined size for a field.

The RDB Loader writes these rows into an output=bad sub-directory that contains something along the lines of:

{
   "schema":"iglu:com.snowplowanalytics.snowplow.badrows/loader_parsing_error/jsonschema/2-0-0",
   "data":{
      "processor":{
         "artifact":"snowplow-transformer-kinesis",
         "version":"4.2.1"
      },
      "failure":{
         "type":"RowDecodingError",
         "errors":[
            {
               "type":"InvalidValue",
               "key":"page_referrer",
               "value”:”https://some-really-long-url”,
               "message":"Field page_referrer longer than maximum allowed size 4096"
            }
         ]
      },
      "payload”:”xxx”   }
}

However, the Snowflake Loader simply shortens the field in question to whatever the maximum allowed size for that field is and writes it to Snowflake regardless.
My question is whether this behaviour is in any way configurable, as we would prefer to still write this data (albeit in a broken form) to our DB rather than not load it at all.

Thank you in advance!

Hi @eeno,

It’s good to hear you’ve switched over to the RDB loader, because that is where our development effort will go from now on, rather than into the old loader.

I’m afraid it is not possible to configure the transformer/loader to trim fields if they are too long. This was a deliberate design decision at the time of writing the RDB loader: with the old loader we never liked how data could be trimmed silently, without any warning that some data had been mutated. We figured it was better to generate a failed event so that at least there is visibility over the state of the original data.

I’m curious: what is a typical size of the page_referrer field for you? I am open to revisiting the maximum allowed lengths. However, the cutoff must be fixed somewhere, and sooner or later some events will exceed whatever maximum we choose.

I understand that you want your data to be loaded instead of being failed. My best suggestion is to use the JavaScript enrichment so you can choose for yourself how to handle over-sized fields. For example, you might choose to trim the page_referrer field, or you might choose to set it to null. Here is an example JavaScript enrichment script:

function process(event) {
  var referrer = event.getPage_referrer();
  if (referrer && referrer.length > 4096) {
    // You choose what to do here! For example, trim to the maximum allowed size...
    referrer = referrer.substring(0, 4096);
    // ...or drop the value entirely instead: referrer = null;
    event.setPage_referrer(referrer);
  }
  return [];
}
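
For completeness, here is a rough sketch of the enrichment configuration that wraps the script above. The script body goes base64-encoded into the script parameter; the value below is just a placeholder, and it's worth double-checking the schema version against the JavaScript enrichment docs:

{
  "schema": "iglu:com.snowplowanalytics.snowplow/javascript_script_config/jsonschema/1-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "javascript_script_config",
    "enabled": true,
    "parameters": {
      "script": "<base64-encoded copy of the script above>"
    }
  }
}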

I know this is extra work for you to set up. But I think it is better that pipeline operators can opt in to trimming/removing just the fields they want, rather than have the loader mutate fields without warning. I would be interested, though, to hear your opinion on this.

This is a good solution - particularly if you can parse the URL and add the components to a context, as you’ll get far better performance out of Snowflake that way than by parsing the URL and extracting components at query time.

Given this is a referrer URL, I’d also be tempted to see if you can parse it before sending the event and add it to a context. That way you offload the (minimal) processing to the client, and you won’t necessarily need the JavaScript enrichment if you can truncate and filter this early.
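
If you do go the context route, here is a rough sketch of what that might look like with the v3 JavaScript tracker. The referrer_components schema and its fields are hypothetical placeholders for an entity you would define and host yourself:

// Parse the referrer on the client and attach the pieces you care about as a custom context.
// iglu:com.acme/referrer_components/jsonschema/1-0-0 is a made-up schema URI for illustration.
const ref = document.referrer ? new URL(document.referrer) : null;

window.snowplow('trackPageView', {
  context: ref ? [{
    schema: 'iglu:com.acme/referrer_components/jsonschema/1-0-0',
    data: {
      host: ref.hostname,
      path: ref.pathname.substring(0, 4096)
    }
  }] : []
});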

As Mike suggests, you could do this with the JS tracker if long referrer URLs are something you see regularly.

// Strip the query string, which is usually what pushes referrer URLs over the limit
const getPath = (url) => {
  return url.split("?")[0];
};

const referrer = document.referrer;

if (referrer.length > 4096) {
  const shortReferrer = getPath(referrer);
  // Call this before trackPageView so the shortened referrer is sent with the event
  window.snowplow('setReferrerUrl', shortReferrer);
}