Redshift loading error: null byte - field longer than 1 byte

danisola · January 11, 2017, 4:24pm

Hi Snowplowers!

Today we stumbled upon this issue. Basically a client-side, self-describing event was fired with a field that contained the Unicode character null (\u0000).

To unblock the storage loader, we manually removed the character from the event and reran it. It’s only the second time this has happened to us (last time was a year ago or so), but it could be quite distracting if someone started sending them on purpose.

We considered how to fix the issue properly and we came up with some options:

1. Force the tracker users to remove those chars before sending the event. It’s not ideal because the sanitizing code would be spread across many places.
2. Remove those chars in the tracker. We don’t like it much either, because it would require changing a lot of trackers.
3. Use the event schemas to invalidate events that contain this character. We don’t like it much because it would complicate most of the schemas. The whole event would also be discarded, which is not ideal.
4. Add some sanitizing code in scala-common-enrich that removes null characters for all self-describing events. Something like event.unstruct_event = sanitizeString(event.unstruct_event).

Of the alternatives we prefer the 4th one, because it centralizes the logic in one place and doesn’t force the tracker or its users to deal with a DB-specific issue. On the other hand, the solution is quite blunt.

What do you think? Maybe you have other options?

Thanks,
Dani

alex · January 11, 2017, 4:43pm

Thanks for sharing @danisola - and for tracking down the exact issue. What do Snowplow users reckon about the various options?

mike · January 11, 2017, 9:31pm

It’s funny you mention this as I ran into a similar issue this week, though instead of a few odd Unicode characters it was a large number of them causing a load to fail. Unfortunately MAXERROR wouldn’t work, so the only way was to remove the offending lines from the file.
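For anyone doing the same manual clean-up, here is a minimal Scala sketch of separating offending lines from clean ones. It is only an illustration: `NullByteFilter`, `isOffending` and `split` are hypothetical names, not part of any Snowplow component, and whether the raw character or the escaped `\u0000` sequence shows up depends on the file format, so the sketch checks for both.

```scala
object NullByteFilter {
  // The raw NUL character (U+0000), built from a Char to avoid escape pitfalls in source.
  val Nul: String = 0.toChar.toString

  // The six-character JSON escape for the same character: a backslash followed by "u0000".
  val EscapedNul: String = "\\" + "u0000"

  // True if the line contains either form of the null character.
  def isOffending(line: String): Boolean =
    line.contains(Nul) || line.contains(EscapedNul)

  // Returns (clean lines, offending lines) so the offending ones can be inspected or reloaded later.
  def split(lines: Seq[String]): (Seq[String], Seq[String]) = {
    val (bad, good) = lines.partition(isOffending)
    (good, bad)
  }
}
```

Keeping the offending lines in a separate collection, rather than dropping them, leaves the option of fixing and reloading them afterwards.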

I think you make a good argument for option 4 as keeping the logic centralised and therefore consistent across enrichers rather than the numerous trackers makes a lot of sense.

danisola · January 18, 2017, 5:15pm

What do you think, should I create a PR with the 4th option, @alex?

alex · January 18, 2017, 7:13pm

Hey @danisola - just thinking about this some more - as you said,

doesn’t force the tracker or its users to deal with a DB-specific issue

If it’s a DB-specific issue, we should probably add the sanitization code into Hadoop Shred, which is the DB-specific component, not Common Enrich, which is meant to be database-agnostic.

danisola · January 19, 2017, 8:53am

Makes sense, I will open a PR soon.

Thanks everyone!

alex · January 19, 2017, 5:34pm

Thanks @danisola!

alex · November 9, 2017, 2:51pm

Hi @danisola - did you manage to get round to this? I’ve created a ticket here:

danisola · November 14, 2017, 9:16am

Hey @alex,
Sorry for the late reply, I was on holiday. To get around the issue I did a quick fix (hack?) in the shredder code that I wasn’t very proud of. Here it is in case you want to use it:

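// Quick fix: strips the escaped null character from the self-describing event.
// Mutates the event in place and returns it for convenience.
def sanitizeUnstructEvent(event: EnrichedEvent): EnrichedEvent = {
  // "\\u0000" is the six-character escape sequence (backslash, 'u', four zeros),
  // not the raw NUL character itself.
  def sanitizeString(value: String): String =
    if (value != null) value.replace("\\u0000", "")
    else value

  event.unstruct_event = sanitizeString(event.unstruct_event)
  event
}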
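One caveat with a plain `replace` is that it would also hit an escaped backslash followed by the literal text `u0000` (written `\\u0000` in JSON), leaving a dangling backslash behind. A slightly more careful variant might look like the sketch below. This is an illustration under assumptions, not the code that went into the shredder: `JsonNulSanitizer` and `stripNul` are hypothetical names, and the function works on a plain string rather than on an `EnrichedEvent`.

```scala
object JsonNulSanitizer {
  private val Nul: Char = 0.toChar

  // Removes the raw NUL character and the escaped form (backslash + "u0000") from JSON text,
  // while leaving every other backslash escape pair untouched.
  def stripNul(json: String): String =
    if (json == null) null
    else {
      val out = new StringBuilder(json.length)
      var i = 0
      while (i < json.length) {
        val c = json.charAt(i)
        if (c == '\\' && i + 1 < json.length) {
          if (json.startsWith("u0000", i + 1)) {
            i += 6 // drop the whole six-character escape sequence
          } else {
            // Copy the backslash together with the character it escapes, so an
            // escaped backslash is never mistaken for the start of a new escape.
            out.append(c).append(json.charAt(i + 1))
            i += 2
          }
        } else {
          if (c != Nul) out.append(c) // drop raw NUL characters, keep everything else
          i += 1
        }
      }
      out.toString
    }
}
```

With something like this, the hack above could call `JsonNulSanitizer.stripNul(event.unstruct_event)` instead of `sanitizeString`, assuming the field holds the serialized JSON of the event.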

alex · November 14, 2017, 8:22pm

Thanks @danisola! I’ve updated the GH issue.
