Today we stumbled upon this issue. Basically a client-side, self-describing event was fired with a field that contained the Unicode character null (\u0000).
To unblock the storage loader, we manually removed the character from the event and rerun it. It’s only the second time this has happened to us (last time was a year ago or so), but it could be quite distracting if someone started sending them on purpose.
We considered how to fix the issue properly and we came up with some options:
Force the tracker users to remove those chars before sending the event. It’s not ideal because the sanitizing code will be spread in many places.
Remove those chars in the tracker. We don’t like it much either, because it would require changing a lot of trackers.
Use the event schemas to invalidate events that contain this character. We don’t like it much because it would complicate most of the schemas. The whole event would also be discarded, which is not ideal.
Add some sanitizing code in scala-common-enrich that removes null characters for all self-describing events. Something like event.unstruct_event = sanitizeString(event.unstruct_event).
Of the alternatives we prefer the 4th one, because centralizes the logic in one place and doesn’t force the tracker or its users to deal with a DB-specific issue. On the other hand, the solution is quite blunt.
It’s funny you mention this as I ran into a similar issue this week - though instead of a few odd unicode characters it was a large number of them causing a load to fail. Unfortunately MAXERROR wouldn’t work so the only way was to remove the offending lines from the file.
I think you make a good argument for option 4 as keeping the logic centralised and therefore consistent across enrichers rather than the numerous trackers makes a lot of sense.
Hey @danisola - just thinking about this some more - as you said,
doesn’t force the tracker or its users to deal with a DB-specific issue
If it’s a DB-specific issue, we should probably add the sanitization code into Hadoop Shred, which is the DB-specific component, not Common Enrich, which is meant to be database-agnostic.
Hey @alex,
Sorry for the late reply, I was on holiday. To get around the issue I did a quick fix (hack?) in the shredder code that I wasn’t very proud of, here it is in case you want to use it: