I am using Snowplow version 75. During enrichment, a huge number of raw logs are going into the bad rows because the Netaporter URI library is unable to parse the event: "Illegal character in fragment" at a particular index. Could you please suggest ways to prevent this loss of data?
These rows fail because the URI contains an illegal character. The best way to prevent data loss is to fix the actual URIs and remove (or percent-encode) the illegal character.
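For example (the URL and fragment below are purely illustrative), a raw `|` in a fragment is rejected by the URI parser, while the percent-encoded form parses fine:

```javascript
// Illustrative only: a raw "|" in the fragment triggers the
// "Illegal character in fragment" error, the encoded form does not.
var bad  = 'https://example.com/page#section|promo';
var good = 'https://example.com/page#' + encodeURIComponent('section|promo');
// good === 'https://example.com/page#section%7Cpromo'
```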
In some situations, fixing the actual URIs is not possible. We have a number of users who track events across their clients' websites rather than their own. (Ad networks are an obvious example.) For these users, changing the URIs is not practical.
In the mid term our intention is to break URI parsing out into its own enrichment. It will then be possible to configure it to return empty values for e.g. the different page_ fields, rather than invalidating the event as a whole.
In the meantime, we’ve seen some users add JavaScript around the tracker to detect whether a URI looks problematic and, if so, manually set it to a safe (dummy) value using the setCustomUrl method documented here. A rough sketch of that approach is below.
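This is not official Snowplow code, just an illustration. It assumes the usual `snowplow` queue function created by the JavaScript tracker loader snippet; the regex and the fallback value are placeholders to adapt to your own pipeline:

```javascript
// Rough sketch: override the page URL before tracking if it looks unparseable.
var currentUrl = window.location.href;

// Characters outside the RFC 3986 allowed set (spaces, "|", "{", "}", etc.)
// are what typically trigger the "Illegal character" enrichment failures.
var looksProblematic = /[ "<>\\^`{|}]|[^\x21-\x7E]/.test(currentUrl);

if (looksProblematic) {
  // Either percent-encode the current URL...
  snowplow('setCustomUrl', encodeURI(currentUrl));
  // ...or set a fixed dummy value instead:
  // snowplow('setCustomUrl', 'https://example.com/unparseable-url');
}

snowplow('trackPageView');
```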
We’re also working on improving our technology around reprocessing bad rows. The aim is to make it straightforward to take a batch of bad rows, identify why they failed (using the error messages included in each bad row), apply transformations to the data to address the issue, and then safely reprocess them. For this particular error, that workflow is sketched below. More details to follow as this is built out.
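The following is only a sketch, not a supported tool: it assumes bad rows are newline-delimited JSON with a "line" field holding the original raw event and an "errors" array, and that the raw event is tab-separated with the page URL at a known column (COLUMN_INDEX is a placeholder to adapt to your own collector format):

```javascript
// Rough sketch: filter bad rows that failed on the illegal-character URI error,
// percent-encode the offending URL, and write the repaired raw lines back out
// so they can be fed through enrichment again.
var fs = require('fs');
var readline = require('readline');

var COLUMN_INDEX = 9; // placeholder: position of the page URL in your raw format

var input  = readline.createInterface({ input: fs.createReadStream('bad_rows.ndjson') });
var output = fs.createWriteStream('repaired_raw_events.txt');

input.on('line', function (line) {
  var badRow = JSON.parse(line);

  // Only pick up rows that failed on the illegal-character URI error.
  var illegalChar = (badRow.errors || []).some(function (err) {
    var message = typeof err === 'string' ? err : err.message;
    return /Illegal character/.test(message || '');
  });
  if (!illegalChar) return;

  var fields = badRow.line.split('\t');
  if (fields.length <= COLUMN_INDEX) return;

  fields[COLUMN_INDEX] = encodeURI(fields[COLUMN_INDEX]);
  output.write(fields.join('\t') + '\n');
});
```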
@christophe, @yali, resurrecting a reasonably old thread as we are working around this issue ourselves for a large customer. We are taking both approaches (extracting bad rows and fixing invalid URLs for reprocessing, and improving our JavaScript tracker to override invalid URLs before they are collected), but we’re still interested in the latest thinking and any features that could help with invalid URIs that we might not be aware of.