I’ve updated a schema from 1-0-0 to 1-0-1 and realised this has sent millions of events into enriched_bad. One example is Googlebot page_views, of which we get around 30,000 to 100,000 per day.
I set up the new schema with minProperties set to 1, which seems to be the problem here: Googlebot uses some kind of cache and doesn’t send data against the 1-0-1 schema but still against 1-0-0. Even five days after my schema update, most of the Googlebot events still only carry 1-0-0 schema values.
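Just to make the minProperties point concrete, here’s a minimal sketch (Python with the jsonschema package, using a made-up property name rather than our real schema) of the kind of validation difference I mean:

```python
# Minimal sketch only: a made-up schema pair, not our production schema.
# An entity with no properties validates against the permissive 1-0-0
# version but fails the 1-0-1 version once minProperties is set to 1.
from jsonschema import validate, ValidationError

schema_1_0_0 = {"type": "object", "properties": {"page_type": {"type": "string"}}}
schema_1_0_1 = {**schema_1_0_0, "minProperties": 1}

empty_entity = {}  # an entity carrying no properties at all

validate(empty_entity, schema_1_0_0)       # passes: nothing is required
try:
    validate(empty_entity, schema_1_0_1)   # raises: minProperties not met
except ValidationError as err:
    print("would be rejected:", err.message)
```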
There are also regular user events mixed in, plus Facebook in-app browser users, as FB seems to have some odd caching as well (we’ve had numerous technical problems with it already).
How could I avoid this for future schema updates? If I try to enforce strict schemas, for example by setting minProperties, will I always end up with a lot of events in the bad bucket?
Is there some logic that could, for example, check in the enrichment process whether there is data for 1-0-1 and, if not, fall back to 1-0-0 before sending the event to the bad bucket? I can see new problems coming up with that, though, one being how to handle null values if the new schema doesn’t allow nulls in the database fields…
There’s not really anything you can do to prevent caching problems like this, since they’re in the client and completely out of your or the trackers’ control. However, the situation will resolve itself over time, and you’ll see fewer and fewer events for the old schema coming through. Incidentally, this is exactly why schemas must be immutable and follow SchemaVer strictly.
In general, though, this is usually more of an inconvenience than a massive problem. You can think of it this way: if you have deployed your changes to the front end, then most of the events arriving now with the old schema weren’t generated now; they were generated before your change but failed to arrive successfully at the time. So it normally doesn’t mean your change didn’t take effect (as long as you’re actually serving the new page - if your CDN is caching heavily, for example, that might need to be addressed).
Is there some logic that could, for example, check in the enrichment process whether there is data for 1-0-1 and, if not, fall back to 1-0-0 before sending the event to the bad bucket?
This isn’t currently possible, and it would be a problematic thing to try to do in enrichment itself. What you can do, however, is recover the data using the event recovery tooling. If you wanted a more real-time solution, you could instrument a job that reads from the bad stream and carries out a similar task (taking care not to create an infinite feedback loop through enrich if you attempt this).
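To illustrate the shape such a job might take, here’s a rough sketch only (Python, with simplified, made-up bad-row fields and hard-coded schemas rather than the actual bad-row format, which depends on your pipeline version): it re-points an entity that failed against 1-0-1 back at 1-0-0 when the data still validates there, and tags each row so it only gets one recovery attempt and can’t loop forever.

```python
# Rough sketch only, not a drop-in recovery job. The bad-row field names
# ("payload", "recovery_attempts") and the hard-coded schemas are simplified
# placeholders; the real bad-row format depends on your pipeline version.
import json
from jsonschema import validate, ValidationError

SCHEMA_1_0_0 = {"type": "object"}                      # old, permissive version
SCHEMA_1_0_1 = {"type": "object", "minProperties": 1}  # new, stricter version

NEW_REF = "iglu:com.acme/page_view_context/jsonschema/1-0-1"  # hypothetical URIs
OLD_REF = "iglu:com.acme/page_view_context/jsonschema/1-0-0"

MAX_ATTEMPTS = 1  # guard against an infinite enrich -> bad -> recovery loop


def try_recover(bad_row_json: str):
    """Return a corrected row to re-submit, or None to leave it in bad."""
    bad_row = json.loads(bad_row_json)

    if bad_row.get("recovery_attempts", 0) >= MAX_ATTEMPTS:
        return None  # already attempted once; don't loop

    entity = bad_row["payload"]  # simplified: the self-describing entity that failed
    if entity.get("schema") != NEW_REF:
        return None  # only handle failures against the new 1-0-1 version

    data = entity.get("data", {})
    try:
        validate(data, SCHEMA_1_0_1)
        return None  # it validates after all; nothing to fix here
    except ValidationError:
        pass

    try:
        validate(data, SCHEMA_1_0_0)  # fall back: is it still valid 1-0-0 data?
    except ValidationError:
        return None  # genuinely bad data; keep it in the bad bucket

    entity["schema"] = OLD_REF  # re-point the entity at the old version
    bad_row["recovery_attempts"] = bad_row.get("recovery_attempts", 0) + 1
    return bad_row  # re-submit this (e.g. to the collector) for enrichment


if __name__ == "__main__":
    example = json.dumps({
        "payload": {"schema": NEW_REF, "data": {}},  # empty entity: fails minProperties
        "recovery_attempts": 0,
    })
    print(try_recover(example))
```

In a real implementation you’d read rows from your actual bad stream, resolve the schemas from your Iglu registry rather than hard-coding them, and re-submit the corrected payloads to the collector - the loop guard is the important part, since anything you re-submit can fail validation and land back in the bad stream.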
PS. A digression: I don’t know much about it, but isn’t Googlebot basically just Google’s web crawler used for indexing? I would’ve thought those events wouldn’t generally be useful - in fact, they’re normally filtered out of analysis anyway.