POSTed bad events, are they all dropped?

lookaflyingdonkey · September 18, 2017, 10:14pm

Looking to reprocess some bad events, and we are sending in batches of 50 from the collectors through a POST.

When there is 1 bad event in the batch, is the whole batch dropped?

I can see in the bad events the line includes them all, invalid or not, but just not sure if the ones that were good were actually passed on or not.

Cheers

mike · September 18, 2017, 10:39pm

It’ll only be the bad events in a payload that get dropped. If the payload that arrives at the collector contains 49 good events and 1 bad one, the good ones should flow through the entire pipeline as normal.

If you’re running event recovery, one thing to note is that because you’re reprocessing the original ‘raw’ payloads, which contain a mix of good and bad events, you may end up with duplicate events (there’s a note on this in the caveats section at the end of the page).

lookaflyingdonkey · September 19, 2017, 12:47am

Thanks Mike, that’s what I suspected.

Currently the error information we see isn’t really targeted towards payloads with multiple events, as there is no indication of which event in the payload is the bad one, or even whether the problem is in the event payload or a context.

Is the current recommendation to run every bad event through JSON Schema validation for its payload, event body and contexts to find where the issue is?
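
A first pass at narrowing this down, before full schema validation, might be to split the POST body into its individual events and check each self-describing part on its own. The sketch below assumes the raw POST body has already been recovered from the bad row as JSON with a top-level `data` array, and that the unstructured event and contexts arrive Base64-encoded in the `ue_px` and `cx` fields. The function names are hypothetical, and the check only confirms that each part decodes to JSON with `schema` and `data` keys; it does not validate against the schema itself.

```python
import base64
import json


def decode_self_describing(encoded):
    """Decode a URL-safe Base64 string into a self-describing JSON dict.

    Raises ValueError if it does not decode, does not parse as JSON,
    or lacks the "schema"/"data" keys.
    """
    # Trackers may strip Base64 padding, so restore it before decoding.
    padded = encoded + "=" * (-len(encoded) % 4)
    try:
        parsed = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError as exc:
        raise ValueError(f"does not decode to JSON: {exc}") from exc
    if not isinstance(parsed, dict) or "schema" not in parsed or "data" not in parsed:
        raise ValueError("missing 'schema' or 'data' key")
    return parsed


def find_problem_parts(post_body):
    """Return (event_index, field, reason) for every part that fails the check."""
    problems = []
    payload = json.loads(post_body)
    for index, event in enumerate(payload.get("data", [])):
        # Assumed field names: ue_px (unstructured event), cx (contexts).
        for field in ("ue_px", "cx"):
            if field not in event:
                continue
            try:
                decode_self_describing(event[field])
            except ValueError as exc:
                problems.append((index, field, str(exc)))
    return problems
```

The returned index says which event in the batch to look at, and the field says whether it is the event body or the contexts. Each decoded part could then be passed to a real JSON Schema validator together with the schema it names.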

Also what would happen if a collector payload had errors in multiple events? Would that end up as a single bad row in the bad folder, or would it have an entry for each bad event?

Cheers

alex · September 19, 2017, 9:05am

There would be an entry for each bad event.

That’s a fair point - we have some plans to evolve this in the future:

github.com/snowplow/snowplow issue: “Bad rows should be one line of data per event” (opened 02:53PM, 16 Feb 2016 UTC by yalisassoon; closed 02:50PM, 06 Jan 2020 UTC)

Currently, if a user is batching events so there are e.g. 5 events per POST request, and one of those events fails validation, then it looks like all 5 events are stored in a single bad row, even if 4 out of 5 of those events were successfully processed. If this is the case, it's going to make reprocessing the bad rows tricky, because there's a risk the 4 successfully processed events will be double loaded. (This won't happen if deduplication is set up, but will happen otherwise.) Is there a way that we can return just a single event? Or is that risky? (Because it means processing a raw bad row rather than logging it as is.) If so - we may just be better double processing and relying on the dedupe step to prevent the doubly processed data making it through.

lookaflyingdonkey · September 19, 2017, 3:06pm

Thanks Alex!

So for now, when reprocessing bad rows from POSTs with multiple events, we need to be super careful.

Would having a process where we look at the run identifier, pull out all event IDs that successfully made it through to the destination, and then exclude those from the bad process be a good idea?

We would likely run multiple bad processes over the top fixing different issues, so we would need to be able to pass multiple runs.
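
As a rough illustration of that idea, the sketch below filters the events in one recovered POST body against the event IDs already loaded by any number of runs. It assumes the POST body is JSON with a top-level `data` array and that each event carries a tracker-set `eid`; the function name and the shape of `loaded_ids_by_run` are hypothetical, and how the loaded IDs are pulled from the destination is left out.

```python
import json


def events_still_to_recover(post_body, loaded_ids_by_run):
    """Return events from a POST body whose eid was not loaded in any given run.

    loaded_ids_by_run maps a run identifier to the set of event IDs that
    reached the destination in that run.
    """
    # Union the IDs from every run so multiple recovery passes are covered.
    already_loaded = set().union(*loaded_ids_by_run.values())
    payload = json.loads(post_body)
    # Events without an eid cannot be matched, so they are kept.
    return [e for e in payload.get("data", []) if e.get("eid") not in already_loaded]
```

Passing the runs as a mapping keeps it easy to add another run's IDs after each recovery pass.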

Or do you have any other suggestions on how to process bad rows that are in this state? For example, we have a line that has 50 events and all 50 are bad, so we now have 50 bad lines that are exactly the same, and we have no easy way to break them apart.
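
One possible way to handle the identical rows is to collapse them before reprocessing, so each raw payload is recovered only once. The sketch below assumes each bad row is a JSON object on its own line with the raw payload in a `line` field; the function name is hypothetical.

```python
import json


def collapse_identical_lines(bad_row_lines):
    """Keep the first bad row seen for each distinct "line" value, in order."""
    seen = set()
    unique = []
    for raw in bad_row_lines:
        row = json.loads(raw)
        if row["line"] not in seen:
            seen.add(row["line"])
            unique.append(row)
    return unique
```

Note that this keeps only the first row's error details; if the error messages differ between otherwise identical rows, they would need to be merged rather than dropped.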

Cheers,
Dean

alex · September 19, 2017, 8:05pm

No worries Dean,

What’s the ultimate target of the data - is it Redshift or Postgres? If it is, you can potentially lean on the RDB Shredder to do the dedupe for you “for free”, because with cross-batch dedupe enabled, it won’t load events that it has already processed.

The process you describe, excluding event IDs that have already made it through to the destination, is exactly what the DynamoDB-powered cross-batch dedupe in RDB Shredder does.
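
To make the general idea concrete, here is a conceptual sketch of cross-batch deduplication, with an in-memory set standing in for a persistent store. It is not the RDB Shredder's actual logic or DynamoDB schema; the class, function and `event_id` key are hypothetical names chosen for illustration.

```python
class SeenEventStore:
    """In-memory stand-in for a persistent store of processed event IDs."""

    def __init__(self):
        self._seen = set()

    def put_if_absent(self, event_id):
        """Record event_id; return True only the first time it is seen."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        return True


def load_batch(events, store):
    """Return the events that neither an earlier batch nor this one has recorded."""
    return [e for e in events if store.put_if_absent(e["event_id"])]
```

Because the store outlives any single batch, an event that was loaded in the original run is skipped when a recovery run presents it again.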
