Only the bad events in a payload get dropped. If the payload that arrives at the collector contains 49 good events and 1 bad one, the good events should flow through the entire pipeline as normal.
If you’re running event recovery, note that because you’re reprocessing the original ‘raw’ payloads, which contain a mix of good and bad events, you may end up with duplicate events (there’s a note on this in the caveats section at the end of the page).
Currently the error information we see isn’t really targeted towards payloads with multiple events, as there is no indication of which event in the payload is the bad one, or even whether the problem is in the event payload or in a context.
Is the current recommendation to run every bad event through JSON Schema validation for its payload, event body and contexts to find where the issue is?
Also what would happen if a collector payload had errors in multiple events? Would that end up as a single bad row in the bad folder, or would it have an entry for each bad event?
So for now, when reprocessing bad rows that came from a POST with multiple events, we need to be super careful.
Would it be a good idea to have a process that looks at the run identifier, pulls out all event IDs that successfully made it through to the destination, and then excludes those from the bad row reprocessing?
We would likely run multiple bad row recovery processes over the top of each other, each fixing a different issue, so we would need to be able to pass in multiple run identifiers.
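To make the idea concrete, here is a minimal sketch of that exclusion step. It assumes the event IDs that reached the destination have already been pulled out per run into a plain dictionary, and that each recovered event is a dict with an `event_id` key. The function names and the data layout are hypothetical, not part of any Snowplow tool.

```python
from typing import Dict, Iterable, List, Set


def loaded_event_ids(runs: Dict[str, Set[str]], run_ids: Iterable[str]) -> Set[str]:
    """Union of the event IDs that reached the destination in the given runs.

    `runs` maps a run identifier to the event IDs loaded by that run
    (hypothetical layout; in practice this would come from a query).
    """
    seen: Set[str] = set()
    for run_id in run_ids:
        seen |= runs.get(run_id, set())
    return seen


def events_to_reprocess(recovered: List[dict], already_loaded: Set[str]) -> List[dict]:
    """Keep only recovered events that have not been loaded yet.

    Repeats inside the recovered batch are dropped too, which covers the
    case where the same raw payload appears on several bad lines.
    """
    seen = set(already_loaded)
    kept = []
    for event in recovered:
        event_id = event["event_id"]
        if event_id in seen:
            continue
        seen.add(event_id)
        kept.append(event)
    return kept
```

Passing several run identifiers to `loaded_event_ids` is what would let multiple recovery passes be layered on top of each other without reloading events an earlier pass already fixed.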
Or do you have any other suggestions on how to process bad rows that are in this state? For example, we have a line with 50 events, all 50 of which are bad, so we now have 50 bad lines that are exactly the same and no easy way to break them apart.
What’s the ultimate target of the data - is it Redshift or Postgres? If it is, you can potentially lean on the RDB Shredder to do the dedupe for you “for free”, because with cross-batch dedupe enabled, it won’t load events that it has already processed.
That’s exactly what the DynamoDB-powered cross-batch dedupe in RDB Shredder does.
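As a rough illustration of the concept only (not RDB Shredder's actual code), cross-batch dedupe can be sketched as a "put if absent" against a store that persists between runs. Here an in-memory set stands in for the DynamoDB table, and a duplicate is assumed to mean the same `event_id` and `event_fingerprint`; the class and function names are made up for the example.

```python
from typing import Iterable, List, Set, Tuple


class SeenEvents:
    """In-memory stand-in for a persistent table of already-processed events."""

    def __init__(self) -> None:
        self._keys: Set[Tuple[str, str]] = set()

    def put_if_absent(self, event_id: str, fingerprint: str) -> bool:
        """Record the event and return True, or return False if already recorded."""
        key = (event_id, fingerprint)
        if key in self._keys:
            return False
        self._keys.add(key)
        return True


def dedupe_batch(events: Iterable[dict], store: SeenEvents) -> List[dict]:
    """Return only the events this store has not seen in this or any earlier batch."""
    return [
        e for e in events
        if store.put_if_absent(e["event_id"], e["event_fingerprint"])
    ]
```

Because the store outlives a single batch, replaying a raw payload whose good events were already loaded would, under these assumptions, let only the previously unseen events through.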