You can read what happened in the post below. Our Iglu server was unavailable for about a week, so all the unstructured events that couldn't be validated against Iglu didn't make it into Redshift; the structured events were fine and loaded. Any advice on the best way to rerun the data for the last 7 days without causing duplicates in Redshift? Would turning on de-duplication in enrichment cause only the rows that are missing from Redshift to be loaded? The other option is to just re-process everything and then run the de-duplication SQL I saw in another post.
I don't think de-duplication in enrichment would work, since we've never had it on and it only works within a single batch run, right? It doesn't compare what's in Redshift against what's being run through the ETL. We'd have to run all the logs since 6/14 as one big batch for it to work, and even then we'd still have dupes in Redshift from the previous good loads.
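For context, the in-database cleanup after a re-run is usually along these lines. This is only a sketch, not the exact SQL from the other post: it assumes the standard Snowplow `atomic.events` table and that the re-loaded rows are exact duplicates of previously loaded rows (since Redshift has no row identifier, the common pattern is to rebuild the table from `SELECT DISTINCT` rather than delete in place).

```sql
-- Sketch only: collapse exact duplicate rows in atomic.events.
-- Assumes re-processed rows are byte-for-byte identical to the originals;
-- adjust the table name and add a WHERE clause on collector_tstamp to
-- limit the rebuild to the affected date range if the table is large.
BEGIN;

-- Stage one copy of each distinct row.
CREATE TEMP TABLE events_dedup AS
SELECT DISTINCT *
FROM atomic.events;

-- Swap the deduplicated rows back in.
DELETE FROM atomic.events;

INSERT INTO atomic.events
SELECT * FROM events_dedup;

COMMIT;
```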
Because you are only using GETs, you won't encounter the duplication problem that can occur when recovering bad events from POST payloads (see the caveats section in the documentation).