We have a requirement to not track data from certain countries. Right now we’re removing the data at the end of the pipeline (deleting from Redshift based on queries), but ideally we wouldn’t let these events get this far along in the pipeline.
Is it possible to drop events earlier? We already use the MaxMind GeoIP enrichment, so the enrichment stage would be one candidate.
Note though that this approach will result in events going into the Failed Events bucket. I don’t think there is currently a way to discard them completely.
One option is to detect the location client-side, and just never track it in the first place.
Obviously it’s not always possible, but it has worked well for others in the past.
This is a good approach, depending on where you want to remove this data. It’s also not uncommon to do this at the load balancer / WAF / CDN level, where you restrict which countries can reach your load balancer. Requests from those countries then never arrive at the collector, so the data cannot be collected at all.
All good ideas, thanks. We will consider removing data within Redshift, which leads to my next question: can RDB Loader emit an event to another process (webhook or similar) after it performs a load into Redshift? We would use that to trigger event deletion.
I saw the folder monitoring support in RDB Loader, but for our use case we need a webhook to be called (ideally) after each “RDB load”.
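For the deletion step itself, here is a minimal sketch in Python of building the statement that would run after each load. It assumes the events are in the standard `atomic.events` table with its `geo_country` column, and a driver that uses `%s` placeholders (psycopg2 style). `build_delete_statement` is a hypothetical helper name, not part of any Snowplow tooling.

```python
from typing import Sequence, Tuple


def build_delete_statement(
    blocked_countries: Sequence[str],
    table: str = "atomic.events",  # assumption: default Snowplow events table
) -> Tuple[str, Tuple[str, ...]]:
    """Return a parameterised DELETE statement and its parameters.

    The table name is interpolated directly, so it must come from trusted
    configuration; the country codes are passed as bound parameters.
    """
    # Normalise, de-duplicate and sort the codes so the output is stable.
    codes = tuple(sorted({c.strip().upper() for c in blocked_countries if c.strip()}))
    if not codes:
        raise ValueError("at least one country code is required")
    placeholders = ", ".join(["%s"] * len(codes))
    sql = f"DELETE FROM {table} WHERE geo_country IN ({placeholders})"
    return sql, codes
```

The returned pair can be handed to a cursor as `cursor.execute(sql, params)`. The country codes you pass in are whatever your own policy requires.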
Hi @pt-mike, yes the loader does have a webhook for this. It is configured in this section of the config file. If you configure this webhook then you get two types of messages:
- A load_succeeded message, conforming to this schema, whenever a batch is successfully loaded.
- An alert message, conforming to this schema, whenever a batch fails, or for various other exceptions.
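As a rough illustration of how a receiver might tell the two message types apart, here is a minimal Python sketch. It assumes each webhook body is a self-describing JSON whose `schema` field holds an Iglu URI of the form `iglu:vendor/name/format/version`. The function name and the returned action labels are hypothetical, so check the payload against the schemas linked above before relying on it.

```python
import json


def classify_loader_message(body: str) -> str:
    """Map a webhook body to an action based on the schema name.

    Returns "trigger_deletion" for load_succeeded, "notify" for alert,
    and "ignore" for anything else (labels are illustrative only).
    """
    try:
        message = json.loads(body)
    except json.JSONDecodeError:
        return "ignore"
    if not isinstance(message, dict):
        return "ignore"
    schema = message.get("schema", "")
    if not isinstance(schema, str):
        return "ignore"
    # Assumed URI layout: iglu:<vendor>/<name>/<format>/<version>
    parts = schema.split("/")
    name = parts[1] if len(parts) >= 2 else ""
    if name == "load_succeeded":
        return "trigger_deletion"
    if name == "alert":
        return "notify"
    return "ignore"
```

A small HTTP endpoint could call this on each POST body and kick off the deletion job only when it returns `"trigger_deletion"`.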