Hi there,
I would like to understand why the foreign keys in Redshift are not set to cascade.
When I delete events, I would like all related data to be deleted as well.
Enrico
Foreign keys (along with some other constraints) aren’t enforced by Redshift.
Thanks. We reprocessed some logs and now have to deal with duplicated events.
Are there any good ways of cleaning this up? I was under the impression that events that already exist would not be imported again.
Enrico
The guide that @christophe has written here is probably the most comprehensive documentation on deduplication (both pre-EMR and post-EMR).
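For the common case where duplicates came from re-running a load (so the copies differ only in `etl_tstamp`), one in-database pattern is to keep the earliest copy of each `event_id` and delete the rest. This is only a sketch, not the full approach from the guide, which also covers harder cases such as identical `event_id`s with different payloads:

```sql
-- Sketch: remove later re-loaded copies of each event, keeping the first.
-- Assumes duplicates arrived in separate ETL runs (distinct etl_tstamp).
DELETE FROM atomic.events
USING (
  SELECT event_id, MIN(etl_tstamp) AS first_etl
  FROM atomic.events
  GROUP BY event_id
  HAVING COUNT(*) > 1
) dupes
WHERE atomic.events.event_id = dupes.event_id
  AND atomic.events.etl_tstamp > dupes.first_etl;
```

Remember that the matching rows in the context tables (joined on `root_id`) would need the same treatment, since Redshift will not cascade the delete for you.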
Further to @mike’s points, there is no CASCADE in Redshift constraints:
http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html
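In practice that means the “cascade” has to be done by hand: delete from the child tables first, then from `atomic.events`. A minimal sketch, where the context table name and the filter are illustrative placeholders:

```sql
-- No ON DELETE CASCADE in Redshift, so delete child rows explicitly.
-- Snowplow context tables reference atomic.events via root_id = event_id.
DELETE FROM atomic.com_acme_my_context_1   -- hypothetical context table
WHERE root_id IN (
  SELECT event_id FROM atomic.events WHERE app_id = 'old-app'  -- example filter
);

DELETE FROM atomic.events
WHERE app_id = 'old-app';
```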
Hi, I also want to ask about this and to see if there is any workaround.
My use case is different from @estahn’s, as I am just trying to delete events that were loaded after a certain period of time. I think it would make sense to have a cascade function, so I could just delete some rows in the atomic.events table and let the event_id corresponding to each deletion cascade to all the context tables (since these context tables use root_id to refer to the event_id).
My plan for this is:
1. Delete the rows from the events table based on etl_tstamp.
2. Delete the corresponding rows from each context table based on root_tstamp.
It will be tedious though, especially if we have a lot of context tables.
@aditya there’s a difference between those timestamps: root_tstamp corresponds to collector_tstamp, while etl_tstamp is the time that enrich processed the record.
The best option here is to get all the event_ids from atomic.events for the timeframe you want to delete, and then delete any entries in all tables with those event_ids. You could write a script to get all table names for the atomic schema from PG_TABLE_DEF and plug those into your queries.
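As a sketch of what such a script would run (the temp table name, context table name, and cutoff date are all illustrative assumptions):

```sql
-- 1. List the tables in the atomic schema (feed these into your script):
SELECT DISTINCT tablename
FROM pg_table_def
WHERE schemaname = 'atomic';

-- 2. Capture the event_ids for the timeframe you want to delete:
CREATE TEMP TABLE ids_to_delete AS
SELECT event_id
FROM atomic.events
WHERE collector_tstamp < '2016-01-01';   -- example cutoff

-- 3. For each context table returned in step 1, emit a delete like:
DELETE FROM atomic.my_context_table      -- hypothetical table name
WHERE root_id IN (SELECT event_id FROM ids_to_delete);

-- 4. Finally, remove the parent rows:
DELETE FROM atomic.events
WHERE event_id IN (SELECT event_id FROM ids_to_delete);
```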
Using timestamps is simpler, so if all you care about is expiring data that’s more than x days old, then use collector_tstamp for atomic.events and root_tstamp for all other tables.
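That timestamp-based variant needs no event_id lookup at all; each table can be expired independently. A sketch, assuming a 90-day retention window (the context table name is again a placeholder):

```sql
-- Expire everything older than 90 days, table by table.
DELETE FROM atomic.events
WHERE collector_tstamp < DATEADD(day, -90, GETDATE());

DELETE FROM atomic.my_context_table      -- repeat for each context table
WHERE root_tstamp < DATEADD(day, -90, GETDATE());
```

Since root_tstamp mirrors collector_tstamp, both deletes target the same set of events even though they run separately.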
Best,
Gotcha. I process records daily, so that’s why I wanted to delete records processed on a certain day.
But you are right, we don’t have etl_tstamp on the context tables. I ended up filtering using collector_tstamp for the events table and root_tstamp for the context tables.