A lot of our data is sent in to our database raw, and with multiple lines per entity. For example, when a user signs up, we send in a record (user ID, registration date, email etc.), and subsequently, when the user’s account changes (when they login, change any information etc.) we send in another row of information which sits in a a long table. In the data modelling stage we then ‘flatten’ the table to the most recent set of information as ordered by the root_tstamp.
Physically achieving this is done mostly by using the LAST_VALUE window function in Redshift, partitioned by the entity (user_id in this case). We build a prep table using a WITH statement (a Common Table Expression, I believe is the technical term…) and then select from that and group by the main entity identifier (for example, User ID).
This has been working well for us, however, on occasions where the raw table is wide (lots of columns of information) the query can become very long and difficult to follow. We’ve also come up against strange assert errors in Redshift using lots of these functions in a query, meaning the query doesn’t run. It also means that with long the GROUP BY statements needed to aggregate this information, the likelihood of a Cartesian Product starts becoming more and more likely.
Therefore, is there a best practice when it comes to aggregating, or ‘flattening’ a table to only return one line per entity, with the most recent information? Is using window functions the best way to go about this? Or are there other grouping and aggregating functions in Redshift that are better suited to this task?
A long question, I know - so any help is hugely appreciated!
Some thoughts about this, because we have a similar use-case, not sure how deep you’re already into that topic, sry if I suggest some obvious things:
what can you do against cartesian products?
- we have a data-model that removes bot traffic, that runs before every other data-model (bots almost always create duplicate event_ids)
- de-duplicate events where you can, have a look how many event_id duplicates you have and make sure to remove them in the events table and contexts as good as possible
- join on both event_id and collector_tstamp to mitigate the cartesian product
- I’m not sure how your query looks, but you should almost never use * like in SELECT * FROM events
- Does your data-model run is incremental? So that you only flatten the the newest data-load?
In the general the window function seems to be correct for that kind of data you have. But it’s hard to say with a specific use-case
Hi @jrpeck1989 .
If I understand your question correctly, you have a table with multiple rows per user; and you want to only select the row that has the latest data.
You can achieve this in two steps.
First, create a lookup table that contains the
root_id of the rows you want to keep. (Each new record is a new event, so should have a unique
last_value(t.root_id) OVER (PARTITION BY ev.user_id ORDER BY t.root_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM table_name AS t
JOIN atomic.events AS ev
ON t.root_id = ev.event_id and t.root_tstamp = ev.collector_tstamp
user_id is not already present in
table_name and you have to join on to
atomic.events to get it. Otherwise, the query is much simpler.
Now that you have the lookup table, you can select all or some columns from
table_name, limiting the results to only the latest
WHERE root_id IN (SELECT root_id FROM lookup_table)
Thanks Dilyan! I changed the query for the lookup table slightly to the following:
last_value(t.root_id) OVER (PARTITION BY t.user_id ORDER BY t.root_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING )
FROM table_name AS t
It now appears to work perfectly! Thanks!