Hi Nir,
First of all – welcome to our forum!
Is there a specific reason not to join against `atomic.events`? Both tables should have the same DISTKEY (event/root ID) and SORTKEY (collector/root timestamp), so joining them is actually quite fast. You might need to enforce uniqueness on the event ID, however, something neither Redshift nor Snowplow does at the moment (background & tutorial).
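For reference, a join along those lines usually looks something like the sketch below. The custom table name `atomic.com_acme_my_event_1` is just a placeholder for whatever schema you are using; the `root_id` / `root_tstamp` columns are what the loader adds to shredded tables so they can be joined back to `atomic.events`:

```sql
-- Sketch: join a custom event/context table back onto atomic.events.
-- atomic.com_acme_my_event_1 is a placeholder for your own table.
SELECT
  ev.user_id,
  ev.geo_country,
  cu.*
FROM atomic.com_acme_my_event_1 cu
JOIN atomic.events ev
  ON  cu.root_id     = ev.event_id          -- both tables share this DISTKEY
  AND cu.root_tstamp = ev.collector_tstamp  -- both tables share this SORTKEY
WHERE ev.collector_tstamp >= '2016-01-01';  -- restrict the scan to the range you need
```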
To answer your specific question, it depends on what you want to track. With Snowplow, you can define your own events and contexts (we call them self-describing or unstructured events - is this a process you are familiar with?). If you have the information available in the tracker, then you can send it that way, and it will end up in the custom table.
A good example is the user ID. You could either use `setUserId` (in which case the user ID will end up in `user_id` in `atomic.events`), or you could define your own event or context and track the user ID by sending the data as a self-describing JSON (in which case there will be a `user_id` field in that custom event/context table).
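To make the difference concrete, here is roughly where the value ends up in Redshift in each case (the custom table name below is hypothetical):

```sql
-- Option 1: a user ID set with setUserId lands in atomic.events
SELECT user_id, COUNT(*) AS n_events
FROM atomic.events
GROUP BY 1;

-- Option 2: a user ID sent in a custom self-describing event/context lands in its own table
-- (atomic.com_acme_user_context_1 is a placeholder for the schema you would define)
SELECT user_id, COUNT(*) AS n_events
FROM atomic.com_acme_user_context_1
GROUP BY 1;
```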
That said, it’s a little more difficult with `geo_location`, because that field is calculated during enrichment (one of the pipeline stages) by looking up the IP address against a location database. If you want, you can write an enrichment that takes certain values, transforms them, and adds them to a derived context. An alternative is to use SQL Runner to run a set of SQL queries that perform the join once each time new events are loaded into Redshift, so you don’t need to run it every time you consume the data.
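As a sketch of the SQL Runner option, you could have it run something like the query below after each load, so the geo fields are pre-joined into a derived table and consumers never have to repeat the join (the table names are placeholders again; a simple full rebuild is shown, but an incremental insert would work too):

```sql
-- Rebuild a derived table with the geo fields pre-joined onto the custom table.
-- derived.my_event_with_geo and atomic.com_acme_my_event_1 are placeholders.
DROP TABLE IF EXISTS derived.my_event_with_geo;

CREATE TABLE derived.my_event_with_geo
  DISTKEY (root_id)
  SORTKEY (root_tstamp)
AS
SELECT
  cu.*,
  ev.geo_country,
  ev.geo_city,
  ev.geo_latitude,
  ev.geo_longitude
FROM atomic.com_acme_my_event_1 cu
JOIN atomic.events ev
  ON  cu.root_id     = ev.event_id
  AND cu.root_tstamp = ev.collector_tstamp;
```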
Does that answer your question? Don’t hesitate to follow up if you have any questions about the specific features I mentioned.
Christophe