Hey,
I have created a custom table and I am able to load data into Redshift. In order to avoid joins with atomic.events, I would like to incorporate some fields (user_id, geo_location, etc.) from atomic.events into it.
Is this possible?
Thanks,
Nir
Hi Nir,
First of all – welcome to our forum!
Is there a specific reason not to join to atomic.events? Both tables should have the same DISTKEY (event/root ID) and SORTKEY (collector/root timestamp), so joining them is actually quite fast. You might need to enforce uniqueness on the event ID, however, something neither Redshift nor Snowplow does at the moment (background & tutorial).
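For illustration, here is a minimal sketch of such a join. The table name my_custom_context_1 is hypothetical, but the root_id/root_tstamp columns are what igluctl-generated tables use to point back at the parent event:

```sql
-- Minimal sketch: join a custom event/context table back to atomic.events.
-- The table name my_custom_context_1 is hypothetical; root_id/root_tstamp
-- line up with event_id/collector_tstamp, i.e. the shared DISTKEY/SORTKEY.
SELECT
  ev.user_id,
  ev.geo_country,
  cx.*
FROM atomic.events AS ev
JOIN atomic.my_custom_context_1 AS cx
  ON  cx.root_id     = ev.event_id
  AND cx.root_tstamp = ev.collector_tstamp;
```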
To answer your specific question, it depends on what you want to track. With Snowplow, you can define your own events and contexts (we call them self-describing or unstructured events - is this a process you are familiar with?). If you have the information available in the tracker, then you can send it that way, and it will end up in the custom table.
A good example is the user ID. You could either use setUserId (in which case the user ID will end up in user_id in atomic.events), or you could define your own event or context and track the user ID by sending the data as a self-describing JSON (in which case there will be a user_id field in that custom event/context table).
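As a rough sketch of the two options with the JavaScript tracker (the schema URI and its fields below are made up for illustration; on older tracker versions trackSelfDescribingEvent is called trackUnstructEvent):

```javascript
// Option 1: setUserId populates user_id in atomic.events.
window.snowplow('setUserId', 'user-12345');

// Option 2: send the user ID inside a self-describing event, so it lands
// in the custom table instead. The schema URI and fields are hypothetical.
window.snowplow('trackSelfDescribingEvent', {
  schema: 'iglu:com.acme/my_event/jsonschema/1-0-0',
  data: {
    user_id: 'user-12345'
  }
});
```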
That said, it’s a little more difficult with geo_location, because that field is calculated during enrichment (one of the pipeline stages) by looking up the IP address against a location database. If you want, you can write an enrichment that takes certain values, transforms them, and adds them to a derived context. An alternative is to use SQL Runner to run a set of SQL queries that do the join once each time new events are loaded into Redshift, so you don’t need to run it every time you consume the data.
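A sketch of what such a SQL Runner step could look like, assuming hypothetical table names and a simple high-water-mark filter to pick out the latest batch:

```sql
-- Hypothetical SQL Runner step: after each load, join the new rows of the
-- custom table to atomic.events once and persist the geo fields, so
-- downstream queries never need to repeat the join.
INSERT INTO derived.my_custom_context_geo
SELECT
  cx.*,
  ev.geo_country,
  ev.geo_city,
  ev.geo_location
FROM atomic.my_custom_context_1 AS cx
JOIN atomic.events AS ev
  ON  cx.root_id     = ev.event_id
  AND cx.root_tstamp = ev.collector_tstamp
WHERE cx.root_tstamp >
      COALESCE((SELECT MAX(root_tstamp) FROM derived.my_custom_context_geo),
               '1970-01-01');
```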
Does that answer your question? Don’t hesitate to ask for follow-up if you have any questions about specific features I mentioned.
Christophe
Thanks for the quick reply,
Yeah, I have already created a custom self-describing JSON and was able to load events into my own custom table in Redshift using Snowplow.
Regarding geo_location: would it be possible to set up the same enrichment to work on my custom event, using the same “looking up the IP address against a location database”?
Also, what about domain_userid: in which step is it created (client/enrichment)?
Hi Nir,
The domain user ID is generated by the JavaScript tracker and stored in a cookie. It should be possible to read the ID from that cookie: https://github.com/snowplow/snowplow/wiki/1-General-parameters-for-the-Javascript-tracker#cookie-name
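For example, a minimal client-side sketch; this assumes the default cookie name prefix _sp_id. (configurable via the tracker’s cookie name setting) and the documented dot-separated value format, whose first field is the domain user ID:

```javascript
// Minimal sketch: read the Snowplow domain user ID from the first-party
// cookie. Assumes the default "_sp_id." cookie name prefix and the
// dot-separated value format, where the first field is the domain user ID.
function getDomainUserId() {
  var match = document.cookie.match(/_sp_id\.[a-f0-9]+=([^;]+)/);
  return match ? match[1].split('.')[0] : null;
}
```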
It will be a bit more work to get the location from the IP address. The easiest solution would be to do it in SQL (do the join once, only on the events that were in the most recent batch, and insert the result into a separate table). The current geo IP enrichment cannot be set up to populate other fields, but you can write your own enrichment (e.g. API request enrichment, JavaScript script enrichment, SQL query enrichment).
That said – I still recommend simply joining the table to atomic.events. All of this creates additional complexity that shouldn’t be necessary.
Hope this helps,
Christophe
Could you provide an example of what a custom table for an unstructured event would look like? I haven’t been able to find much guidance in this area.
Also, how does the storage loader know which Redshift table to put data into?
Hi @cdimitroulas,
You can find examples in Iglu Central: https://github.com/snowplow/iglu-central. Iglu Central is a central repository for storing JSON Schemas, Avros and Thrifts, maintained and hosted by the team at Snowplow Analytics; think of it like RubyGems.org or Maven Central, but for storing data schemas instead of code.
For example, this is the JSON schema for a generic ad impression event we defined: ad_impression.
Two more files are created from the JSON schema: a Redshift table definition (the CREATE TABLE DDL) and a JSONPaths file, which maps the JSON’s properties onto the table’s columns. You can use igluctl to auto-generate both from the schema. The table name follows straight from the schema:
- schema vendor
- schema name
- schema version
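For the ad_impression schema above, the generated DDL looks roughly like this (abridged; the exact column encodings and the full column list come from igluctl):

```sql
-- Roughly what igluctl generates for
-- iglu:com.snowplowanalytics.snowplow/ad_impression/jsonschema/1-0-0
-- (abridged). Vendor (dots become underscores), schema name and major
-- version together form the table name.
CREATE TABLE atomic.com_snowplowanalytics_snowplow_ad_impression_1 (
  -- Shredding metadata linking each row back to its parent event
  root_id        CHAR(36)      NOT NULL,  -- joins to atomic.events.event_id
  root_tstamp    TIMESTAMP     NOT NULL,  -- joins to collector_tstamp
  ref_root       VARCHAR(255)  NOT NULL,
  ref_tree       VARCHAR(1500) NOT NULL,
  ref_parent     VARCHAR(255)  NOT NULL,
  -- One column per property in the schema (snake_cased)
  impression_id  VARCHAR(255),
  cost_model     VARCHAR(255),
  cost           DOUBLE PRECISION
  -- ... further columns follow the schema's remaining properties
)
DISTSTYLE KEY
DISTKEY (root_id)
SORTKEY (root_tstamp);
```

This also answers your second question: the loader derives the target table from each shredded type’s schema vendor, name and major version, so every self-describing event or context resolves to exactly one table.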
If you want to learn more about the role of JSONPaths, I recommend reading this excellent post by @ihor on the role they play in the loading process: JSONPaths files demystified
Hope this helps!