As we work through the base Snowplow configuration, we have received some requirements from our BI team.
We basically want to load the entire Snowplow atomic.events dataset into our own table, or alternatively add some fields to the current atomic.events table.
I know I can load my own data into other tables and have done so successfully.
However, our BI team does not want to constantly join against atomic.events and wants to partition the data daily. We are very keen on using the enrichment features of Snowplow and the entire pipeline, but we want to control the loading process and simplify it for our BI process.
You should be able to use SQL Runner for something like that.
I don't know the specifics of what your BI team is asking for, but in theory you should be able to write a playbook that runs an ETL process on the data in the atomic schema and sends the output to another table, or even another schema, partitioned by day.
This way you can keep atomic.events matching Snowplow releases, and have another set of data that the BI team can use.
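For illustration, here is a minimal sketch of the kind of SQL step such a playbook could run (the bi.events_daily table name and the column selection are assumptions, not a prescribed layout):

```sql
-- Hypothetical nightly step run by SQL Runner: copy yesterday's events
-- into a BI-owned table keyed and sorted by day. Redshift has no native
-- partitions, so a date sortkey stands in for daily partitioning here.
CREATE TABLE IF NOT EXISTS bi.events_daily (
  event_id         CHAR(36),
  event_date       DATE,
  collector_tstamp TIMESTAMP,
  app_id           VARCHAR(255),
  domain_userid    VARCHAR(128),
  page_url         VARCHAR(4096),
  geo_country      CHAR(2)
)
DISTKEY (domain_userid)
SORTKEY (event_date);

INSERT INTO bi.events_daily
SELECT
  e.event_id,
  TRUNC(e.collector_tstamp) AS event_date,
  e.collector_tstamp,
  e.app_id,
  e.domain_userid,
  e.page_url,
  e.geo_country
FROM atomic.events e
WHERE TRUNC(e.collector_tstamp) = CURRENT_DATE - 1;
```

A SQL Runner playbook would then just reference a file like this and run it on whatever schedule suits the BI team.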
@13scoobie, I’m working in the BI department at Nir’s company.
Doing it through an SQL ETL process is very inefficient.
The right way is to copy the data directly into the correct table with the correct fields; that saves a lot of time.
We have a very similar use case and SQL works extremely well for us. We load nearly 100 million events every night. Immediately after populating atomic.events, we run a set of jobs that create smaller tables with the data we care about, and then archive the atomic data older than a few days.
We use Airflow to do all of that, but it should work with SQL Runner just the same.
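As a sketch, one of those nightly jobs might look something like this (the bi.page_views table, the seven-day retention window, and the column choices are just assumptions for illustration):

```sql
-- Hypothetical nightly job: build a slim page-view table for the BI
-- team from yesterday's enriched data, then trim atomic.events down
-- to a short retention window (an UNLOAD to S3 would normally run
-- before the DELETE to archive the older data).
INSERT INTO bi.page_views
SELECT
  event_id,
  TRUNC(collector_tstamp) AS event_date,
  domain_userid,
  domain_sessionid,
  page_urlpath
FROM atomic.events
WHERE event = 'page_view'
  AND collector_tstamp >= CURRENT_DATE - 1
  AND collector_tstamp <  CURRENT_DATE;

DELETE FROM atomic.events
WHERE collector_tstamp < CURRENT_DATE - 7;
```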
If that's not good enough, you could look into processing the enriched events with MapReduce or Spark, or even plug into Kinesis to digest events in real time.
Tables like link_click I totally understand, but I can't really understand the client_session table.
That data is needed for every event, and joining for it every time hurts performance.
The best thing would be to perform the enrichment and return the original data plus the results as JSON. With a JSON file you can use Redshift's COPY from JSON, which copies only the fields that exist in the target table, so the field order doesn't matter. It would also let us add any extra fields we like.
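For reference, Redshift's COPY with the JSON 'auto' option does behave that way: it matches JSON keys to column names, ignores keys that have no matching column, and doesn't care about field order. A rough sketch, where the table name, bucket path, and IAM role are placeholders:

```sql
-- Hypothetical load of enriched events written out as JSON.
-- 'auto' maps JSON keys to columns by name, so extra keys are
-- skipped and ordering is irrelevant.
COPY bi.enriched_events
FROM 's3://my-enriched-bucket/run=2016-01-01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS JSON 'auto'
TIMEFORMAT 'auto'
GZIP;
```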
- Some users prefer to implement their own sessionization in their event data modeling phase.
- Some users emit events from server-side and other environments where the concept of a session doesn't apply.
Our long-term roadmap is to factor everything out of atomic.events, so that it eventually becomes just a reference map to all the snapshotted entities for the given event. Long term, we can't assume which entities will be "normal" for an event to contain.
If you know with reasonable certainty which entities your events will always contain, it's totally valid to rebuild your own "fat table" in Redshift prior to analysis - it will be more performant, at the cost of some flexibility.
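As a rough sketch, such a rebuild can be a single CREATE TABLE AS that does the join once, up front. The session context table name and its columns below are assumptions, but Snowplow context tables do join back to the parent event on root_id = event_id and root_tstamp = collector_tstamp:

```sql
-- Hypothetical "fat table": join the contexts you always expect onto
-- atomic.events once, so analysts never join at query time.
CREATE TABLE bi.events_fat
DISTKEY (domain_userid)
SORTKEY (collector_tstamp)
AS
SELECT
  e.event_id,
  e.collector_tstamp,
  e.domain_userid,
  e.page_urlpath,
  e.geo_country,
  s.session_index,       -- assumed field in the custom session context
  s.first_page_urlpath   -- assumed field in the custom session context
FROM atomic.events e
LEFT JOIN atomic.com_acme_session_context_1 s
  ON  s.root_id     = e.event_id
  AND s.root_tstamp = e.collector_tstamp;
```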
We know the structure of the table we'd like to have, but we'd also like your great enrichment fields to go in there. If that's possible, it would solve everything.