We have a homegrown AB test script that selects users into variants for experiments running on the site. Right now, we:
Create a custom context that tracks the experiment/variant groups a user has been selected into, and attach it to trackPageview so that all page-view and page-ping events carry that context
We have a set of modeled tables, based on example SQL files provided by @christophe, that all business reporting is built off of. One of these tables maps the web_page_id to the experiment/variant selections for the user on a page view; all other business reporting tables can join to it if we want to segment by AB test groups.
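To make the setup concrete, here's a minimal sketch of that join, with hypothetical table and column names (demonstrated with an in-memory SQLite database rather than our actual Redshift schema):

```python
# Sketch of segmenting a business metric by AB test variant via web_page_id.
# Table and column names are illustrative, not our real schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- one row per (page view, experiment) selection
    CREATE TABLE web_experiments (
        web_page_id TEXT,
        experiment  TEXT,
        variant     TEXT
    );
    -- any business reporting table keyed by web_page_id
    CREATE TABLE orders (
        web_page_id TEXT,
        revenue     REAL
    );
    INSERT INTO web_experiments VALUES
        ('pv-1', 'checkout_test', 'control'),
        ('pv-2', 'checkout_test', 'treatment');
    INSERT INTO orders VALUES ('pv-1', 10.0), ('pv-2', 25.0);
""")

rows = conn.execute("""
    SELECT e.variant, SUM(o.revenue)
    FROM orders o
    JOIN web_experiments e USING (web_page_id)
    GROUP BY e.variant
    ORDER BY e.variant
""").fetchall()
print(rows)  # [('control', 10.0), ('treatment', 25.0)]
```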
The issue we're having is that even with a moderate amount of data, the data modeling step that builds the temp tables to track test variants is the longest-running step (over half an hour), and we've noticed that the size of that context table grows faster than even the atomic events table (i.e. for 54 million events in the events table, there are currently about 86 million rows just in that custom context table).
That makes me worried about how long the modeling scripts will take to run as our data scales up (this is only about a month's worth of data so far). Given that for our data modeling we really only need to join the web_page_id (from the web page context) to the AB test selections, it seems wasteful to store those selections for every page ping as well. Is there a recommended alternative way to do this? I'm wondering if I can have Snowplow fire that context only on the initial page-view event and not on the subsequent page-ping events. Or should I model a separate event type altogether that captures the user's AB test variant selections but fires only once per page view? Or are there other suggestions for how to handle this?
We ran into the exact same problem. Your approach of attaching A/Bs as custom contexts to page views looks good to me - the issue you’re seeing with the data modeling step is caused by rebuilding the entire table on each run.
Christophe has a great tutorial on how to make these models incremental. Using this, we were able to reduce our web-experiment model runtime from 1.5h to 5mins.
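The gist of the incremental pattern, sketched with hypothetical table names in SQLite: keep track of the latest event already modeled, and on each run only process events newer than that watermark instead of rebuilding the whole table.

```python
# Incremental model sketch (illustrative schema, not the tutorial's exact SQL):
# append only events newer than what's already in the derived table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id TEXT, collector_tstamp INTEGER);
    CREATE TABLE derived_web_experiments (event_id TEXT, collector_tstamp INTEGER);
    INSERT INTO events VALUES ('e1', 100), ('e2', 200);
""")

def run_model(conn):
    # Watermark: max timestamp already present in the derived table.
    (wm,) = conn.execute(
        "SELECT COALESCE(MAX(collector_tstamp), 0) FROM derived_web_experiments"
    ).fetchone()
    conn.execute(
        "INSERT INTO derived_web_experiments "
        "SELECT event_id, collector_tstamp FROM events WHERE collector_tstamp > ?",
        (wm,),
    )

run_model(conn)                       # first run processes e1, e2
conn.execute("INSERT INTO events VALUES ('e3', 300)")
run_model(conn)                       # second run processes only e3
count = conn.execute("SELECT COUNT(*) FROM derived_web_experiments").fetchone()[0]
print(count)  # 3
```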
We also run our split-testing tool using Snowplow as storage, but we calculate which events are participating in a test at analysis time, i.e. not via contexts attached client-side.
1. On the client side
Each time a user encounters a test on a page (or some other dynamic event makes them part of a test), we fire a "variant exposure" event.
2. On the server side
When we analyse the data, we query the exposure events and look at first exposure times and first conversion times by user (counting only conversion events that happened after the exposure time). Credit for this approach goes to @yali, who suggested it to me years ago, before contexts existed - I think he'd recommend contexts these days though.
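A minimal sketch of that query shape, with hypothetical event tables (your real query would work off atomic.events and the exposure event's shredded table):

```python
# Take each user's first exposure per variant, then keep only conversions
# that happened at or after that exposure. Illustrative tables/columns.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exposures   (user_id TEXT, variant TEXT, tstamp INTEGER);
    CREATE TABLE conversions (user_id TEXT, tstamp INTEGER);
    INSERT INTO exposures VALUES
        ('u1', 'control', 10), ('u1', 'control', 50), ('u2', 'treatment', 30);
    -- u2's conversion happened before exposure, so it shouldn't count
    INSERT INTO conversions VALUES ('u1', 20), ('u2', 5);
""")

rows = conn.execute("""
    WITH first_exposure AS (
        SELECT user_id, variant, MIN(tstamp) AS exposed_at
        FROM exposures GROUP BY user_id, variant
    )
    SELECT f.variant, COUNT(c.user_id) AS conversions
    FROM first_exposure f
    LEFT JOIN conversions c
        ON c.user_id = f.user_id AND c.tstamp >= f.exposed_at
    GROUP BY f.variant ORDER BY f.variant
""").fetchall()
print(rows)  # [('control', 1), ('treatment', 0)]
```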
Performance-wise we're pretty happy with it. We never use data modelling on our events, because a lot of our experiments tend to be bespoke and results for queries often come back in 5-10 seconds. We currently have a test with >2M users exposed… The query for that test took 35 seconds this evening - no data modelling.
Downsides to calculating server side
- Harder to model
- Can take up more space in Redshift if pages per visit are low
Benefits of server-side calculation
- No need to trust split test assignment cookies… or sharing those across devices and platforms
- While the initial exposure to a test is written as a separate event to atomic.events, we’re not generating 3-4 exposure contexts upon each page view / event
- Forces you to consider the user key for the test (should we use a first/third party cookie? do we need to track users across devices? user stitching?)
- Events that take place on another system/device are accessible by default
Thanks @bernardosrulzon! Looking at the incremental modeling article now.
However, we do stop experiments and start new ones, and we have users who return to the site over long periods of time. When a new experiment starts, user selection is reset. When comparing metrics for different variants, we want our reporting to reflect only the events that were actually exposed to the variant, and not drag in events from users who were once selected into the variant but are no longer in it because the experiment ended.
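One way to express that constraint, sketched with hypothetical tables: keep a table of experiment run windows and only credit events that fall between a user's exposure and the experiment's end.

```python
# Bound credited events by both the user's exposure time and the
# experiment's run window. Illustrative schema, not our real tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE experiment_runs (experiment TEXT, started_at INTEGER, ended_at INTEGER);
    CREATE TABLE exposures (user_id TEXT, experiment TEXT, variant TEXT, tstamp INTEGER);
    CREATE TABLE events (user_id TEXT, tstamp INTEGER);
    INSERT INTO experiment_runs VALUES ('checkout_test', 0, 100);
    INSERT INTO exposures VALUES ('u1', 'checkout_test', 'control', 10);
    -- one event during the experiment, one after it ended
    INSERT INTO events VALUES ('u1', 50), ('u1', 150);
""")

rows = conn.execute("""
    SELECT x.variant, COUNT(*) AS events
    FROM exposures x
    JOIN experiment_runs r ON r.experiment = x.experiment
    JOIN events ev
        ON ev.user_id = x.user_id
       AND ev.tstamp BETWEEN x.tstamp AND r.ended_at
    GROUP BY x.variant
""").fetchall()
print(rows)  # [('control', 1)]
```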
Do you have this same issue in your tests, and if so, how do you handle it with Snowplow events?
Data modeling is the best choice for us not just for performance: we will have other people in the company reading and writing queries and using analytics tools (Tableau, etc.), and we want a more logical, standard schema for them to pull from (i.e. one that doesn't require understanding the gritty details of the interplay of tables in the atomic schema).
I don't suppose it's possible to have a context that fires only on the page-view event and not on the subsequent page-ping events, is there?