Clickstream data warehousing guide

I wrote a sizeable piece on clickstream data use cases, ownership, and the available options - paid, free, and open source. Any feedback would be very welcome.

A guide to data warehousing clickstream data


Wow, that’s a big post @evaldas - it’s past 23:00 here, so I’m going to have to read it after some shut-eye. For now, I skipped to the experiments side of things because that’s our specialty.

It frustrates me that SaaS testing tools (Optimizely, VWO etc) hide so much useful data from analysts for the sake of simplifying their products. But when you track experiments into your own DW, you unlock lots of benefits:

  • Assigning treatments to different units (do we only want to split traffic at the user level? How about splitting at the product or session level?) - you only get that freedom in tools like Snowplow; see the sketch after this list
  • Total flexibility with the metrics you can measure & reports you can generate (your job is no longer defined by the capabilities of your SaaS testing vendor)
  • Data in your own DW may be more trustworthy than the figures out of a SaaS product
  • Better support for filtering out bots and other confounding data in your experiment
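
To make that first point concrete, here’s a rough sketch of deterministic, hash-based assignment where the unit is just a parameter - swap in a user id, session id, or product id and you’ve changed what you split on. It’s plain Java with no vendor SDK; the class name and salt are mine, and this is roughly the trick PlanOut-style tools use under the hood, not any particular library’s API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Minimal sketch of deterministic assignment, independent of any vendor SDK.
// The unit id can be a user id, session id, or product id - whatever you want
// to split on. The same input always yields the same variant, so assignments
// can be recomputed and audited in the warehouse later.
public final class Assigner {

    /** Hash "experimentSalt:unitId" into [0, 1) and pick a variant by weight. */
    public static String assign(String experimentSalt, String unitId,
                                String[] variants, double[] weights) {
        double point = hashToUnitInterval(experimentSalt + ":" + unitId);
        double cumulative = 0.0;
        for (int i = 0; i < variants.length; i++) {
            cumulative += weights[i];
            if (point < cumulative) {
                return variants[i];
            }
        }
        return variants[variants.length - 1]; // guard against rounding error
    }

    private static double hashToUnitInterval(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            long value = 0L;
            for (int i = 0; i < 7; i++) { // first 56 bits as a non-negative long
                value = (value << 8) | (digest[i] & 0xFF);
            }
            return (double) value / (double) (1L << 56);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 is always present
        }
    }

    public static void main(String[] args) {
        // Split at the product level instead of the user level:
        System.out.println(assign("checkout_badge_test", "product-8841",
                new String[]{"control", "treatment"}, new double[]{0.5, 0.5}));
    }
}
```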

Companies invest tons of money into test development and tools - why not analyse these rich experiments with the most complete dataset available?

@robkingston thanks for your great feedback. I share your thoughts on A/B experiments. I tried VWO in the past; if I remember correctly, it has a fairly comprehensive UI and feature set, but, as you said, the data was locked in their platform. As a workaround, I initially added an extra context that mapped to VWO’s JS experiment variable data. Eventually we stopped using it and stuck to tracking experiments ourselves, since that gives all the flexibility you mentioned - especially joining to other datasets like sales, which as far as I know is hard to do in VWO unless you send them as a separate metric.
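
In case it helps anyone reading along, here’s a sketch of what “tracking experiments ourselves” can look like: a self-describing context attached to every tracked event, so each event carries its assignments into the warehouse. The iglu schema URI and field names are made up for illustration - you’d define and host your own:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of an experiment context attached to every tracked event. Snowplow
// custom contexts are self-describing JSON: a schema URI plus a data payload.
// The iglu URI and field names below are made up for illustration - you would
// define and host your own schema.
public final class ExperimentContext {

    public static Map<String, Object> build(String experiment, String variant,
                                            String unitType, String unitId) {
        Map<String, Object> data = new LinkedHashMap<>();
        data.put("experiment", experiment);
        data.put("variant", variant);
        data.put("unit_type", unitType); // "user", "session", "product", ...
        data.put("unit_id", unitId);

        Map<String, Object> context = new LinkedHashMap<>();
        context.put("schema", "iglu:com.acme/experiment/jsonschema/1-0-0");
        context.put("data", data);
        return context;
    }

    public static void main(String[] args) {
        System.out.println(build("checkout_badge_test", "treatment",
                "product", "product-8841"));
    }
}
```

Once these land in the warehouse as their own table, joining assignments to sales (or anything else) is a plain SQL join rather than a vendor feature.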

Also, with SaaS tools it’s harder to test something that’s not in the UI, as it requires calling their API on the backend to know which variation should be served. We have avoided that by combining this lib - https://github.com/Glassdoor/planout4j - with the Snowplow tracker. This minimizes the performance penalty on each call.
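
To make the performance point concrete: assignment is a pure function of the experiment config and the unit id, so you can load the definitions once at startup and evaluate them in-process on every request, with no network hop. The names below are hypothetical stand-ins (planout4j’s actual classes differ), and the sketch reuses the `Assigner` from earlier in the thread:

```java
import java.util.Map;

// Sketch: load experiment definitions once at startup, then assign in-process
// on every request. "Experiment" and "InProcessAssignment" are hypothetical
// stand-ins (planout4j's actual classes differ); the point is that nothing on
// the request path makes a network call. Reuses Assigner from the earlier sketch.
public final class InProcessAssignment {

    record Experiment(String salt, String[] variants, double[] weights) {}

    private final Map<String, Experiment> definitions;

    public InProcessAssignment(Map<String, Experiment> definitions) {
        this.definitions = definitions; // loaded once, e.g. from config at startup
    }

    /** Pure in-memory lookup plus a hash - cheap enough to run per request. */
    public String variantFor(String experimentName, String unitId) {
        Experiment exp = definitions.get(experimentName);
        return Assigner.assign(exp.salt(), unitId, exp.variants(), exp.weights());
    }
}
```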

When you’re splitting traffic by something other than the user, do you do that in the split logic, or do you manage to infer it somehow when analysing in the DW? I guess you would normally need to define the experiment with a different split rule.

Really great write-up @evaldas, thanks for sharing!


Splitting traffic at anything other than the user level really needs to be done at the point of assignment.

True, you can always assign at the user level and analyse your data at the session level, but then you’ve added session counts as a confounding factor - heavy users contribute more sessions, so your session-level observations are no longer independent of the user-level assignment.

Thanks for the tip about Planout - looks awesome! Is this what you use for app testing or server-side testing?

Right, I guess it depends on the test. User-level splitting is so predominant in experimentation that I’d actually never even thought about alternatives.

No problem - yes, I have been using it for running tests server-side. One useful thing is that it supports its own scripting language, which can be used to update tests at runtime without code changes. I have made some updates to make it work with Snowplow, but haven’t had the time to open-source them as a module project; to be frank, it’s fairly easy to extend anyway.