Calling all Snowplow OS users: What does “data quality” mean to you?

At the companies I've worked for that use Snowplow data, the main concerns were the daily availability of product metrics (traffic, conversions, revenue) and discrepancies with other data sources (e.g. Google Analytics).

Daily metrics were subject to small but frequent errors caused by our modelling pipelines and by frontend tracking changes. We built simple dashboards with Redash or Grafana, with basic thresholds and a reactive alerting system, checked every morning by the BI team.
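To make that concrete, here is a minimal sketch of the kind of threshold check such a dashboard alert encodes. The metric names, baseline window, and threshold are hypothetical, not from our actual setup:

```python
# Minimal sketch of a basic threshold alert: flag a metric that dropped
# too far below its trailing average. All figures are illustrative.

def check_daily_metric(metric_name, today_value, history, max_drop_pct=30):
    """Alert if today's value dropped more than max_drop_pct vs the trailing average."""
    baseline = sum(history) / len(history)
    drop_pct = (baseline - today_value) / baseline * 100
    if drop_pct > max_drop_pct:
        return f"ALERT: {metric_name} down {drop_pct:.1f}% vs 7-day average"
    return None

# Example: sessions dropped from a ~10k baseline to 6k
alert = check_daily_metric("sessions", 6000, [9800, 10200, 10100, 9900, 10000, 10050, 9950])
if alert:
    print(alert)  # in a real setup this would be routed to Slack/email
```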

As for discrepancies with Google Analytics, a good first step is to build a Google Spreadsheet (or similar report) comparing basic metrics (pages, sessions, users, conversions) across several dimensions (country, device, product). This makes it possible to spot tracking changes early enough to correct them before they impact bigger reports. Keep in mind that these metrics are defined differently across the two platforms (see the dedicated article).
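Below is a rough sketch of the comparison logic such a spreadsheet implements: per-dimension gaps between Snowplow and Google Analytics counts, with a flag when the gap exceeds a tolerance. The sample figures and the 10% threshold are made up for illustration:

```python
# Compare Snowplow vs GA counts per (country, device) dimension and
# flag any gap above a tolerance. Sample data is purely illustrative.

snowplow = {("FR", "mobile"): 12450, ("FR", "desktop"): 8300, ("DE", "mobile"): 9100}
ga       = {("FR", "mobile"): 12900, ("FR", "desktop"): 8250, ("DE", "mobile"): 7400}

THRESHOLD_PCT = 10  # flag dimensions where the gap exceeds 10%

for dim in sorted(set(snowplow) | set(ga)):
    sp, g = snowplow.get(dim, 0), ga.get(dim, 0)
    gap_pct = abs(sp - g) / max(g, 1) * 100
    flag = "  <-- investigate" if gap_pct > THRESHOLD_PCT else ""
    print(f"{dim}: snowplow={sp} ga={g} gap={gap_pct:.1f}%{flag}")
```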

For analysing and monitoring data quality in depth, time series anomaly detection is frequently implemented. However, the maintenance cost of these systems often turns out to be too high for daily use.
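For reference, the simplest flavour of this is a rolling z-score check: flag a day whose value falls more than N standard deviations from a trailing window. The window size and threshold below are illustrative choices, not recommendations:

```python
# Rolling z-score anomaly detection on a daily series.
import statistics

def detect_anomalies(series, window=7, z_threshold=3.0):
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mean = statistics.mean(baseline)
        stdev = statistics.stdev(baseline)
        if stdev > 0 and abs(series[i] - mean) / stdev > z_threshold:
            anomalies.append(i)
    return anomalies

daily_pageviews = [1000, 1020, 980, 1010, 990, 1005, 995, 1015, 400, 1000]
print(detect_anomalies(daily_pageviews))  # [8] -> the sudden drop to 400
```

Even this simple version shows where the maintenance cost comes from: every new seasonality pattern or tracking change forces you to retune the window and threshold per metric.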
In data-feedback systems (recommendation engines, etc.), basic daily checks on input volumes and user engagement metrics are set up to prevent absurd results (or overfitting on bad data).
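A sketch of what such an input volume guard can look like, assuming yesterday's event count is compared against an accepted range before retraining. The bounds and counts are placeholder values:

```python
# Basic input volume sanity check for a feedback system: skip the model
# update when the input volume looks implausible. Bounds are illustrative.

def volume_ok(event_count, min_expected=50_000, max_expected=500_000):
    """Return True only when the input volume falls in the accepted range."""
    return min_expected <= event_count <= max_expected

yesterday_events = 1_200  # e.g. a tracker outage left almost no events
if not volume_ok(yesterday_events):
    print("Input volume out of range: skipping today's model update")
```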