Over at Snowplow HQ, the marketing team has been reflecting on the topic of data quality and we want to hear what you think.
We’re curious to find out how data-informed companies like yours think about data quality, how you’ve built trust around data, and your biggest data quality challenges.
If you’re open to having a 25-minute chat with Franciska and Lyuba from our marketing team to offer your insights, then please reply to this thread or send a message to marketing@snowplowanalytics.com.
As a thank you, we’ll send you a Snowplow t-shirt!
We deal with this problem every day. Here are some insights I’ve gained over the last two years of dealing with data quality as a data engineer.
Org Structure
- Conway’s law applies: to fix data quality, fix the org structure first. Instead of creating silos around data-producers, data-engineers and data-consumers, bring them together into teams split around business functionality.
- Make data-producers the owners of data and enforce schema rules. Data producers are rarely incentivized to publish good quality data and the problem is often pushed down to BI teams and data consumers.
- The focus of data quality management should shift from “fixing” issues to making them transparent to everyone. Invest in tooling for defining, generating and publishing metadata for each dataset. Sadly, tooling in this space is sorely lacking.
- Data quality issues are best solved/managed in a tripartite setup where data-producers, data-engineers and data-consumers participate together.
- Poor data quality is highly demotivating for data engineering teams. Consumers hold them responsible for all data quality issues, and data producers often don’t care enough. Change this structure and make data quality a collective concern.
ETL/Data-pipelines
- ETL logic must never compensate for poor data quality. This, IMO, is a huge anti-pattern that results in very complex ETL logic.
- If data-prep/data-cleansing is an unavoidable step in the ETL process, it must be applied using a decorator pattern instead of intertwining it with core ETL logic (see the sketch after this list) - even if this means slightly less efficient ETL processes.
- Add audit columns to datasets (e.g. updated_at, qa_at).
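To make the decorator idea concrete, here is a minimal sketch assuming a pandas-based pipeline. The step names (drop_malformed_rows, aggregate_revenue) and the cleansing rule are hypothetical placeholders, not a reference to any particular pipeline; the point is only that cleansing wraps the core step rather than being woven into it.

```python
# Minimal sketch: cleansing as a decorator around a core ETL step.
# Names and rules below are illustrative assumptions.
import functools
import pandas as pd

def with_cleansing(cleanser):
    """Wrap an ETL step so data-prep runs around it, not inside it."""
    def decorator(etl_step):
        @functools.wraps(etl_step)
        def wrapper(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
            return etl_step(cleanser(df), *args, **kwargs)
        return wrapper
    return decorator

def drop_malformed_rows(df: pd.DataFrame) -> pd.DataFrame:
    # Cleansing lives in its own function; the core ETL never sees bad rows.
    return df.dropna(subset=["order_id", "amount"])

@with_cleansing(drop_malformed_rows)
def aggregate_revenue(df: pd.DataFrame) -> pd.DataFrame:
    # Core ETL logic stays free of data-quality compensation.
    return df.groupby("order_id", as_index=False)["amount"].sum()
```

The core step can still be tested and reasoned about on clean input, and the cleansing rule can be swapped or removed without touching the ETL logic.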
Data
- The definition of data quality varies from one consumer to another. It’s difficult to agree on a common set of invariants.
- Datasets are interconnected. An issue in one dataset propagates to all dependent datasets. It is important to be able to investigate and visualize this impact.
- Fixing data historically is orders of magnitude more complex and is best avoided.
- Deduplication is an often-requested feature. Write ETLs that avoid “technical” duplicates (idempotency, immutability), but avoid deduplicating “logical” duplicates. Instead, add surrogate keys over the logically-unique columns (see the first sketch after this list). These are cheaper by orders of magnitude.
- Invest in an anomaly detection tool that can flag spikes and troughs in your datasets (see the second sketch below). These are not easily discoverable otherwise.
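On surrogate keys: a hedged sketch of what I mean, assuming a pandas DataFrame of events. The columns (user_id, event_time, event_name) and the SHA-1 choice are illustrative assumptions; the idea is that a deterministic key over the logically-unique columns lets consumers group or filter without any rows being dropped.

```python
# Sketch: deterministic surrogate key over logically-unique columns,
# instead of deduplicating "logical" duplicates. Column names are assumptions.
import hashlib
import pandas as pd

def add_surrogate_key(df: pd.DataFrame, unique_cols: list[str],
                      key_col: str = "event_sk") -> pd.DataFrame:
    def make_key(row) -> str:
        raw = "|".join(str(row[c]) for c in unique_cols)
        return hashlib.sha1(raw.encode("utf-8")).hexdigest()
    out = df.copy()
    out[key_col] = out.apply(make_key, axis=1)
    return out

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": ["2020-01-01T10:00", "2020-01-01T10:00", "2020-01-01T11:00"],
    "event_name": ["page_view", "page_view", "purchase"],
})
# Downstream consumers can group or filter on event_sk; no rows are removed.
events = add_surrogate_key(events, ["user_id", "event_time", "event_name"])
```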
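And on anomaly detection: even without a dedicated tool, a rolling z-score over a daily metric catches most obvious spikes and troughs. A minimal sketch below; the window size and threshold are arbitrary assumptions, not recommendations.

```python
# Sketch: flag spikes/troughs in a daily metric with a rolling z-score.
# Window and threshold values are illustrative assumptions.
import pandas as pd

def flag_anomalies(daily: pd.Series, window: int = 14,
                   z_threshold: float = 3.0) -> pd.Series:
    rolling = daily.rolling(window, min_periods=window)
    mean = rolling.mean().shift(1)   # exclude today from its own baseline
    std = rolling.std().shift(1)
    z = (daily - mean) / std
    return z.abs() > z_threshold     # True where the metric spikes or drops

# Usage: daily_events is a Series indexed by date, e.g. from a warehouse export.
# anomalies = flag_anomalies(daily_events)
# print(daily_events[anomalies])
```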
Thank you @rahulj51 - these are great insights and really helpful for us!
Your insights bring a lot of follow-up questions to mind - would you be willing to get on a call with us to discuss a bit further?
That’s great, thank you! Would you mind sending an email to marketing@snowplowanalytics.com with your time zone and I’ll send over a few time slots for next week.
Thanks again!
At the companies I’ve worked for that use Snowplow data, the main concerns were the daily availability of product metrics (traffic, conversions, revenue) and discrepancies with other data sources (e.g. Google Analytics).
Daily metrics were subject to small but frequent errors from our modelling pipelines and from frontend tracking changes. We built simple dashboards in Redash or Grafana with basic thresholds and a reactive alerting system, checked every morning by the BI team.
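For illustration, the checks behind such dashboards can be as simple as the sketch below. The metric names and bounds are made-up assumptions; in practice the values would come from a warehouse query and the alerts would go to Slack or email rather than stdout.

```python
# Sketch: basic daily threshold checks of the kind a BI team reviews each morning.
# Metric names and bounds are illustrative assumptions.
DAILY_BOUNDS = {
    "sessions": (50_000, 500_000),
    "conversions": (500, 20_000),
}

def check_daily_metrics(metrics: dict[str, float]) -> list[str]:
    """Return human-readable alerts for metrics outside their expected range."""
    alerts = []
    for name, value in metrics.items():
        low, high = DAILY_BOUNDS[name]
        if not (low <= value <= high):
            alerts.append(f"{name}={value} outside expected range [{low}, {high}]")
    return alerts

if __name__ == "__main__":
    todays = {"sessions": 12_000, "conversions": 3_400}
    for alert in check_daily_metrics(todays):
        print("ALERT:", alert)
```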
As for discrepancies with Google Analytics, a good first step is to build a Google Spreadsheet (or similar report) to compare basic metrics (pages, sessions, users, conversions) across several dimensions (country, device, product). This makes it possible to spot tracking changes early enough to correct them before they impact bigger reports. Keep in mind that tracking is defined differently across the two platforms (see the dedicated article).
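The same comparison can be scripted once the spreadsheet gets tedious. A sketch, assuming two exports with matching dimension columns; the column names and the 5% tolerance are assumptions for illustration.

```python
# Sketch: compare a metric across two sources (e.g. Snowplow vs GA exports)
# by dimension and surface only the rows that diverge beyond a tolerance.
import pandas as pd

def compare_sources(snowplow: pd.DataFrame, ga: pd.DataFrame,
                    dims: list[str], metric: str,
                    tolerance: float = 0.05) -> pd.DataFrame:
    merged = snowplow.merge(ga, on=dims, suffixes=("_sp", "_ga"))
    merged["pct_diff"] = (merged[f"{metric}_sp"] - merged[f"{metric}_ga"]) / merged[f"{metric}_ga"]
    # Only rows whose relative gap exceeds the tolerance need a human look.
    return merged[merged["pct_diff"].abs() > tolerance]

# e.g. compare_sources(sp_sessions, ga_sessions, ["country", "device"], "sessions")
```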
For analysing/monitoring data quality in depth, time-series anomaly detection is frequently implemented. However, its maintenance cost tends to be too high for daily use.
In data-feedback systems (recommendation engines, etc.), basic daily input-volume checks and user-engagement metrics are set up to prevent absurd results (or overfitting).
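A minimal version of such a gate, for illustration only; the 50% ratio and function names are hypothetical.

```python
# Sketch: skip a model refresh when input volume collapses, to avoid absurd output.
def safe_to_retrain(todays_events: int, recent_daily_avg: float,
                    min_ratio: float = 0.5) -> bool:
    """Refresh only if today's input volume is a sane fraction of the recent average."""
    return todays_events >= min_ratio * recent_daily_avg

# if not safe_to_retrain(n_events, avg_events):
#     keep yesterday's model and raise an alert instead of retraining
```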