Data modeling is an essential step in the Snowplow data pipeline. We find that those companies that are most successful at using Snowplow data are those that actively develop their event data models: progressively pushing more and more Snowplow data throughout their organizations so that marketers, product managers, merchandising and editorial teams can use the data to inform and drive decision making.
What is event data modeling?
Event data modeling is the process of using business logic to aggregate over event-level data to produce ‘modeled’ data that is simpler to query.
Let’s pick out the different elements packed into the above definition:
a. Business Logic
The event stream that Snowplow delivers is an unopinionated data set. When we record a page view event, for example, we aim to record it as faithfully as possible:
- What was the URL that was viewed?
- What was the title?
- Is there any metadata about the page contents that we can capture?
- What was the cookie ID of the user that loaded the web page?
- On what browser was the user?
- On what device?
- On what operating system?
All of the above data points would be recorded with the event. None of them are contentious: nothing that happens in the future will change the values we assign to those dimensions. For clarity, we call this data ‘atomic’ data. It is event-level and it is unopinionated.
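To make this concrete, the sketch below represents a single atomic page view event as an immutable record. The field names are illustrative only, not the exact Snowplow atomic schema; the point is that every value is a plain fact captured at collection time, with no interpretation baked in.

```python
from dataclasses import dataclass
from datetime import datetime

# frozen=True because atomic events are immutable once recorded
@dataclass(frozen=True)
class PageViewEvent:
    event_id: str
    collector_timestamp: datetime
    page_url: str
    page_title: str
    user_cookie_id: str
    browser: str
    device: str
    operating_system: str

# A single atomic event: every field is a plain observation, nothing inferred
event = PageViewEvent(
    event_id="e-0001",
    collector_timestamp=datetime(2016, 3, 1, 9, 30, 12),
    page_url="https://www.example.com/products/red-shoes",
    page_title="Red shoes",
    user_cookie_id="abc-123",
    browser="Chrome",
    device="Desktop",
    operating_system="macOS",
)
```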
When we do event data modeling, we use business logic to add meaning to the atomic data. We might look at the data and decide that the page view recorded above was the first page in a new session, or the first step in a purchase funnel. We might infer from the recorded cookie ID who the actual user is. We might look at the data point in the context of other data points recorded with the same cookie ID, and infer an intention on the part of the user (e.g. that she was searching for a particular product) or infer something more general about the user (e.g. that she has an interest in French literature).
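Sessionization is a good example of this kind of business logic. The sketch below assigns events to sessions using a 30-minute inactivity rule; both the rule and the threshold are assumptions, and they are exactly the kind of opinion that can change as our understanding of the business evolves.

```python
from datetime import timedelta

def assign_sessions(events, timeout=timedelta(minutes=30)):
    """events: page view records (like the sketch above) sorted by collector_timestamp."""
    session_ids = {}   # event_id -> session label for the modeled data
    state = {}         # cookie ID -> (timestamp of previous event, session number)
    for e in events:
        last_ts, session_no = state.get(e.user_cookie_id, (None, 0))
        if last_ts is None or e.collector_timestamp - last_ts > timeout:
            session_no += 1   # gap longer than the timeout: a new session starts
        state[e.user_cookie_id] = (e.collector_timestamp, session_no)
        session_ids[e.event_id] = f"{e.user_cookie_id}-session-{session_no}"
    return session_ids
```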
These inferences are made based on an understanding of the business and product. That understanding is something that continually evolves. As we change our business logic, we change our data models. That means that the modeled data that is produced by the process of event data modeling is mutable: it is always possible that some new data coming in, or an update to our business logic, will change the way we understand a particular event that occurred in the past. This is in stark contrast to the event stream that is the input of the data modeling process: this is an immutable record of what has happened. The immutable record will grow over time as we record new events, but the events that have already been recorded will not change, because the different data points that are captured with each event are not contentious. It is only the way that we interpret them that might change, and this will only impact the modeled data, not the atomic data.
We therefore have two different data sets, both of which represent “what has happened”:
- Atomic data: unopinionated, immutable
- Modeled data: opinionated, mutable
b. Aggregations
When we’re doing event data modeling, we’re typically aggregating over our event-level data. Whereas each line of event-level data represents a single event, each line of modeled data represents a higher order entity, e.g. a workflow or a session, that is itself composed of a sequence of events. We’ll give concrete examples of these higher order entities below.
Note that this is not always the case: sometimes we may want our modeled data to be event level. In that case the modeled data will look like the atomic data, but with additional fields that describe what we have inferred.
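As a rough illustration, the sketch below (building on the hypothetical event records and session labels above) rolls event-level rows up into one row per session, so that each modeled row describes a higher order entity rather than a single event.

```python
from collections import defaultdict

def sessions_table(events, session_ids):
    """Roll event-level rows up into one modeled row per session."""
    by_session = defaultdict(list)
    for e in events:
        by_session[session_ids[e.event_id]].append(e)

    rows = []
    for session_id, evs in by_session.items():
        evs.sort(key=lambda ev: ev.collector_timestamp)
        rows.append({
            "session_id": session_id,
            "user_cookie_id": evs[0].user_cookie_id,
            "session_start": evs[0].collector_timestamp,
            "session_end": evs[-1].collector_timestamp,
            "page_views": len(evs),           # a measure
            "landing_page": evs[0].page_url,  # a dimension
            "exit_page": evs[-1].page_url,
        })
    return rows
```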
c. Simple to query
The point of data modeling is to produce a data set that is easy for different data consumers to work with. Typically, this data will be socialized across the business using a business intelligence tool. That puts particular requirements on the structure of the modeled data: namely, that it is in a format suitable for slicing and dicing different dimensions and measures against one another. In general, atomic data is not suitable for ingesting directly into a business intelligence tool. (This is only possible where those tools support doing the data modeling internally.)
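For instance, with modeled session rows like the hypothetical ones above, the kind of slice a BI tool runs becomes trivial: a measure (sessions) broken down by a dimension (landing page), with no need to touch the atomic events.

```python
from collections import Counter

def sessions_by_landing_page(rows):
    """Count sessions per landing page over the modeled session rows."""
    return Counter(row["landing_page"] for row in rows)
```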