In this post I’ll outline three possible approaches to modelling Snowplow data in BigQuery, along with the pros and cons associated with each approach. This guide is by no means comprehensive, so feel free to jump in with corrections, recommendations or queries.
Considerations
- Ease of use
- Performance
- Pricing
- Features
Ease of use
The data that is inserted into BigQuery should be easy to model and reason about. This means avoiding having to write complex syntax or user-defined functions to perform common tasks in BigQuery.
Performance
Data should be modelled in a way that takes advantage of how BigQuery (and its underlying filesystem, Colossus) distributes and partitions data, as well as how the query engine (Dremel) executes queries.
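As a rough sketch of what this means in practice, a day-partitioned events table keeps the bytes scanned by date-bounded queries down. The table and column names below are assumptions for illustration, not Snowplow's actual loader output:

```sql
-- A minimal sketch: partition the events table on the collector timestamp so
-- that date-bounded queries only scan the relevant partitions.
CREATE TABLE `my_project.snowplow.events` (
  event_id STRING,
  event_name STRING,
  collector_tstamp TIMESTAMP
)
PARTITION BY DATE(collector_tstamp);

-- Filtering on the partitioning column lets Dremel prune partitions, so fewer
-- bytes are processed (and billed).
SELECT event_name, COUNT(*) AS n
FROM `my_project.snowplow.events`
WHERE DATE(collector_tstamp) = DATE '2018-01-01'
GROUP BY event_name;
```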
Pricing
BigQuery offers a fundamentally different pricing model: the bill is a sum of the total size of uncompressed data stored in BigQuery and the bytes processed (uncompressed) by queries each month. Additional costs may be incurred for streaming workloads. Although a flat-rate pricing model exists for BigQuery, it is unlikely to be suitable for a large number of Snowplow users (flat-rate pricing starts from $40k/month).
Features
While BigQuery offers some familiar database concepts, such as table partitioning and decoupled compute and storage, other features, such as nested and repeated structures, are less common. Where possible, Snowplow data should exploit the features unique to BigQuery to improve flexibility, performance and cost.
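As an illustration of the nested/repeated side of this, a context can be stored as an ARRAY of STRUCTs directly on the events table. The column and field names below are hypothetical:

```sql
-- Hypothetical: a repeated RECORD column holding one context type per event,
-- alongside the flat atomic fields.
CREATE TABLE `my_project.snowplow.events_nested` (
  event_id STRING,
  collector_tstamp TIMESTAMP,
  contexts_com_acme_product_1 ARRAY<STRUCT<sku STRING, price FLOAT64>>
)
PARTITION BY DATE(collector_tstamp);
```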
Approaches
1. “Naive” approach
2. Shredded table approach
3. Shredded column approach
1. “Naive” approach
This approach requires the least technical effort to implement when sinking data into BigQuery; however, it moves that cost downstream to analysis time.
The idea of this approach is to convert the TSV+JSON mixed format into a JSON document which is inserted directly into BigQuery. This does not entail any shredding, so the `contexts`, `derived_contexts` and `unstruct_event` fields will contain nested data.
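To give a flavour of what analysis looks like under this model, here is a sketch of pulling a single field out of an attached context. It assumes `contexts` was loaded as a RECORD whose `data` field is a repeated RECORD of (`schema` STRING, `data` STRING), with the inner payload kept as a JSON string; all of these names are assumptions:

```sql
-- Unnest the repeated context wrapper, filter to the context of interest by
-- its Iglu schema URI, then parse the inner JSON payload field by field.
SELECT
  e.event_id,
  JSON_EXTRACT_SCALAR(ctx.data, '$.userType') AS user_type
FROM `my_project.snowplow.events` AS e,
  UNNEST(e.contexts.data) AS ctx
WHERE ctx.schema LIKE 'iglu:com.acme/user_context/jsonschema/1-%';
```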
How are schema changes handled?
Any new schemas, or modifications to existing schemas, would require patching the existing columns.
Pros:
- This approach involves little implementation effort, as data could be inserted from PubSub with minimal transformation
- All data is contained within a single table reducing the requirement for joins between tables
Cons:
- Deeply nested data (particularly arrays) is difficult to query concisely
- Columns containing nested data require reasonably complex predefined schemas to account for the possible self-describing JSON events that may land in these columns
- Selecting a small amount of data from a larger JSON object is disproportionately expensive, as BigQuery pricing is based on bytes processed
2. Shredded table approach
The shredded table approach closely mimics the current functionality when sinking Snowplow data into Redshift. One wide table (`events`) is used, along with a number of additional tables, one per context schema per model version. Joins occur on a timestamp (`collector_tstamp`/`root_tstamp`) and an event identifier (`event_id`/`root_id`).
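A typical query under this model might look like the following sketch, mirroring the Redshift naming conventions (table and field names are assumptions):

```sql
-- Join the wide events table to a shredded context table on the event
-- identifier and the collector timestamp.
SELECT
  e.event_id,
  e.event_name,
  ctx.user_type
FROM `my_project.snowplow.events` AS e
JOIN `my_project.snowplow.com_acme_user_context_1` AS ctx
  ON ctx.root_id = e.event_id
 AND ctx.root_tstamp = e.collector_tstamp;
```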
How are schema changes handled?
Schema changes would be largely identical to the current process with Redshift.
- New schemas require creating a new table
- Additions require adding columns to the shredded table
- Revisions require patching existing columns in the shredded table (or possibly adding new columns)
- Models require the creation of a new model table (e.g., `_1`, `_2`)
Pros:
- This approach would maintain close query compatibility to the current Redshift shredding model
- Queries and workloads that currently run on Redshift could be made to run on BigQuery with minimal intervention
- There is only a marginal increase in bytes processed and storage, as the join keys (`collector_tstamp` and `event_id`) are stored in each shredded table
Cons:
- More complicated queries that join shredded tables may require additional syntax if the shredded tables are also partitioned by day (see the sketch after this list)
- Joins are likely to be less performant than colocating the shredded data in the same table (note: this requires benchmarking)
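For example, if both the events table and the shredded tables are partitioned by day, each side of the join will typically need its own partition filter to avoid scanning whole tables (names and dates below are purely illustrative):

```sql
-- Both sides of the join carry a partition filter so that partitions can be
-- pruned on each table independently.
SELECT
  e.event_id,
  ctx.user_type
FROM `my_project.snowplow.events` AS e
JOIN `my_project.snowplow.com_acme_user_context_1` AS ctx
  ON ctx.root_id = e.event_id
 AND ctx.root_tstamp = e.collector_tstamp
WHERE DATE(e.collector_tstamp) = DATE '2018-01-01'
  AND DATE(ctx.root_tstamp) = DATE '2018-01-01';
```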
3. Shredded column approach
The shredded column approach retains a single table (`events`) that contains one column per self-describing JSON per model version. If a schema had two model versions, approach 2 would involve two tables, `example_1` and `example_2`; under this approach there would instead be two columns, `example_1` and `example_2`, in the events table.
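Querying a context under this model is then a matter of unnesting the relevant column on the events table, with no join; the column and field names below are assumptions:

```sql
-- The context lives in a repeated RECORD column on the events table, so it
-- can be unnested in place rather than joined in from another table.
SELECT
  e.event_id,
  ex.user_type
FROM `my_project.snowplow.events` AS e,
  UNNEST(e.example_1) AS ex
WHERE DATE(e.collector_tstamp) = DATE '2018-01-01';
```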
How are schema changes handled?
- New schemas require patching the existing table by adding a new column (see the sketch after this list)
- Additions require patching an existing column
- Revisions require patching an existing column
- Models require patching the table to create an additional new column per model version (e.g., `_1`, `_2`)
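A sketch of what such a patch could look like using BigQuery DDL is below; the same change could also be made through the `tables.patch` API, and the column and field names are assumptions:

```sql
-- Hypothetical: a new model version of a context becomes a new repeated
-- RECORD column on the existing events table.
ALTER TABLE `my_project.snowplow.events`
ADD COLUMN example_2 ARRAY<STRUCT<user_type STRING, segment STRING>>;
```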
Pros:
- Data is colocated in a single table for speed of access
- No duplication of data
- No joins between tables required
Cons:
- In certain queries, more bytes may be processed than in approach 2, as the data is nested