The SQL Query Enrichment lets you perform dimension widening on a Snowplow event via a SQL query against a JDBC-capable relational database.
This is an introductory tutorial to help you get started with this enrichment. Our example use case is deliberately simple, but it should be enough to showcase how the enrichment works and give you an idea of how you could use it in real-world scenarios.
Business goal
Let’s assume that we have a web form containing an email address field. Updates to the form, including form submissions, all result in an event being sent to Snowplow.
We would like to enrich these events with some details related to the owner of the email address; you could say that we want to dimension widen the event using the email address as a key. We are in luck - our company operates an in-house CRM system which stores data in its own MySQL database.
Let’s start by understanding this data we want to join.
Part 1: Understanding our data for joining
Our company’s database has a table called users, which contains records for almost all relevant email addresses. This is the query that was used to create the users table in MySQL:
CREATE TABLE users (
  id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  firstname VARCHAR(30) NOT NULL,
  lastname VARCHAR(30) NOT NULL,
  email VARCHAR(50) UNIQUE,
  reg_date TIMESTAMP
);
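If you would like to follow along with a fresh database, a minimal sketch for populating the sample record used throughout this tutorial might look like this (the values match the row we retrieve below):
-- Illustrative only: insert the sample user referenced throughout this tutorial
INSERT INTO users (firstname, lastname, email, reg_date)
VALUES ('Ivan', 'Pupkin', 'pupkin@mail.ru', '2003-05-12 04:34:11');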
The SQL Query Enrichment is quite powerful - it can support retrieving data from a join across multiple tables. In our case, however, all the data we are after is contained within this single table - here’s a row:
SELECT firstname, lastname, email, reg_date
FROM users
WHERE email = 'pupkin@mail.ru';
Returning:
+-----------+----------+----------------+---------------------+
| firstname | lastname | email | reg_date |
+-----------+----------+----------------+---------------------+
| Ivan | Pupkin | pupkin@mail.ru | 2003-05-12 04:34:11 |
+-----------+----------+----------------+---------------------+
1 row in set (0.00 sec)
So, if our web form provides us with the user’s email address, we can use this email address as the key (email) to retrieve data from this MySQL users table, and attach this data to our events.
Part 2: Generating our source events
In an ideal world, perhaps our form submissions would fire a custom self-describing (unstructured) event, containing the email address in a dedicated field.
To keep things simple for this tutorial, however, let’s take advantage of existing Snowplow JavaScript Tracker functionality, specifically Form Tracking.
Unfortunately, we can’t use Form Tracking’s submit_form event, because the data from all of the form’s fields is populated into a single value field. This is not something we can work with yet.
Instead, we’ll use the change_form event, which is fired for every changed form field and populates the value of the changed field in its value field, per the JSON schema for this event.
Note that if we had lots of fields in our form, we could trigger many fruitless lookups of non-email fields in our users table; for the purposes of this tutorial, let’s assume our form consists only of our email field.
We enable the form tracking on the web page like so:
<script type="text/javascript">
  window.snowplow('newTracker', 'clj', 'pweb-clj.us-west-2.elasticbeanstalk.com', { // Initialise a tracker
    appId: 'tom-tk'
  });
  window.snowplow('enableFormTracking');
</script>
This enables both submit_form and change_form event tracking. After processing, the change_form event will end up in its own dedicated table in Redshift, called com_snowplowanalytics_snowplow_change_form_1.
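Once events start flowing, one quick way to sanity-check what the tracker is capturing is to peek at that table directly. The query below is a sketch - it assumes the standard column names (element_id, value) generated from the change_form 1-0-0 schema:
-- Illustrative sanity check: the most recently captured form-field changes
SELECT root_tstamp, element_id, value
FROM atomic.com_snowplowanalytics_snowplow_change_form_1
ORDER BY root_tstamp DESC
LIMIT 5;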
Great, we now know the data that we are looking up, and where our source events are coming from; next let’s put together a schema to store the data we are going to look up.
Part 3: Preparing our schema
Snowplow will pass change_form events to the SQL Query Enrichment to look up the email address in each event against our users table in MySQL. If a match is found, the matching record from MySQL will be converted into JSON and added as an additional context to the derived_contexts field of the enriched event.
For this to go smoothly, we need to define the JSON Schema for our MySQL record. Basing this closely on the underlying users MySQL table, here is our JSON Schema:
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for a Properweb user",
  "self": {
    "vendor": "ca.properweb",
    "name": "user",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "firstname": {
      "type": "string"
    },
    "lastname": {
      "type": "string"
    },
    "email": {
      "type": "string"
    },
    "reg_date": {
      "type": "string",
      "format": "date-time"
    }
  },
  "minProperties": 1,
  "required": ["email"],
  "additionalProperties": false
}
In our Iglu schema registry for Snowplow, this would be stored as the file:
schemas/ca.properweb/user/jsonschema/1-0-0
Now, we are going to use igluctl to generate the corresponding Redshift DDL and JSON Paths files to allow us to load these contexts into Redshift.
We run the following command:
$ igluctl static generate --with-json-paths schemas/
On a successful run, the output looks like this:
File [/Users/ihor/workshop/tutorial/sql-enrichment/./sql/ca.properweb/user_1.sql] was written successfully!
File [/Users/ihor/workshop/tutorial/sql-enrichment/./jsonpaths/ca.properweb/user_1.json] was written successfully!
Let’s take a look at the generated files. First, here is the Redshift DDL to create the table for our looked-up users:
$ cat sql/ca.properweb/user_1.sql
-- AUTO-GENERATED BY igluctl DO NOT EDIT
-- Generator: igluctl 0.1.0
-- Generated: 2016-07-06 15:28
CREATE SCHEMA IF NOT EXISTS atomic;
CREATE TABLE IF NOT EXISTS atomic.ca_properweb_user_1 (
"schema_vendor" VARCHAR(128) ENCODE RUNLENGTH NOT NULL,
"schema_name" VARCHAR(128) ENCODE RUNLENGTH NOT NULL,
"schema_format" VARCHAR(128) ENCODE RUNLENGTH NOT NULL,
"schema_version" VARCHAR(128) ENCODE RUNLENGTH NOT NULL,
"root_id" CHAR(36) ENCODE RAW NOT NULL,
"root_tstamp" TIMESTAMP ENCODE LZO NOT NULL,
"ref_root" VARCHAR(255) ENCODE RUNLENGTH NOT NULL,
"ref_tree" VARCHAR(1500) ENCODE RUNLENGTH NOT NULL,
"ref_parent" VARCHAR(255) ENCODE RUNLENGTH NOT NULL,
"email" VARCHAR(4096) ENCODE LZO NOT NULL,
"firstname" VARCHAR(4096) ENCODE LZO,
"lastname" VARCHAR(4096) ENCODE LZO,
"reg_date" TIMESTAMP ENCODE LZO,
FOREIGN KEY (root_id) REFERENCES atomic.events(event_id)
)
DISTSTYLE KEY
DISTKEY (root_id)
SORTKEY (root_tstamp);
COMMENT ON TABLE atomic.ca_properweb_user_1 IS 'iglu:ca.properweb/user/jsonschema/1-0-0';
Next, the JSON Paths file which we’ll use to load records into the above table:
$ cat jsonpaths/ca.properweb/user_1.json
{
  "jsonpaths": [
    "$.schema.vendor",
    "$.schema.name",
    "$.schema.format",
    "$.schema.version",
    "$.hierarchy.rootId",
    "$.hierarchy.rootTstamp",
    "$.hierarchy.refRoot",
    "$.hierarchy.refTree",
    "$.hierarchy.refParent",
    "$.data.email",
    "$.data.firstname",
    "$.data.lastname",
    "$.data.reg_date"
  ]
}
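In a standard Snowplow batch pipeline the StorageLoader issues the Redshift COPY for you, but to illustrate how the JSON Paths file ties the shredded data to the table, a hand-rolled load might look roughly like this sketch (the S3 paths and IAM role are hypothetical):
-- Hypothetical manual load; bucket paths and IAM role are placeholders
COPY atomic.ca_properweb_user_1
FROM 's3://my-shredded-bucket/good/run=2016-07-06-15-28-00/ca.properweb/user_1'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
JSON 's3://my-jsonpaths-bucket/jsonpaths/ca.properweb/user_1.json'
REGION 'us-west-2';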
Okay great - we have prepared all the artefacts needed to join records from our MySQL users table into our event stream.
We’re now ready to write the JSON configuration for our SQL Query Enrichment.
Part 4: Configuring our enrichment
Let’s call the configuration file sql_query_enrichment_config.json, and make sure it is available in the folder of enrichments passed into Snowplow.
The wiki documentation for the SQL Query Enrichment gives a good overview of the various fields required - I’ll jump straight to the full configuration for this tutorial and then explain it below:
{
  "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/sql_query_enrichment_config/jsonschema/1-0-0",
  "data": {
    "name": "sql_query_enrichment_config",
    "vendor": "com.snowplowanalytics.snowplow.enrichments",
    "enabled": true,
    "parameters": {
      "inputs": [
        {
          "placeholder": 1,
          "json": {
            "field": "unstruct_event",
            "schemaCriterion": "iglu:com.snowplowanalytics.snowplow/change_form/jsonschema/1-0-0",
            "jsonPath": "$.value"
          }
        }
      ],
      "database": {
        "mysql": {
          "host": "sqlenh.clk45ymgndgz.us-west-2.rds.amazonaws.com",
          "port": 3306,
          "sslMode": false,
          "username": "ihor",
          "password": "cherkas1",
          "database": "sqlenh"
        }
      },
      "query": {
        "sql": "SELECT firstname, lastname, email, reg_date FROM users WHERE email = ? LIMIT 1"
      },
      "output": {
        "expectedRows": "AT_MOST_ONE",
        "json": {
          "schema": "iglu:ca.properweb/user/jsonschema/1-0-0",
          "describes": "EVERY_ROW",
          "propertyNames": "AS_IS"
        }
      },
      "cache": {
        "size": 3000,
        "ttl": 60
      }
    }
  }
}
This file has to pass validation against the corresponding schema in Iglu Central. There are five sections to this configuration file:
- inputs specifying the datapoint(s) from the Snowplow event to use as the key(s) when performing the SQL query
- database defining how the enrichment can access our company’s MySQL database
- query defining how the enrichment should query our MySQL database
- output allowing us to tune how we convert the returned row(s) into one or more self-describing JSONs ready to be attached to our Snowplow event
- cache controlling how query results are cached, so that we don’t overload the database with repeated lookups
Let’s look at each in turn.
inputs
Each input is specified as either a pojo, if the datapoint comes from the Snowplow enriched event POJO, or a json, if the datapoint comes from a self-describing JSON inside one of the three JSON fields, namely contexts, unstruct_event or derived_contexts.
The SQL Query Enrichment can use any field from the enriched event as a key to query the database - in other words, any field can be used to dimension widen the event.
In our scenario, the value field of the change_form event can be found in the unstruct_event field of the enriched event. We can express this in our configuration using the following coordinates:
"inputs": [
{
"placeholder": 1,
"json": {
"field": "unstruct_event",
"schemaCriterion": "iglu:com.snowplowanalytics.snowplow/change_form/jsonschema/1-0-0",
"jsonPath": "$.value"
}
}
]
What we are saying here is:
- The type of the datapoint is json (as opposed to pojo)
- The JSON is located in the unstruct_event field of the Snowplow enriched event TSV file
- The JSON is self-described by the JSON Schema located at iglu:com.snowplowanalytics.snowplow/change_form/jsonschema/1-0-0
- The parameter we need can be accessed at the $.value JSON Path
The placeholder parameter will be explained later on.
database
This section is self-explanatory. Our database is MySQL; you could equally try this out with PostgreSQL or Microsoft SQL Server.
"database": {
"mysql": {
"host": "sqlenh.clk45ymgndgz.us-west-2.rds.amazonaws.com",
"port": 3306,
"sslMode": false,
"username": {{USERNAME}},
"password": {{PASSWORD}},
"database": {{DATABASE}}
}
}
We strongly recommend that the username have minimal read-only permissions on just the entities required to execute the SQL query.
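For example, on MySQL a dedicated read-only user for the enrichment might be set up roughly like this (the user name and password are placeholders; only the users table in the sqlenh database is exposed):
-- Sketch: create a locked-down user that can only SELECT from the lookup table
CREATE USER 'snowplow_enrich'@'%' IDENTIFIED BY 'a-strong-password';
GRANT SELECT ON sqlenh.users TO 'snowplow_enrich'@'%';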
If your database server has additional authentication in place, such as IP whitelisting, you will need to configure this security to permit access from all of your servers running the Snowplow Enrichment process.
query
This is where the placeholder parameter from the last section comes into play. The ? symbol is a placeholder for the key(s) defined in the inputs section. The query is implemented as a prepared statement: a precompiled query used to execute the same or similar database statements repeatedly with high efficiency. It takes the form of a template into which certain constant values are substituted during each execution.
"query": {
"sql": "SELECT firstname, lastname, email, reg_date FROM users WHERE email = ? LIMIT 1"
}
In our example, we have only one placeholder, but there could be more than one; you would then have additional entries in the inputs array.
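For illustration only, a hypothetical two-placeholder query (assuming a second input wired to placeholder 2 - say, the visitor's country, which is not a column in our actual users table) could look like this:
SELECT firstname, lastname, email, reg_date
FROM users
WHERE email = ?    -- filled from the input with "placeholder": 1
  AND country = ?  -- filled from a hypothetical input with "placeholder": 2
LIMIT 1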
Some further notes on the behaviour of sql:
- If a placeholder index required in the sql prepared statement is not found when searching the enriched event for the inputs, then the query will not proceed, but this will not be flagged as a failure
- A final ; is optional
- This enrichment makes no attempt to sanitize the SQL statement, nor to verify that the SQL statement does not have harmful side effects (such as SQL injection)
output
The enrichment adds the returned row(s) into the derived_contexts field within a Snowplow enriched event. Because all JSONs in the derived_contexts field must be self-describing JSONs, we use the schema field to specify the Iglu schema URI of the schema that describes the new contexts (the schema we created earlier in Part 3: Preparing our schema).
"output": {
"expectedRows": "AT_MOST_ONE",
"json": {
"schema": "iglu:ca.properweb/user/jsonschema/1-0-0",
"describes": "EVERY_ROW",
"propertyNames": "AS_IS"
}
}
The expectedRows enum can take the following values:
- EXACTLY_ONE - exactly one row is expected. 0, 2 or more rows will throw an error, causing the entire event to fail to process
- AT_MOST_ONE - either one or zero rows is expected. 2 or more rows will throw an error
- AT_LEAST_ZERO - between 0 and N rows are expected - i.e. we are dealing with an array of results
- AT_LEAST_ONE - between 1 and N rows are expected - i.e. a non-empty array of results. 0 rows will throw an error
For our purposes, we use AT_MOST_ONE because we are expecting to find either a single matching user in our MySQL users table, or no match at all.
The describes property dictates whether the schema is the self-describing schema for all rows returned by the query, or whether the schema should be attached to each of the returned rows:
- ALL_ROWS means that the schema should encapsulate all returned rows - i.e. one context will always be added to derived_contexts, regardless of how many rows are returned. Each returned row will be an array element within the self-describing JSON
- EVERY_ROW means that the schema should be attached to each returned row - so e.g. if 3 rows are returned, 3 contexts with this same schema will be added to derived_contexts
In our case, the JSON Schema we created represents a single user, not an array of users, so we use EVERY_ROW.
The propertyNames property supports reformatting of the returned columns to better fit the JSON Schema’s conventions. Supported options are:
- AS_IS - preserve the column names exactly as they are
- CAMEL_CASE - so date_of_birth becomes dateOfBirth
- PASCAL_CASE - so date_of_birth becomes DateOfBirth
- SNAKE_CASE - so dateOfBirth becomes date_of_birth
- LOWER_CASE - changes all characters to lowercase
- UPPER_CASE - changes all characters to uppercase
If these options aren’t bespoke enough, remember that you can use column aliasing in your SQL statement to tweak individual column names.
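For instance, if we wanted the registration timestamp to surface as registrationDate without relying on CAMEL_CASE, a sketch of an aliased query could read as follows (the JSON Schema would of course need to use the aliased name too):
-- Hypothetical aliasing: rename reg_date in the result set only
SELECT firstname, lastname, email, reg_date AS registrationDate
FROM users
WHERE email = ?
LIMIT 1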
For the purposes of our tutorial, the regular MySQL column names are fine.
cache
A Snowplow enrichment can run many millions of times per hour, effectively launching a DoS attack on a data source if we are not careful! The cache configuration attempts to minimize the number of lookups performed.
The cache is an LRU (Least Recently Used) cache, where less frequently accessed values are evicted to make space for new values. The enrichment uses the query, with its placeholders filled in from the event, as the cache key, and stores the returned rows as the cached value. You can configure the cache as follows:
- size is the maximum number of entries to hold in the cache at any one time
- ttl is the number of seconds that an entry can stay in the cache before it is forcibly evicted. This is useful to prevent stale values from being retrieved in the case that your database can return different values for the same key over time
Part 5: Running and testing our enrichment
Thanks for sticking with this tutorial so far - we are now ready to put our enrichment into action.
To recap, we are dimension widening our Snowplow enriched events by using the value property in our change_form self-describing events to look up rows from a users table we are keeping in MySQL.
The email address expressed in the value property of our change_form event will be used as the key to query the MySQL database during the Snowplow Enrichment process. The new data we want to obtain could be retrieved with the following SQL query:
SELECT firstname, lastname, email, reg_date
FROM users WHERE email = '{{ACTUAL_EMAIL_ADDRESS}}'
LIMIT 1
We do not expect more than one record to be returned, because each email address should be unique within the users table. The new data will be added as a self-describing JSON to the derived_contexts array within our event.
We put our SQL Query Enrichment configuration live in our Snowplow installation and wait for a successful run (if batch) or for a few minutes to pass (if real-time).
Looking in Redshift, we are expecting to see the change_form event in the atomic.events table, the actual value for the captured email address in the atomic.com_snowplowanalytics_snowplow_change_form_1 table, and finally the new (enriched) data obtained from our MySQL database in the atomic.ca_properweb_user_1 table.
We should be able to link the last two to the atomic.events table by means of the event_id = root_id relation:
SELECT e.collector_tstamp, f.value AS email, u.firstname, u.lastname, u.reg_date
FROM atomic.ca_properweb_user_1 AS u
INNER JOIN atomic.com_snowplowanalytics_snowplow_change_form_1 AS f
ON u.root_id = f.root_id
INNER JOIN atomic.events AS e
ON e.event_id = f.root_id
LIMIT 2
Here are my results:
collector_tstamp | email | firstname | lastname | reg_date
---------------------+--------------------+-----------+-----------+---------------------
2016-06-27 15:07:04 | ihortom@eircom.net | Ihor | Tomilenko | 2016-06-21 22:25:00
2016-06-27 17:12:33 | pupkin@mail.ru | Ivan | Pupkin | 2003-05-12 04:34:11
(2 rows)
And there we have it: the data from our MySQL database attached to our Snowplow enriched events.
We hope you enjoyed this tutorial on dimension widening your Snowplow events with your own proprietary data. Also, don’t forget to check out our companion tutorial.
Finally, as always, do please give this a go and share your experience and use cases with us in the comments.