We receive a considerable amount of questions related to JSONPaths files. Even though the topic is not Snowplow specific nonetheless it plays a significant part in ability to build extensible event model. Incorrect handling of JSONPaths leads to a few confusions resulting in either wrong data populating the wrong Redshift fields or not getting there at all.
The custom self-describing (unstructured) events and contexts are sent as JSONs. To load the data presented in such a form, we utilize Redshift’s COPY FROM JSON
command. JSONPaths files are the means by which this action becomes possible. The reason is the key/value pairs in the JSONs could be “assembled” in different order as well as the names of the keys in it might not match the names of the fields in the related Redshift table. Thus, JSONPaths files play a mapping role providing a match between the parameters (keys) in the JSON and the Redshift fields.
Let’s take a look how Amazon describes JSONPaths:
COPY uses a JSONPaths file to parse the JSON source data. A JSONPaths file is a text file that contains a single JSON object with the name “jsonpaths” paired with an array of JSONPath expressions…
In the Amazon Redshift COPY syntax, a JSONPath expression specifies the explicit path to a single name element in a JSON hierarchical data structure, using either bracket notation or dot notation.
(AWS docs: COPY from JSON Format)
The way to think of JSONPath is they are like XPath for XML document. That is the JSONPath allows to locate an element within the JSON.
JSONPaths expressions always refer to a JSON structure in the same way as XPath expression are used in combination with an XML document. Since a JSON structure is usually anonymous and doesn’t necessarily have a “root member object” JSONPath assumes the abstract name $
assigned to the outer level object. Thus, in dot notation (which Snowplow adheres to), the child element version_app
of the parent data
as in
{
"schema": "iglu:com.acme/context/jsonschema/1-0-0",
"data": {
"version_app": "1.7.0"
}
}
would be referenced as $.data.version_app
in JSONPaths file. It’s true regardless whether "data": { "version_app": "1.7.0" }
precedes or follows "schema": "iglu:com.acme/context/jsonschema/1-0-0"
.
Once the element got located we need to know which Redshift field it has to go to. As mentioned earlier, it is not trivial as the name of the field does not necessarily match the name of the JSON’s element.
If the JSON data objects don’t correspond directly to column names, you can use a JSONPaths file to map the JSON elements to columns. Again, the order does not matter in the JSON source data, but the order of the JSONPaths file expressions must match the column order.
(AWS docs: COPY Examples)
Thus, the mapping on the other side of the JSONPaths file is done by means of the element order. In other words, the position of the element (within the list of elements) will point to the Redshift field as it was created with CREATE TABLE
command.
Therefore, to be able to load the custom data successfully, the JSONPaths file should list (locate) all the expected parameters sent with a specific event/context and its order should match the corresponding order of the Redshift table.
This mapping role of JSONPaths files is depicted in the following diagram to make it more clear.