More robust JSON parsing in Redshift with Python UDFs

yali · May 4, 2016, 10:13am

Most Snowplow users querying their data in Redshift won't need to parse JSON there, because Snowplow shreds self-describing events and custom contexts into their own tables.

Occasionally, however, it is necessary or desirable to work with JSON in Redshift. A couple of examples:

Sometimes data is captured as arrays in varchar fields. A common example is the form_classes and elements fields in the submit_form_1 table, which is populated by the JavaScript form tracking.
Sometimes it is useful to create complex data types like arrays for analyses such as funnel or pathing analysis, because this lets you aggregate the steps in a user journey into a single line of data without being limited by, or needing to know, the number of steps ahead of time.

Unfortunately, Redshift’s built-in JSON parsing functions are very brittle: they raise an error if just one input value is not valid JSON.

We therefore recommend using Redshift’s support for Python UDFs to write more robust functions for parsing JSON data. At minimum, it is straightforward to create a simple function that checks whether a string is valid JSON:

create or replace function is_json(j varchar(max))
  returns boolean
  stable as $$
    import json
    try:
      # json.loads raises ValueError when j is not valid JSON
      json.loads(j)
    except ValueError:
      return False
    return True
  $$ language plpythonu;
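
Because the body of the UDF is ordinary Python, the same check can be tried locally before creating the function in Redshift. The following is a minimal sketch under that assumption: `is_json_local` is a hypothetical name, the sample strings are made up, and the extra `TypeError` catch (for a `None` input) is an addition that is not part of the UDF above.

```python
import json


def is_json_local(j):
    """Return True if j parses as JSON, False otherwise.

    Local stand-in for the UDF body; the name is hypothetical.
    """
    try:
        json.loads(j)
    except (ValueError, TypeError):
        # ValueError: malformed JSON text
        # TypeError: non-string input such as None (an added safeguard)
        return False
    return True


if __name__ == "__main__":
    # Hypothetical sample values: two valid documents, two broken ones, one None
    for sample in ['{"a": 1}', '["x", "y"]', '{"a": ', 'not json', None]:
        print(repr(sample), is_json_local(sample))
```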

This can then be used with a CASE expression to filter out invalid JSON before applying one of Redshift’s built-in JSON parsing functions:

SELECT
CASE WHEN is_json(my_json_field) THEN my_json_field ELSE '{}' END AS filtered_jsons
...
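
An alternative to the CASE filter is to fold the validity check and the parsing into a single function, so that a bad row yields an empty result rather than an error. The following is a minimal sketch of the Python logic only, runnable locally: `safe_json_array_length` is a hypothetical name, and wrapping it in a `create function ... language plpythonu` statement like the one above is assumed rather than tested here.

```python
import json


def safe_json_array_length(j):
    """Return the number of elements in a JSON array string.

    Returns None when j is not valid JSON or is valid JSON but not an array.
    The function name is hypothetical.
    """
    try:
        parsed = json.loads(j)
    except (ValueError, TypeError):
        # Malformed JSON text, or a non-string input such as None
        return None
    if not isinstance(parsed, list):
        # Valid JSON, but an object, number, or string rather than an array
        return None
    return len(parsed)
```

With this shape, a value such as '["a", "b"]' gives 2, while a truncated string or a JSON object gives None, which would typically surface as NULL if the function were registered as a UDF.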

yali · May 4, 2016, 10:14am

The folks at Periscope Data have published a useful set of UDFs for parsing JSON in Redshift here.

Please reply to the thread with any other useful resources for JSON parsing in Redshift!
