I am still wondering if I can create a Glue Data Catalog crawler that reads through all fields, including JSON fields, with some logic given, to get contexts
and unstruct_event
data, but this seems too ambitious. As of now, I manually created a glue table with those settings:
I inserted all 125 columns in the schema, with proper description to ease the life of my coworkers. For that, I used those references:
Using AWS Athena to query the ‘good’ bucket on S3, by @dilyan
Canonical event model doc in Snowplow’s GitHub wiki
As of now, we are able to query data through Athena and other services using this data catalog, and through Athena we can create Views that get the relevant data from JSON fields. This table is lacking partitions, which should be a nice addition, but at least we have something going.
We are discussing other approaches, for example, using Firehose with a transformation lambda function to transform data before load, which should make computing time easier, or give the EmrEtlRunner a try, but we would love to hear other possibilities from people that already faced those problems (or not, just people with ideas ).