Consuming data in S3 through different services - Is there a way to use Glue Data Catalog? (or other solution)

Coqueiro · September 28, 2018, 1:23pm

I am still wondering if I can create a Glue Data Catalog crawler that reads through all fields, including JSON fields, with some logic given, to get contexts and unstruct_event data, but this seems too ambitious. As of now, I manually created a glue table with those settings:

I inserted all 125 columns in the schema, with proper description to ease the life of my coworkers. For that, I used those references:
Using AWS Athena to query the ‘good’ bucket on S3, by @dilyan
Canonical event model doc in Snowplow’s GitHub wiki

As of now, we are able to query data through Athena and other services using this data catalog, and through Athena we can create Views that get the relevant data from JSON fields. This table is lacking partitions, which should be a nice addition, but at least we have something going.

We are discussing other approaches, for example, using Firehose with a transformation lambda function to transform data before load, which should make computing time easier, or give the EmrEtlRunner a try, but we would love to hear other possibilities from people that already faced those problems (or not, just people with ideas ).

Topic		Replies	Views
Storage to Data Modeling to Analytics Data store sources	8	1794	December 22, 2020
Approaches to access data in S3 For data modelers & consumers	2	1621	May 18, 2021
Using AWS Athena to query the shredded events For data modelers & consumers	0	5564	August 4, 2017
Using AWS Athena and AWS Glue with Snowplow data For data modelers & consumers	0	1277	March 22, 2019
Replay data from S3 AWS real-time pipeline	3	2633	February 14, 2018

Consuming data in S3 through different services - Is there a way to use Glue Data Catalog? (or other solution)

Related topics