We used a Kinesis Firehose to get data from the Stream Enrich stream, mainly because it's fully managed and already partitions the data by year, month, day and hour. The problem is that the data coming from the stream into the Firehose is a TSV where records are not separated by a newline (\n) (damn). We are using a Lambda function to break the records at every 130th \t, which lets us keep this data flow.
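Here is a minimal sketch of what such a Firehose transformation Lambda could look like, under my assumptions about the approach described above: it splits the payload on tabs and regroups the fields into one line per event. `FIELDS_PER_RECORD` is an assumed constant, not something from the original post; adjust it to match your enriched event schema.

```python
import base64

# Assumption: each Snowplow enriched event is a TSV row with a fixed
# number of fields. Adjust FIELDS_PER_RECORD to match your schema.
FIELDS_PER_RECORD = 131


def handler(event, context):
    """Firehose transformation Lambda: re-insert newlines between events."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        fields = payload.rstrip("\n").split("\t")
        # Regroup the flat list of fields into one TSV line per event.
        lines = [
            "\t".join(fields[i:i + FIELDS_PER_RECORD])
            for i in range(0, len(fields), FIELDS_PER_RECORD)
        ]
        transformed = ("\n".join(lines) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}
```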
After that we had to deal with the Glue Catalog. We managed to create a Data Catalog with partitions by running a crawler over the actual data structure, then editing the catalog schema and configurations so the files are read properly; those are basically the configurations I provided in the image above. Then I edited the crawler so it doesn't change the table configuration, aside from adding partitions, and voilà: we have a catalog with date partitions (year, month, day, hour) that can be updated via crawlers. I also tried to set up a crawler with a Grok expression as a classifier, but failed miserably.
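For reference, a boto3 sketch of a crawler set up along those lines is shown below: the schema change policy only logs changes (so the hand-edited table definition is left alone) while new partitions inherit their metadata from the table. The crawler, database and S3 path names are hypothetical placeholders, not the ones from my setup.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names -- replace with your own crawler, database and path.
glue.update_crawler(
    Name="snowplow-enriched-crawler",
    DatabaseName="datalake",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/snowplow/enriched/"}]},
    # Don't let the crawler rewrite the hand-tuned table schema...
    SchemaChangePolicy={
        "UpdateBehavior": "LOG",
        "DeleteBehavior": "LOG",
    },
    # ...but let new partitions inherit their config from the table.
    Configuration=(
        '{"Version":1.0,'
        '"CrawlerOutput":{"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}'
    ),
)
```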
Although this solution is somewhat messy, I would say the issue is solved, but I think this subject could lead to some interesting discussions, which I'm open to.
Just to be clear: this flow is not necessary to load data into Redshift or Elasticsearch, for example, which is, to the best of my knowledge, the actual Snowplow downstream pipeline (I guess this article is great for understanding the moving pieces), but it gives my team the flexibility to query data directly from our data lake, which also holds data from other pipelines.
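Just to illustrate the payoff, here is a hedged sketch of querying the partitioned table through Athena via boto3; the database, table and output bucket names are made up for the example, but the point is that the date partitions added by the crawler let the query prune the scan.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical database/table names -- the interesting part is that the
# crawler-managed partitions (year, month, day, hour) limit what is scanned.
response = athena.start_query_execution(
    QueryString="""
        SELECT count(*)
        FROM datalake.snowplow_enriched
        WHERE year = '2019' AND month = '03' AND day = '15'
    """,
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```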