We used a Kinesis Firehose to get data from the Stream Enrich stream, mainly because it's fully managed and already partitions the data by year, month, day and hour. The problem is that the data coming from the stream into the Firehose is a TSV where records are not separated by a newline (\n) (damn). We are using a Lambda function to break the records at every 130th \t, which lets us keep this data flow.
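Here is a minimal sketch of what such a Firehose transformation Lambda could look like, under my assumptions about the approach described above: it splits the payload on tabs and regroups the fields into one line per event. `FIELDS_PER_RECORD` is an assumed constant, not something from the original post; adjust it to match your enriched event schema.

```python
import base64

# Assumption: each Snowplow enriched event is a TSV row with a fixed
# number of fields. Adjust FIELDS_PER_RECORD to match your schema.
FIELDS_PER_RECORD = 131


def handler(event, context):
    """Firehose transformation Lambda: re-insert newlines between events."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        fields = payload.rstrip("\n").split("\t")
        # Regroup the flat list of fields into one TSV line per event.
        lines = [
            "\t".join(fields[i:i + FIELDS_PER_RECORD])
            for i in range(0, len(fields), FIELDS_PER_RECORD)
        ]
        transformed = ("\n".join(lines) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}
```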
After that we had to deal with the Glue Catalog. We managed to create a Data Catalog with partitions by running a crawler over the actual data structure, then editing the catalog schema and configurations so the files are read properly; those are basically the configurations I provided in the image above. Then I edited the crawler so it doesn't change the table configuration, aside from adding partitions, and voilà: we have a catalog with date partitions (year, month, day, hour) that can be updated via crawlers. I also tried to set up a crawler with a Grok expression as a classifier, but failed miserably.
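For reference, a boto3 sketch of a crawler set up along those lines is shown below: the schema change policy only logs changes (so the hand-edited table definition is left alone) while new partitions inherit their metadata from the table. The crawler, database and S3 path names are hypothetical placeholders, not the ones from my setup.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names -- replace with your own crawler, database and path.
glue.update_crawler(
    Name="snowplow-enriched-crawler",
    DatabaseName="datalake",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/snowplow/enriched/"}]},
    # Don't let the crawler rewrite the hand-tuned table schema...
    SchemaChangePolicy={
        "UpdateBehavior": "LOG",
        "DeleteBehavior": "LOG",
    },
    # ...but let new partitions inherit their config from the table.
    Configuration=(
        '{"Version":1.0,'
        '"CrawlerOutput":{"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}'
    ),
)
```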
Although this solution is somewhat messy, I would say the issue is solved, but I think this subject could lead to some interesting discussions, which I'm open to.
Just to be clear: this flow is not necessary to load data into Redshift or Elasticsearch, for example, which is, to the best of my knowledge, the actual Snowplow downstream pipeline (I guess this article is great for understanding the moving pieces), but it gives my team the flexibility to query data directly from our data lake, which also holds data from other pipelines.
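Just to illustrate the payoff, here is a hedged sketch of querying the partitioned table through Athena via boto3; the database, table and output bucket names are made up for the example, but the point is that the date partitions added by the crawler let the query prune the scan.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical database/table names -- the interesting part is that the
# crawler-managed partitions (year, month, day, hour) limit what is scanned.
response = athena.start_query_execution(
    QueryString="""
        SELECT count(*)
        FROM datalake.snowplow_enriched
        WHERE year = '2019' AND month = '03' AND day = '15'
    """,
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```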