We are using the stream enricher pipeline. All enriched events are being store in a S3 using the kinesis-s3 sink. The problem is, that all files end up in the same directory. The only way we can tell the dates apart are the “-” separated file name prefix.
This sort of file organization makes it practically impossible to be used with AWS Athena to perform adhoc queries on that data as there is only one partition which might grow really big. In our example it’s close to 40TB.
I suggest to write files belonging to a given day into it’s own directory. Instead of writing files like this:
for example: 2016-04-27-49561210860930246946106208671673286283317939965855268866-49561210860930246946106208671675704134957169292924157954.gz
The desired structure would be:
This would allow AWS Athena, or any software accessing that data, to load only the data from the dates we are interested in by specifying a path.
As i understand, that batch and real time pipeline are supposed to work in a similar way, it would mean that both parts of the pipeline need to be adjusted to pick up enriched data from the directory.
Would the proposed change be useful for anyone else except us? How much of an effort would that imply including to migrate existing data?
I think long term there’s definitely some advantages to moving to a format that’s easier to filter based on prefix.
A temporary fix might be to write a Lambda that triggers on PUTs to the bucket and copies (or moves) the key name from s3://original-bucket/YYYY-MM-DD-start_pointer-end_pointer.gz to s3://new-bucket/YYYY/MM/DD/start_pointer-end_pointer.gz?
When does AWS plan to release the new data catalog service? Definitely within a 12 month roadmap, but I don’t have the details. Would it then be better to simply add/remove data catalog entries every time s3 object is created or moved or deleted? This way data can be analyzed by redshift spectrum, athena and possibly other users of the catalog. And it would come cheap with a few API calls here and there.
This came up at our meeting with AWS sales and support. It may be a part of AWS Glue ( https://aws.amazon.com/glue ) or a stand-alone product. What was discussed, was an imminent GA release sometime Q3-Q4 2017, possibly but unlikely later. This is specifically in a context of s3 data lake and accessibility through redshift spectrum, which requires a data catalog. We first thought of adding s3 objects to Athena for exposure to Spectrum, but through the conversation it was suggested to wait until the proper solution hits the market.
Thanks @dashirov for the link. Sounds pretty interesting, let’s see.
But in any way, wouldn’t it be beneficial, to have an option for your config, where to write the files?
Reconsidering AWS lambda: As AWS S3 does not support a “move” command, the lambda function would need to download the content of the file and upload it back to a different location. And in case it was successful, the lambda function would need to delete the old file. That would essentially double the write access to S3. Does not sound ideal to me.
We are definitely interested in doing something here; @dilyan currently has a Discourse post on working with Snowplow data from Athena in draft, which we hope to publish soon. (After which we will circle back on this.)
The S3 Data Lake / Data Catalog sounds interesting!