I’ve run into a problem with the Snowflake Loader skipping data, and I’m trying to find a workaround.
If the folder names are set to YYYY-MM-DD-HH (i.e. hourly) and the Snowflake Loader runs every 4 hours, each run marks the current folder as complete even though the streaming enrich is still populating it with data. As a result, every 4 hours about 50% of the data for that hour doesn’t get loaded into Snowflake.
I’ve found a workaround by setting the folder name to YYYY-MM-DD-HH-mm (i.e. every minute), which reduces the skipped data to no more than 1 minute every 4 hours. However, this has considerably increased the EMR processing time.
Is there a better way around this problem?
@iain, I’m not sure what you mean by “snowflake loader skipping data”. Here’s how we do it. Kinesis S3 Loader is typically configured to upload data to S3 every ~10 mins or so. Then we stage that data to an archive bucket (separate from the one used by S3 Loader) and start the Snowflake Transformer and Loader. We also set the folder name to YYYY-MM-DD-hh-mm-ss. How often this job is scheduled depends on how much data latency in the Snowflake DB is acceptable. Note that the folder name does not represent how much data is skipped, as no data is expected to be “skipped” at all.
The more data accumulates in the run folder, the bigger the EMR cluster you need to process it.
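To make the staging step concrete, here is a minimal sketch of the idea rather than our actual job. Plain dicts stand in for the two S3 buckets, and the names `run_folder_name` and `stage` are hypothetical. The point is that each run moves whatever has landed so far into a fresh folder named with second-level precision, so nothing keeps writing to a folder once the Transformer and Loader have picked it up.

```python
from datetime import datetime


def run_folder_name(now):
    """Build a run folder name in the YYYY-MM-DD-hh-mm-ss form."""
    return now.strftime("%Y-%m-%d-%H-%M-%S")


def stage(landing, archive, now):
    """Move every object currently in `landing` into a new run folder in `archive`.

    `landing` and `archive` are dicts mapping object keys to contents; they are
    stand-ins for the S3 Loader bucket and the archive bucket (an assumption
    made for illustration only).
    """
    folder = run_folder_name(now)
    # Snapshot the keys first so objects arriving later wait for the next run.
    for key in list(landing):
        archive[folder + "/" + key] = landing.pop(key)
    return folder


landing = {"part-0001.gz": b"events-a", "part-0002.gz": b"events-b"}
archive = {}
folder = stage(landing, archive, datetime(2020, 1, 1, 16, 15, 0))
```

After this call the landing bucket is empty and the archive holds both objects under a single run folder, which is what the Transformer and Loader then process.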
When the folder name was YYYY-MM-DD-hh (which is the default in the config), we found that whenever the loader ran, the folder it ran against was marked as complete, so any data written to that folder afterwards was missed. For example:
If the folder name is 2020-01-01-16 and the Snowflake loader runs at 16:15 on 1/1/2020, the Kinesis S3 loader is still writing data to that folder. However, the Snowflake loader marks it as complete, because it has no way of knowing that the folder is still being filled. So the next time the Snowflake loader runs, it won’t look in that folder for new events.
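The behaviour can be reproduced with a toy model. This is only a sketch of what I observed, not the Snowflake loader’s real logic: `load_new_folders` and the `completed` set are hypothetical stand-ins for however the loader tracks processed folders.

```python
def load_new_folders(bucket, completed):
    """Load every folder not yet marked complete, then mark it complete."""
    loaded = []
    for folder in sorted(bucket):
        if folder in completed:
            continue  # already marked complete, so it is never re-read
        loaded.extend(bucket[folder])
        completed.add(folder)
    return loaded


# The hourly folder already has some events when the loader runs at 16:15.
bucket = {"2020-01-01-16": ["event-16:05", "event-16:10"]}
completed = set()

first_run = load_new_folders(bucket, completed)

# The S3 loader keeps writing to the same hourly folder after that run.
bucket["2020-01-01-16"].append("event-16:40")

# The next scheduled run skips the folder because it is marked complete.
second_run = load_new_folders(bucket, completed)
```

In this model the 16:40 event is in the bucket but appears in neither run, which matches the missing data I was seeing.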
However, I can see that your approach of increasing the S3 Loader latency to 10 minutes or so, and using a folder name of YYYY-MM-DD-hh-mm-ss, will solve the problem I was having.
Is it worth updating the suggested defaults in the S3 Loader config to reflect the above? I’m happy to put in a PR if so.