Hi all,
We have been implementing the following Snowplow pipeline to load data into Snowflake:
Collector → Enricher → S3 Loader → (EMR FROM HERE) S3DistCp → Snowflake Transformer → Snowflake Loader → S3DistCp for archive
Everything works fine up to and including the first S3DistCp step, but when the jobs run on EMR, the transformer fails with the following error:
Caused by: java.io.IOException: Not a file: s3a://snowplow-events/enriched/archive/run=2021-05-26-13-21-12/2021/05
I'm guessing the error appears because that path is in fact not a file but a folder. After S3DistCp runs, the folder structure is as follows:
snowplow-events/enriched/archive/run=2021-05-26-13-21-12/YEAR/MONTH/DAY/HOUR
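To make the layout concrete, here is a minimal Python sketch of how I assume the copy maps keys: each object's path relative to the source prefix is kept and appended to the destination prefix. This is only an illustration, not S3DistCp's actual code; the `dest_key` helper and the file name `part-0000.gz` are made up for the example.

```python
def dest_key(src_prefix: str, dest_prefix: str, src_key: str) -> str:
    """Map a source key to a destination key, preserving the path relative to src_prefix.

    Hypothetical helper: it only mimics the assumed copy behaviour.
    """
    src_prefix = src_prefix.rstrip("/") + "/"
    if not src_key.startswith(src_prefix):
        raise ValueError(f"{src_key!r} is not under {src_prefix!r}")
    relative = src_key[len(src_prefix):]
    return dest_prefix.rstrip("/") + "/" + relative


# Prefixes taken from the step configuration; the object name is an assumption.
src_prefix = "enriched/good/"
dest_prefix = "enriched/archive/run=2021-05-26-13-21-12/"
src_key = "enriched/good/2021/05/26/13/part-0000.gz"

print(dest_key(src_prefix, dest_prefix, src_key))
```

Under that assumption the date-partitioned subfolders written by the S3 Loader are carried over below the run= folder, which would match the nested structure I am seeing.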
Is there some configuration I need to change to make it run correctly? This is the configuration for the S3DistCp step:
"arguments": [
  "s3-dist-cp",
  "--src",
  "s3://snowplow-events/enriched/good/",
  "--dest",
  "s3://snowplow-events/enriched/archive/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
  "--srcPattern",
  ".*\.gz",
  "--s3Endpoint",
  "s3.eu-west-1.amazonaws.com",
  "--s3ServerSideEncryption"
]
Thank you very much for all the support! Let me know if I need to provide more information.
Best regards,
Martin