I am running a Clojure Collector with Google tags to track JavaScript events, and we are trying to build a Snowplow + Snowflake pipeline. I have an S3 bucket for this that contains folders for each piece of the pipeline. The snowplow-emr-etl-runner works for enriching the data. However, when I run dataflow-runner for the Snowflake Transformer and Loader I get: Output directory s3a://<bucketName>/<snowflakeFolder>/data/run=2018-03-16-10-13-56 already exists
For the Transformer to run properly, I have to delete all the folders that are currently in s3a://<bucketName>/<snowflakeFolder>/data/
If this makes a difference: I noticed that the Loader and Transformer were both at 0.3.0 in the playbook.json file when I was running dataflow-runner. I only switched the version to 0.3.1 just before I sent it to you.
This isn’t something we’ve seen before. Your playbook also looks correct, and switching to 0.3.1 shouldn’t have any unexpected effect.
However, I’m wondering if you’re trying to use a persistent cluster? The common Snowflake Loader architecture assumes that for each run a new cluster is bootstrapped and then destroyed once it has finished its job (there’s a sketch of this at the end of this post).
Also, what’s the directory behind <name of directory>: is it your archive on S3, or (presumably) an HDFS path?
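For reference, the transient setup is usually a single Dataflow Runner invocation that bootstraps the cluster, runs the playbook steps and then terminates it. A minimal sketch, assuming your Dataflow Runner version supports run-transient, with placeholder file names:

    # Bootstrap an EMR cluster, run the Transformer/Loader playbook steps,
    # then terminate the cluster once they have finished.
    # cluster.json and playbook.json are placeholders for your own files.
    ./dataflow-runner run-transient \
      --emr-config ./cluster.json \
      --emr-playbook ./playbook.json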
Hi,
The <name of directory> is the S3 bucket where the Snowflake data is stored. I have two separate folders in the bucket: one where the Snowplow data is stored, the other for Snowflake. In the snowplow/data/archive directory, a directory for each run is created and remains until deleted. The same is true of the snowflake/data directory; I have to delete the directories created after each run.
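For illustration, this is how I check that prefix; every run leaves its own run= folder behind (the bucket and prefix are placeholders for my real ones):

    # List the run= folders left under the snowflake data prefix.
    aws s3 ls s3://mybucket/snowflake/data/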
Does that mean you had already processed data before, and this problem has only just appeared?
Also, is <name of directory> something like s3://mybucket/snowflake/data/archive/ or s3://mybucket/snowflake/data/archive/run=2018-03-22-10-00-00/? It seems the problem is simply that the folder really does exist.
<name of directory> is like s3://mybucket/snowflake/data/archive/run=2018-03-22-10-00-00/. And yes, it seems to be because the folder exists. Should I be configuring the bucket to delete old runs?
Technically, you could add an aws s3 rm statement to your launching script after the dataflow-runner invocation; however, I’m now more confused about why the Transformer tried to re-process it. Could you please share what the DynamoDB manifest record for this s3://mybucket/snowflake/data/archive/run=2018-03-22-10-00-00/ looks like?
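To illustrate the first part, a sketch of the launching script (file names, bucket and prefix are placeholders for your own):

    # Run the Transformer/Loader playbook on a transient cluster.
    ./dataflow-runner run-transient \
      --emr-config ./cluster.json \
      --emr-playbook ./playbook.json

    # Once the playbook has finished, clear the transformer output so the
    # next run starts from an empty prefix.
    aws s3 rm --recursive s3://mybucket/snowflake/data/

And for the manifest, a plain dump of the table is enough. The table name below is a placeholder; use whichever table you configured for the Snowflake Loader:

    # Show every run entry recorded in the processing manifest.
    aws dynamodb scan --table-name snowflake-manifest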