Error: Directory Already Exists when running Snowflake transformer

Hello,

When I run dataflow-runner for the Snowflake pipeline, I receive the following error:

8/03/20 13:55:36 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory <name of directory> already exists

Is there something in the configuration files that must be set?

Hi @llabe027 - can you talk us through how you have set this up?

I am running a Clojure Collector with Google tags to track JavaScript events. We are trying to create a Snowplow + Snowflake pipeline. I have an S3 bucket for this that contains folders for each of the pieces of the pipeline. The snowplow-emr-etl-runner works for enriching the data. However, when I run dataflow-runner for the Snowflake transformer and loader I get:
Output directory s3a://<bucketName>/<snowflakeFolder>/data/run=2018-03-16-10-13-56 already exists

In order for the transformer to run properly, I have to delete all the folders that are currently in s3a://<bucketName>/<snowflakeFolder>/data/

What are you using for your Dataflow Runner playbook?

Is this what you are referring to?

{
   "schema":"iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
   "data":{
      "region":"us-east-1",
      "credentials":{
         
      },
      "steps":[
         {
            "type":"CUSTOM_JAR",
            "name":"Snowflake Transformer",
            "actionOnFailure":"CANCEL_AND_WAIT",
            "jar":"command-runner.jar",
            "arguments":[
               "spark-submit",
               "--deploy-mode",
               "cluster",
               "--class",
               "com.snowplowanalytics.snowflake.transformer.Main",
               "s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-transformer-0.3.1.jar",
               "--config",
               "{{base64File "./loader.json"}}",
               "--resolver",
               "{{base64File "./iglu_resolver.json"}}"
            ]
         },

         {
            "type":"CUSTOM_JAR",
            "name":"Snowflake Loader",
            "actionOnFailure":"CANCEL_AND_WAIT",
            "jar":"s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.3.1.jar",
            "arguments":[
               "load",
               "--base64",
               "--config",
               "{{base64File "./loader.json"}}",
               "--resolver",
               "{{base64File "./iglu_resolver.json"}}"
            ]
         }
      ],
      "tags":[ ]
   }
}

If this makes a difference, I noticed that the loader and transformer were both 0.3.0 in the playbook.json file when I was running dataflow-runner. I just switched the version before I sent it to you.

Hey @llabe027,

This isn’t something we’ve seen before. Your playbook also looks correct, and switching to 0.3.1 shouldn’t have any unexpected effect.

However, I’m wondering if you’re trying to use a persistent cluster? The common Snowflake Loader architecture assumes that a new cluster is bootstrapped for each run and destroyed after it finishes its job.
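To illustrate, a transient setup usually just wraps Dataflow Runner's run-transient command, which spins the cluster up, runs the playbook steps, and terminates it afterwards. A minimal sketch (the file names below are only examples):

    # Bootstrap an EMR cluster, run the transformer + loader steps from the
    # playbook, then terminate the cluster once the steps have finished.
    dataflow-runner run-transient \
      --emr-config ./cluster.json \
      --emr-playbook ./playbook.json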

Also, what is the directory behind <name of directory>? Is it your archive on S3 or (presumably) an HDFS path?

Hi,
The <name of directory> is the S3 bucket where the Snowflake data is stored. I have two separate folders in the bucket: one for the Snowplow data and the other for Snowflake. In the snowplow/data/archive directory, it appears that directories for each run are created and remain until deleted. The same is true of the snowflake/data directory; I have to delete the directories it creates after each run.

Does that mean you already had processed data there, and this problem has only just appeared?

Also, is <name of directory> something like s3://mybucket/snowflake/data/archive/ or s3://mybucket/snowflake/data/archive/run=2018-03-22-10-00-00/? It seems the problem is simply that the folder really does exist.

<name of directory> is like s3://mybucket/snowflake/data/archive/run=2018-03-22-10-00-00/. And yes, it seems to be because the folder exists. Should I be configuring the bucket to delete old runs?

Technically, you can add an aws s3 rm statement to your launch script after dataflow-runner. However, I’m now more puzzled about why the Transformer tries to re-process that run. Could you please share what the DynamoDB manifest record for s3://mybucket/snowflake/data/archive/run=2018-03-22-10-00-00/ looks like?
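For example, something along these lines in the launch script (the bucket and folder names are the placeholders from your earlier posts):

    # Clear the stale transformer output for that run so the next execution
    # can write a fresh run=... directory.
    aws s3 rm s3://<bucketName>/<snowflakeFolder>/data/run=2018-03-16-10-13-56/ --recursive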

Is this what you are referring to?

    AddedAt (Number):  1521554135
    AddedBy (String):  0.3.0
    RunId (String):    snowplow/data/archive/enriched/run=2018-03-16-10-13-56/
    ToSkip (Boolean):  false
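(That is taken from the DynamoDB console view; roughly the same record can be dumped from the CLI with something like the following, where the table name is whatever manifest table is configured in loader.json:)

    # List the run records in the Snowflake run manifest;
    # "snowflake-manifest" is a placeholder for your configured table name.
    aws dynamodb scan --table-name snowflake-manifest --region us-east-1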

Yep, that’s what I’m referring to. Thanks. I’ll try to figure out how it is possible that the transformer is trying to overwrite the directory.

Right now, you can safely delete the existing directory; from the manifest record I can tell that it has not been loaded yet.