Snowflake Transformer fails

Last Sunday and the Sunday before, we suddenly had 4 consecutive Snowflake Transformer jobs fail (we load hourly, so about 8 hours of data are not loaded). We tried to re-submit these jobs but they failed again. I went through the EMR logs in S3 but can't find any specifics on what went wrong. The jobs fail after running for many minutes (e.g. the one I'm looking at now ran for 34 minutes). Before this, the same setup ran for months without issues.

What’s the correct way to debug / fix this? In which log files should I be able to find the error?

This is our playbook.json:

{
   "schema":"iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
   "data":{
      "region":"us-west-2",
      "credentials":{
          "accessKeyId": "xxxxxx",
          "secretAccessKey": "xxxxxx"
      },
      "steps":[
         {
            "type":"CUSTOM_JAR",
            "name":"Snowflake Transformer",
            "actionOnFailure":"CANCEL_AND_WAIT",
            "jar":"command-runner.jar",
            "arguments":[
               "spark-submit",
               "--conf",
               "spark.hadoop.mapreduce.job.outputformat.class=com.snowplowanalytics.snowflake.transformer.S3OutputFormat",
               "--deploy-mode",
               "cluster",
               "--class",
               "com.snowplowanalytics.snowflake.transformer.Main",
               "s3://snowplow-prod-assets/snowplow-snowflake-transformer-0.6.0.jar",

               "--config",
               "{{base64File "./snowflake_etl.conf"}}",
               "--resolver",
               "{{base64File "./iglu_resolver_snowflake.json"}}",
               "--inbatch-deduplication"
            ]
         },

         {
            "type":"CUSTOM_JAR",
            "name":"Snowflake Loader",
            "actionOnFailure":"CANCEL_AND_WAIT",
            "jar":"s3://snowplow-prod-assets/snowplow-snowflake-loader-0.6.0.jar",
            "arguments":[
               "load",
               "--base64",
               "--config",
               "{{base64File "./snowflake_etl.conf"}}",
               "--resolver",
               "{{base64File "./iglu_resolver_snowflake.json"}}"
            ]
         }
      ],
      "tags":[ ]
   }
}

and this is our snowflake_etl.conf:

{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/snowflake_config/jsonschema/1-0-2",
  "data": {
    "name": "Snowplow Snowflake ETL Config",
    "awsRegion": "us-west-2",
    "auth": {
        "roleArn": "arn:aws:iam::xxxx:role/snowplow-prod-snowflake-loader",
        "sessionDuration": 900
    },
    "manifest": "snowplow-prod-snowflake-manifest",
    "snowflakeRegion": "us-west-2",
    "jdbcHost": "xxxx.snowflakecomputing.com",
    "database": "xxxx",
    "input": "s3://snowplow-prod-good/YYYY-MM-DD-HH/",
    "stage": "snowplow_stage",
    "badOutputUrl": "s3://snowplow-prod-transformed/badrow",
    "stageUrl": "s3://snowplow-prod-transformed/stage",
    "warehouse": "snowplow_etl_wh",
    "schema": "atomic",
    "account": "xxxxx",
    "username": "snowplow_loader",
    "password": {
        "ec2ParameterStore": {
            "parameterName": "snowplow.prod.snowflake.password"
        }
    },
    "maxError": 1,
    "purpose": "ENRICHED_EVENTS"
  }
}

@boba, the EMR logs are the ones to look into. As a minimum, you should be able to see which step failed - transformation or loading. The reason for the failure is also normally in the logs, although it might not be obvious - in particular the error message that comes back from the Snowflake DB when the failure happens at the loading step.
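For a cluster-mode spark-submit step the actual exception is often not in the step's own stderr but in the YARN container logs. EMR typically writes logs under your cluster's logUri as <logUri>/<cluster-id>/steps/<step-id>/ and <logUri>/<cluster-id>/containers/. A minimal sketch for listing them (bucket name and cluster ID are placeholders, adjust to your own logUri):

import boto3

# Placeholders - substitute your EMR log bucket/prefix and the failed cluster's ID
LOG_BUCKET = "your-emr-log-bucket"
CLUSTER_ID = "j-XXXXXXXXXXXXX"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# List every stderr log for the cluster; the driver/executor stack trace
# usually lives in one of the containers/.../stderr.gz files.
for prefix in (f"{CLUSTER_ID}/steps/", f"{CLUSTER_ID}/containers/"):
    for page in paginator.paginate(Bucket=LOG_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("stderr.gz"):
                print(obj["Key"])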

The most likely cause is an under-provisioned EMR cluster. You might need to bump it up if the failure occurred during data transformation.
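Besides resizing the cluster itself in your dataflow-runner cluster config, you can give Spark more memory via extra --conf flags in the transformer step. A hedged illustration of what that could look like right after "spark-submit" in the playbook's arguments array (the values are examples only - size them to your instance types, and keep the rest of the arguments as in your current playbook):

               "spark-submit",
               "--conf",
               "spark.executor.memory=8g",
               "--conf",
               "spark.driver.memory=8g",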

You might also need to check the status of the job as captured in the DynamoDB manifest table; it might be inconsistent with the actual state of the data. For example, the run folder was created and some files have actually been transformed and are present in the stageUrl S3 bucket, but the status in the manifest table is recorded as NEW, which implies no files should be in that run folder under stageUrl. In this scenario, you would delete that run folder from the stageUrl S3 bucket and the corresponding row in DynamoDB before resuming your Snowflake job (see the sketch below).
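A minimal sketch of that cleanup with boto3, using the table and bucket names from your snowflake_etl.conf. The run ID and the manifest key attribute name are assumptions here - check the table's actual key schema and the exact run folder in the DynamoDB console before deleting anything:

import boto3

MANIFEST_TABLE = "snowplow-prod-snowflake-manifest"    # "manifest" in snowflake_etl.conf
STAGE_BUCKET = "snowplow-prod-transformed"             # bucket from "stageUrl"
RUN_ID = "run=2021-05-02-03-00-00"                     # hypothetical run folder name
RUN_PREFIX = f"stage/{RUN_ID}/"                        # run folder under the stageUrl prefix

s3 = boto3.resource("s3")
dynamodb = boto3.resource("dynamodb")

# 1. Remove the partially transformed run folder from the stageUrl bucket.
s3.Bucket(STAGE_BUCKET).objects.filter(Prefix=RUN_PREFIX).delete()

# 2. Remove the matching row from the manifest table.
#    "RunId" is an assumed key name - verify against the table's key schema first.
dynamodb.Table(MANIFEST_TABLE).delete_item(Key={"RunId": RUN_ID})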

Additionally, I'm actually surprised to see the input pointing at a specific YYYY-MM-DD-HH folder. I would expect to see the parent location of all the YYYY-MM-DD-HH folders rather than a specific one. Was it set that way as an attempt to reprocess the failed run?

The error happens in the transformer step. I went through all logs in the corresponding S3 folder and could not find an error message. Do we need to enable additional logging for this?

I think we added the YYYY-MM-DD-HH to the input path so that we can re-run failed jobs more easily. The downside is that all our files end up in s3_folder/YYYY-MM-DD-HH/YYYY-MM-DD-HH, so there is one unnecessary level of nesting.

@boba,

Do we need to enable additional logging for this?

No

The bad part about this is that all our files are located in s3_folder/YYYY-MM-DD-HH/YYYY-MM-DD-HH, so there is one unnecessary depth.

This could be problematic. If I'm not mistaken, the transformer expects "run" folders of the form run=YYYY-MM-DD-hh-mm-ss under the input path. That is, taking the current input path as s3://snowplow-prod-good/YYYY-MM-DD-HH/, the data to be transformed would have to sit in a run=YYYY-MM-DD-hh-mm-ss subfolder, where YYYY-MM-DD-hh-mm-ss is a valid timestamp.
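For illustration only (the timestamps below are hypothetical), with the input set to the parent location the expected layout looks roughly like:

s3://snowplow-prod-good/
    run=2021-04-25-03-00-00/    one enriched batch per run folder
        part-00000
        part-00001
    run=2021-04-25-04-00-00/
        part-00000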

This kind of manipulation is also problematic because the DynamoDB manifest reflects the actual location of the enriched data batches. When you revert to the original input, the very same data will be picked up for transformation (and loading) again, unless the "original" input is a completely different bucket/path and you have moved the "offending" batch out of it.