Prevent duplicates created by S3DistCp (Missing manifest file)

Hello,

Our Fargate task spins up an EMR cluster to execute two steps: first S3DistCp, then the Shredder job. The EMR cluster is terminated at the end of the two steps, and the Fargate task spins up the next EMR cluster right after. The issue is that on every S3DistCp run, every object in the loader_bucket (enriched data) gets scanned and copied into the enriched_bucket under the new run directory. As a result we end up with many duplicates in the shredded_bucket and, consequently, in Redshift. Here is our dataflow_runner playbook setup:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
  "data": {
    "region": "eu-west-1",
    "credentials": {
      "accessKeyId": "AWS_ACCESS_KEY_ID",
      "secretAccessKey": "AWS_SECRET_ACCESS_KEY"
    },
    "steps": [
      {
        "type": "CUSTOM_JAR",
        "name": "S3DistCp enriched data archiving",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
        "arguments": [
            "--src", "SP_LOADER_URI",
            "--dest", "SP_ENRICHED_URI/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
            "--srcPattern", ".*",
            "--outputCodec", "gz",
            "--outputManifest", "manifest-1.gz",
            "--previousManifest", "/usr/bin/manifest-1.gz",
            "--requirePreviousManifest", "false"
        ]
      },

      {
        "type": "CUSTOM_JAR",
        "name": "RDB Shredder",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "command-runner.jar",
        "arguments": [
            "spark-submit",
            "--class", "com.snowplowanalytics.snowplow.shredder.Main",
            "--master", "yarn",
            "--deploy-mode", "cluster",
            "s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar",
            "--iglu-config", "{{base64File "resolver.json"}}",
            "--config", "{{base64File "config.hocon"}}"
        ]
      }
    ],
    "tags": [ ]
  }
}

We have enabled the "--outputManifest" option on S3DistCp to keep track of files that have already been copied, but we could not find the manifest file anywhere on the EMR EC2 instances (or in the Fargate container?).

What is the best solution to avoid creating duplicates? (With or without manifest)
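One detail that may explain the missing manifest: if I recall correctly, S3DistCp writes the file named by "--outputManifest" into the "--dest" location on S3, not onto the local filesystem, so nothing will appear under /usr/bin on the EMR nodes. A local "--previousManifest" path also cannot survive between runs here, because each EMR cluster is terminated after its two steps. If you wanted to keep the manifest approach, "--previousManifest" would have to point at the previous run's manifest on S3. A rough sketch (the S3 path for the previous run's manifest is an assumption and would need to be resolved per run, since each run writes to its own run= directory):

```json
"arguments": [
    "--src", "SP_LOADER_URI",
    "--dest", "SP_ENRICHED_URI/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
    "--srcPattern", ".*",
    "--outputCodec", "gz",
    "--outputManifest", "manifest-1.gz",
    "--previousManifest", "s3://<enriched-bucket>/run=<previous-run-timestamp>/manifest-1.gz",
    "--requirePreviousManifest", "false"
]
```

Because the previous run's timestamp changes every time, the orchestration layer (the Fargate task, in this setup) would have to template that path in before submitting the playbook, which is why most setups avoid the manifest mechanism entirely.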

@dadasami, I think you are missing one option in the S3DistCp archiving step: "--deleteOnSuccess". Because the files are not deleted at the source when the copy completes, you process them again on every run. I think this option replaces your manifest.
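For reference, the first step's argument list with "--deleteOnSuccess" added and the manifest flags dropped might look like this (a sketch based on the playbook above, not a tested configuration; "--deleteOnSuccess" turns the copy into a move by deleting each source object once it has been copied successfully, so the next run finds an empty loader_bucket):

```json
"arguments": [
    "--src", "SP_LOADER_URI",
    "--dest", "SP_ENRICHED_URI/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
    "--srcPattern", ".*",
    "--outputCodec", "gz",
    "--deleteOnSuccess"
]
```

One caveat worth considering: with "--deleteOnSuccess" the enriched_bucket run directories become the only copy of the data, so make sure nothing else still reads from loader_bucket before enabling it.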
