Hey,
we set up Dataflow Runner with two jobs:
- S3DistCp and
- Shredder
We followed this documentation:
R35 Upgrade Guide - Snowplow Docs
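For context, we launch both steps in one transient EMR run roughly like this (a sketch; cluster.json is our EMR cluster config and the file names here are ours):

dataflow-runner run-transient --emr-config cluster.json --emr-playbook playbook.json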
The first job works fine, and the data in our enriched bucket looks like this:
The second job finishes without an error, but there is no data in our S3 buckets s3://sp-shredded/bad and s3://sp-shredded/good.
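For reference, the shredded buckets can be listed with the AWS CLI like this (nothing comes back for either prefix):

aws s3 ls s3://sp-shredded/good/ --recursive
aws s3 ls s3://sp-shredded/bad/ --recursive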
Our playbook.json and config.hocon follow the sample from the docs:
{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
  "data": {
    "region": "eu-west-1",
    "credentials": {
      "accessKeyId": "AWS_ACCESS_KEY_ID",
      "secretAccessKey": "AWS_SECRET_ACCESS_KEY"
    },
    "steps": [
      {
        "type": "CUSTOM_JAR",
        "name": "S3DistCp enriched data archiving",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
        "arguments": [
          "--src", "SP_LOADER_URI",
          "--dest", "SP_ENRICHED_URI/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
          "--srcPattern", ".*",
          "--outputCodec", "gz"
        ]
      },
      {
        "type": "CUSTOM_JAR",
        "name": "RDB Shredder",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "command-runner.jar",
        "arguments": [
          "spark-submit",
          "--class", "com.snowplowanalytics.snowplow.shredder.Main",
          "--master", "yarn",
          "--deploy-mode", "cluster",
          "s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar",
          "--iglu-config", "{{base64File "resolver.json"}}",
          "--config", "{{base64File "config.hocon"}}"
        ]
      }
    ],
    "tags": [ ]
  }
}
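Once the templates render, the second step effectively boils down to this spark-submit (a sketch only; base64 -w0 assumes GNU coreutils, and the real encoding is done by dataflow-runner's base64File function):

spark-submit \
  --class com.snowplowanalytics.snowplow.shredder.Main \
  --master yarn \
  --deploy-mode cluster \
  s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar \
  --iglu-config "$(base64 -w0 resolver.json)" \
  --config "$(base64 -w0 config.hocon)"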
config.hocon
{
  "name": "myapp",
  "id": "123e4567-e89b-12d3-a456-426655440000",
  "region": "eu-west-1",
  "messageQueue": "messages.fifo",
  "shredder": {
    "input": "SP_ENRICHED_URI",
    "output": "SP_SHREDDED_GOOD_URI",
    "outputBad": "SP_SHREDDED_BAD_URI",
    "compression": "GZIP"
  },
  "formats": {
    "default": "TSV",
    "json": [ ],
    "tsv": [ ],
    "skip": [ ]
  },
  "storage": {
    "type": "redshift",
    "host": "redshift.amazon.com",
    "database": "snowplow",
    "port": 5439,
    "roleArn": "arn:aws:iam::123456789012:role/RedshiftLoadRole",
    "schema": "atomic",
    "username": "storage-loader",
    "password": "secret",
    "jdbc": {"ssl": true},
    "maxError": 10,
    "compRows": 100000
  },
  "steps": ["analyze"],
  "monitoring": {
    "snowplow": null,
    "sentry": null
  }
}
The environment variables are set like this:
SP_ENRICHED_URI: s3://sp-enriched-stage
SP_SHREDDED_GOOD_URI: s3://sp-shredded-stage/good/
SP_SHREDDED_BAD_URI: s3://sp-shredded-stage/bad/
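For completeness, the SP_* placeholders in playbook.json and config.hocon are filled in from those variables before the run, roughly like this (a sketch assuming a plain sed pass over template copies; the exact mechanism and file names here are illustrative):

sed -e "s|SP_ENRICHED_URI|$SP_ENRICHED_URI|g" \
    -e "s|SP_SHREDDED_GOOD_URI|$SP_SHREDDED_GOOD_URI|g" \
    -e "s|SP_SHREDDED_BAD_URI|$SP_SHREDDED_BAD_URI|g" \
    config.hocon.tmpl > config.hocon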
The shredder log did not provide any obvious information about what might have gone wrong:
21/02/11 18:09:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/02/11 18:09:00 WARN DependencyUtils: Skip remote jar s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar.
21/02/11 18:09:01 INFO RMProxy: Connecting to ResourceManager at ip-11-222-59-27.eu-west-1.compute.internal/11.222.59.27:8032
21/02/11 18:09:01 INFO Client: Requesting a new application from cluster with 1 NodeManagers
21/02/11 18:09:01 INFO Configuration: resource-types.xml not found
21/02/11 18:09:01 INFO ResourceUtils: Unable to find 'resource-types.xml'.
21/02/11 18:09:01 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container)
21/02/11 18:09:01 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
21/02/11 18:09:01 INFO Client: Setting up container launch context for our AM
21/02/11 18:09:01 INFO Client: Setting up the launch environment for our AM container
21/02/11 18:09:01 INFO Client: Preparing resources for our AM container
21/02/11 18:09:01 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
21/02/11 18:09:03 INFO Client: Uploading resource file:/mnt/tmp/spark-3459d99c-3757-4ddf-b373-f461a5090dd8/__spark_libs__5331094724240217805.zip -> hdfs://ip-11-222-59-27.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1613066684194_0002/__spark_libs__5331094724240217805.zip
21/02/11 18:09:05 INFO ClientConfigurationFactory: Set initial getObject socket timeout to 2000 ms.
21/02/11 18:09:05 INFO Client: Uploading resource s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar -> hdfs://ip-11-222-59-27.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1613066684194_0002/snowplow-rdb-shredder-0.19.0.jar
21/02/11 18:09:06 INFO S3NativeFileSystem: Opening 's3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar' for reading
21/02/11 18:09:09 INFO Client: Uploading resource file:/mnt/tmp/spark-3459d99c-3757-4ddf-b373-f461a5090dd8/__spark_conf__5236745401879396821.zip -> hdfs://ip-11-222-59-27.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1613066684194_0002/__spark_conf__.zip
21/02/11 18:09:09 INFO SecurityManager: Changing view acls to: hadoop
21/02/11 18:09:09 INFO SecurityManager: Changing modify acls to: hadoop
21/02/11 18:09:09 INFO SecurityManager: Changing view acls groups to:
21/02/11 18:09:09 INFO SecurityManager: Changing modify acls groups to:
21/02/11 18:09:09 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
21/02/11 18:09:09 INFO Client: Submitting application application_1613066684194_0002 to ResourceManager
21/02/11 18:09:09 INFO YarnClientImpl: Submitted application application_1613066684194_0002
21/02/11 18:09:10 INFO Client: Application report for application_1613066684194_0002 (state: ACCEPTED)
21/02/11 18:09:10 INFO Client:
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1613066949805
final status: UNDEFINED
tracking URL: http://ip-11-222-59-27.eu-west-1.compute.internal:20888/proxy/application_1613066684194_0002/
user: hadoop
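If it helps, the next thing we can pull is the full aggregated container log for that application (assuming the cluster is still up or YARN log aggregation is enabled):

yarn logs -applicationId application_1613066684194_0002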