RDB Loader wrong file path

I'm in the process of setting up RDB Loader (post R35) and got stuck on the shredding part.
In my error log I can see that the file location is wrong:

ERROR Client: Application diagnostics message: User class threw exception: java.io.IOException: Not a file: s3://XXX/enriched/archive/run=2021-03-15-18-29-34/2021
It reads the year as well. However, I'm not sure where to change this. I have:

  1. Made sure the date format is correct; after an enriched run I get files in the following structure:
    "s3://xxx/enriched/archive/run=2021-03-15-18-29-34/2021/03/15/"
  2. I don't use any custom dateFormat in the S3 sink

config.hocon looks like

{
  "name": "{{client}}",
  "id": "24cda775-ea2d-4cfd-b4f8-b580670cb465",

  "region": "{{aws_region}}",
  "messageQueue": "{{fifo_que}}",

  "shredder": {
    "input": "s3://{{s3_shredded}}/enriched/archive/",
    "output": "s3://{{s3_shredded}}/good/",
    "outputBad": "s3://{{s3_shredded}}/bad/",
    "compression": "GZIP"
  },

  "formats": {
    "default": "TSV",
    "json": {{shredded_as_jsons}},
    "tsv": {{shredded_as_tsvs}},
    "skip": {{skip_schemas}}
  },

  "storage": {
    "type": "redshift",
    "host": "{{redshift_hostname}}",
    "database": "{{snowplow_database_name}}",
    "port": {{db_port}},
    "roleArn": "{{roleArn}}",
    "schema": "{{schema_name}}",
    "username": "{{username}}",
    "password": "{{password}}",
    "jdbc": {"ssl": true},
    "maxError": 10,
    "compRows": 100000
  },

  "steps": {{steps}},

  "monitoring": {
    "snowplow": {
      "collector": "{{collectorUri}}",
      "appId": "{{appName}}",
      "method": "get"
    },
    "sentry": null
  }
}

and playbook.json is

    {
      "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
      "data": {
        "region": "{{aws_region}}",
        "credentials": {
          "accessKeyId": "default",
          "secretAccessKey": "default"
        },
        "steps": [
          {
            "type": "CUSTOM_JAR",
            "name": "S3DistCp enriched data archiving",
            "actionOnFailure": "CANCEL_AND_WAIT",
            "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
            "arguments": [
                "--src", "s3://{{s3_enriched_bucket}}/enriched/good/",
                "--dest", "s3://{{s3_shredded}}/enriched/archive/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
                "--s3Endpoint", "s3-{{aws_region}}.amazonaws.com",
                "--srcPattern", ".*",
                "--outputCodec", "gz",
                "--deleteOnSuccess"
            ]
          },

          {
            "type": "CUSTOM_JAR",
            "name": "RDB Shredder",
            "actionOnFailure": "CANCEL_AND_WAIT",
            "jar": "command-runner.jar",
            "arguments": [
                "spark-submit",
                "--class", "com.snowplowanalytics.snowplow.shredder.Main",
                "--master", "yarn",
                "--deploy-mode", "cluster",
                "s3://snowplow-hosted-assets-{{aws_region}}/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar",
                "--iglu-config", "{{base64File "/config/iglu_resolver.json"}}",
                "--config", "{{base64File "/config/config.hocon"}}"
            ]
          }
        ],
        "tags": [ ]
      }
    }

Input welcome.
Best
f

Hi @fwahlqvist,

It appears the location isn’t correct. I would rather expect .../archive/run=2021-03-15-18-29-34 instead of .../archive/run=2021-03-15-18-29-34/2021/03/15/. Could you try to relocate the files and see if it helps?
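
If it helps, here is a quick way to check what the shredder actually sees under that run folder. This is a minimal boto3 sketch (the bucket name and run prefix are placeholders based on your error message), listing what sits directly under the run= prefix:

import boto3

# Placeholder bucket/prefix; substitute your own values.
BUCKET = "xxx"
RUN_PREFIX = "enriched/archive/run=2021-03-15-18-29-34/"

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=RUN_PREFIX, Delimiter="/")

# Anything under CommonPrefixes is a sub-folder (e.g. .../2021/) that the
# shredder will trip over; Contents are the plain files it expects to find.
for p in resp.get("CommonPrefixes", []):
    print("sub-prefix:", p["Prefix"])
for obj in resp.get("Contents", []):
    print("file:", obj["Key"], obj["Size"], "bytes")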

Thanks,

Hey @egor, thanks for getting back to me…

This is the current config for enrich and archive (according to the docs). Could you kindly point me in the right direction on what I should update?

"arguments": [
                "--src", "s3://{{s3_enriched_bucket}}/enriched/good/",
                "--dest", "s3://{{s3_shredded}}/enriched/archive/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
                "--s3Endpoint", "s3-{{aws_region}}.amazonaws.com",
                "--srcPattern", ".*",
                "--outputCodec", "gz",
                "--deleteOnSuccess"
            ]

Thanks
F

@fwahlqvist,

These arguments look correct. I think what happens here is that the shredder tries to find files with enriched data in the run=YYYY-MM-DD-hh-mm-ss bucket, but instead it finds a sub-bucket 2021, which in turn has a few more sub-buckets.

It’s a bit unclear where these sub-buckets come from. I suspect you might have an incorrect structure of files in the src bucket - it should contain only files and no sub-buckets. If you move all files from s3://xxx/enriched/archive/run=2021-03-15-18-29-34/2021/03/15/ to s3://{{s3_enriched_bucket}}/enriched/good/ and ensure that there are no sub-buckets, it should likely work as expected.
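
If you prefer to script the relocation, something along these lines should do it. This is a rough boto3 sketch (all bucket names are placeholders) that copies every object out of the dated sub-prefixes into enriched/good/ at the top level and then deletes the originals:

import boto3

# Placeholder names; adjust to your own buckets and run folder.
SRC_BUCKET = "xxx"
SRC_PREFIX = "enriched/archive/run=2021-03-15-18-29-34/"
DST_BUCKET = "my-enriched-bucket"
DST_PREFIX = "enriched/good/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=SRC_PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):
            continue  # skip folder marker objects
        # Drop the run=/yyyy/mm/dd/ part so the file lands directly
        # under enriched/good/ with no sub-folders.
        flat_name = key.rsplit("/", 1)[-1]
        s3.copy_object(
            Bucket=DST_BUCKET,
            Key=DST_PREFIX + flat_name,
            CopySource={"Bucket": SRC_BUCKET, "Key": key},
        )
        s3.delete_object(Bucket=SRC_BUCKET, Key=key)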

Could you also clarify how you produce the enriched data and upload it to S3?

Hey @egor
thanks for getting back to me

The folder structure of the data in enriched comes from the S3 sink and, as you say, has the yyyy/mm/dd/hh structure as per the documentation.

However, to get the S3 sink to work with EMR I needed a custom format, so I have now enabled the below:

  # optional date format prefix for directory pattern
  # eg: {YYYY}/{MM}/{dd}/{HH}
  # dateFormat = "{YYYY}/{MM}/{dd}/{HH}"

I have now updated it to use a custom date format:
dateFormat = "{YYYY}-{MM}-{dd}-{HH}"

(Hopefully useful for someone)

Best
F

Thanks for the additional details and the hint, @fwahlqvist.

As mentioned in the example configuration for the S3 Loader, dateFormat is an optional argument. Based on your goals, you need to make a call whether it should be used or not, and which format to use.

If you are going to consume data directly from outputDirectory (e.g. via Athena), then it’s better to define this argument to take advantage of partitioned data. The same applies if you want a separate folder for every data dump and more control over run folder size.

If you are going to load the data into Redshift or Snowflake then this argument isn’t required (at least for the standard playbooks).

We will look into whether we can make this clearer; I have passed your feedback on to the relevant team members.

Thanks @egor,
to be clear, I had to use a custom dateFormat for EMR, as otherwise the default behaviour was to create a folder structure with yyyy/mm/dd.

However, I can see the files being moved to a folder structure of

run=2021-03-17-11-43-42/

But they are not being shredded…
Of the EMR steps, the first one succeeds but the second one gets stuck in state RUNNING, with the following error log and cluster.json config (PS: the instance type is m4.large).

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
  "data": {
    "name": "dataflow-runner - RDB Shredder",
    "logUri": "s3://{{emr_logs}}",
    "region": "{{aws_region}}",
    "credentials": {
      "accessKeyId": "default",
      "secretAccessKey": "default"
    },
    "roles": {
      "jobflow": "EMR_EC2_DefaultRole",
      "service": "EMR_DefaultRole"
    },
    "ec2": {
      "amiVersion": "6.1.0",
      "keyName": "XXXX",
      "location": {
        "vpc": {
          "subnetId": "{{subnets}}"
        }
      },
      "instances": {
        "master": {
          "type": "{{instance_type_master}}"
        },
        "core": {
          "type": "{{instance_type_core}}",
          "count": 1
        },
        "task": {
          "type": "{{instance_type_task}}",
          "count": 0,
          "bid": "0.0015"
        }
      }
    },
    "tags": [
      {
        "key": "client",
        "value": "com.engineering"
      },
      {
        "key": "job",
        "value": "main"
      }
    ],
    "bootstrapActionConfigs": [
      {
        "name": "Elasticity Bootstrap Action",
        "scriptBootstrapAction": {
          "path": "s3://snowplow-hosted-assets-eu-west-2/common/emr/snowplow-ami4-bootstrap-0.2.0.sh",
          "args": ["1.5"]
        }
      }
    ],
    "configurations": [
      {
        "classification": "core-site",
        "properties": {
          "Io.file.buffer.size": "65536"
        }
      },
      {
        "classification": "mapred-site",
        "properties": {
          "Mapreduce.user.classpath.first": "true"
        }
      }
    ],
    "applications": ["Hadoop", "Spark"]
  }
}



21/03/17 11:51:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/03/17 11:51:58 WARN DependencyUtils: Skip remote jar s3://snowplow-hosted-assets-eu-west-2/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar.
21/03/17 11:51:58 INFO RMProxy: Connecting to ResourceManager at ip-xxxx.eu-west-2.compute.internal/xxxx:8032
21/03/17 11:51:59 INFO Client: Requesting a new application from cluster with 1 NodeManagers
21/03/17 11:51:59 INFO Configuration: resource-types.xml not found
21/03/17 11:51:59 INFO ResourceUtils: Unable to find 'resource-types.xml'.
21/03/17 11:51:59 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (6144 MB per container)
21/03/17 11:51:59 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
21/03/17 11:51:59 INFO Client: Setting up container launch context for our AM
21/03/17 11:51:59 INFO Client: Setting up the launch environment for our AM container
21/03/17 11:51:59 INFO Client: Preparing resources for our AM container
21/03/17 11:51:59 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
21/03/17 11:52:02 INFO Client: Uploading resource file:/mnt/tmp/spark-6a2731aa-3229-4ad0-bc0d-b6b868197bf0/__spark_libs__513458978147608271.zip -> hdfs://ip-10-0-101-238.eu-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1615981741937_0002/__spark_libs__513458978147608271.zip
21/03/17 11:52:06 INFO ClientConfigurationFactory: Set initial getObject socket timeout to 2000 ms.
21/03/17 11:52:06 INFO Client: Uploading resource s3://snowplow-hosted-assets-eu-west-2/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar -> hdfs://ip-10-0-101-238.eu-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1615981741937_0002/snowplow-rdb-shredder-0.19.0.jar
21/03/17 11:52:07 INFO S3NativeFileSystem: Opening 's3://snowplow-hosted-assets-eu-west-2/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar' for reading
21/03/17 11:52:10 INFO Client: Uploading resource file:/mnt/tmp/spark-6a2731aa-3229-4ad0-bc0d-b6b868197bf0/__spark_conf__6123137051064315275.zip -> hdfs://ip-10-0-101-238.eu-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1615981741937_0002/__spark_conf__.zip
21/03/17 11:52:10 INFO SecurityManager: Changing view acls to: hadoop
21/03/17 11:52:10 INFO SecurityManager: Changing modify acls to: hadoop
21/03/17 11:52:10 INFO SecurityManager: Changing view acls groups to: 
21/03/17 11:52:10 INFO SecurityManager: Changing modify acls groups to: 
21/03/17 11:52:10 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
21/03/17 11:52:10 INFO Client: Submitting application application_1615981741937_0002 to ResourceManager
21/03/17 11:52:10 INFO YarnClientImpl: Submitted application application_1615981741937_0002
21/03/17 11:52:11 INFO Client: Application report for application_1615981741937_0002 (state: ACCEPTED)
21/03/17 11:52:11 INFO Client: 
	 client token: N/A
	 diagnostics: AM container is launched, waiting for AM container to Register with RM
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1615981930499
	 final status: UNDEFINED
	 tracking URL: http://ip-xxxx.eu-west-2.compute.internal:20888/proxy/application_1615981741937_0002/
	 user: hadoop
21/03/17 11:52:12 INFO Client: Application report for application_1615981741937_0002 (state: ACCEPTED)
21/03/17 11:52:13 INFO Client: Application report for application_1615981741937_0002 (state: ACCEPTED)
21/03/17 11:52:14 INFO Client: Application report for application_1615981741937_0002 (state: ACCEPTED)
21/03/17 11:52:15 INFO Client: Application report for application_1615981741937_0002 (state: ACCEPTED)
21/03/17 11:52:16 INFO Client: Application report for application_1615981741937_0002 (state: ACCEPTED)
21/03/17 11:52:17 INFO Client: Application report for application_1615981741937_0002 (state: ACCEPTED)
21/03/17 11:52:18 INFO Client: Application report for application_1615981741937_0002 (state: ACCEPTED)
21/03/17 11:52:19 INFO Client: Application report for application_1615981741937_0002 (state: ACCEPTED)
21/03/17 11:52:20 INFO Client: Application report for application_1615981741937_0002 (state: ACCEPTED)
21/03/17 11:52:21 INFO Client: Application report for application_1615981741937_0002 (state: ACCEPTED)
21/03/17 11:52:22 INFO Client: Application report for application_1615981741937_0002 (state: ACCEPTED)
21/03/17 11:52:23 INFO Client: Application report for application_1615981741937_0002 (state: ACCEPTED)
21/03/17 11:52:24 INFO Client: Application report for application_1615981741937_0002 (state: ACCEPTED)
21/03/17 11:52:25 INFO Client: Application report for application_1615981741937_0002 (state: RUNNING)
21/03/17 11:52:25 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: ip-xxxx.eu-west-2.compute.internal
	 ApplicationMaster RPC port: 34039
	 queue: default
	 start time: 1615981930499
	 final status: UNDEFINED
	 tracking URL: http://ip-xxxx.eu-west-2.compute.internal:20888/proxy/application_1615981741937_0002/
	 user: hadoop
21/03/17 11:52:26 INFO Client: Application report for application_1615981741937_0002 (state: RUNNING)
21/03/17 11:52:27 INFO Client: Application report for application_1615981741937_0002 (state: RUNNING)
21/03/17 11:52:28 INFO Client: Application report for application_1615981741937_0002 (state: RUNNING)

Hi @fwahlqvist,

It’s likely the instance type is too small for the amount of data. If that’s the case, you will need to use a bigger cluster (with more core nodes or with more powerful compute nodes). There are some instructions here which can help you find an optimal cluster size.

How many runs do you have in .../enriched/archive/? What is the total size, and the size of each run bucket? It might still be useful to see a tree of these (to ensure it’s the expected structure now).
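
If it helps, here is a rough way to collect those numbers. This is a boto3 sketch (the bucket name is a placeholder) that prints the object count and total size for each run= prefix under enriched/archive/:

import boto3
from collections import defaultdict

BUCKET = "xxx"            # placeholder bucket name
ARCHIVE = "enriched/archive/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

counts = defaultdict(int)
sizes = defaultdict(int)

for page in paginator.paginate(Bucket=BUCKET, Prefix=ARCHIVE):
    for obj in page.get("Contents", []):
        # Group everything by its run=YYYY-MM-DD-hh-mm-ss folder.
        run = obj["Key"][len(ARCHIVE):].split("/", 1)[0]
        counts[run] += 1
        sizes[run] += obj["Size"]

for run in sorted(counts):
    print(f"{run}: {counts[run]} objects, {sizes[run] / 1024:.1f} KiB")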

Best,

Hey @egor,
This is a test setup with 9 runs and about 10 rows of data in each run, and it uses m4.large, so one would hope it’s not that… :slight_smile:

Hi @fwahlqvist,

Sorry for taking some time to get back to you.

We updated our docs website to make it clear that enriched events on S3 should not be partitioned by date.

We also added a diagram that explains the new architecture, and we added details about the algorithm that the shredder uses to infer which folders it needs to shred. Please have a look and let us know if anything is unclear.

We have also released 1.0.0, which is the latest production release.

get stuck in state running

Your configuration looks correct. What is the Spark UI showing? Is there work being done? Can you show the content of stdout for the driver, please?

This is a test set up with 9 runs

You’re testing on a new env with only 10 folders in archive/enriched/ and nothing in shredded/?
