Hey,
we set up Dataflow Runner with two jobs:
- S3DistCp and
- Shredder
We followed this documentation:
R35 Upgrade Guide - Snowplow Docs
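For context, we launch both steps in one transient EMR run roughly like this (a sketch; cluster.json is our EMR cluster config and the file names here are ours):

dataflow-runner run-transient --emr-config cluster.json --emr-playbook playbook.json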
The first job works fine, and the data in our enriched bucket looks like this:
The second job finishes without an error, but there is no data in our S3 buckets s3://sp-shredded/bad and s3://sp-shredded/good.
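For reference, the shredded buckets can be listed with the AWS CLI like this (nothing comes back for either prefix):

aws s3 ls s3://sp-shredded/good/ --recursive
aws s3 ls s3://sp-shredded/bad/ --recursive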
Our playbook.json and config.hocon follow the sample from the docs:
{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
  "data": {
    "region": "eu-west-1",
    "credentials": {
      "accessKeyId": "AWS_ACCESS_KEY_ID",
      "secretAccessKey": "AWS_SECRET_ACCESS_KEY"
    },
    "steps": [
      {
        "type": "CUSTOM_JAR",
        "name": "S3DistCp enriched data archiving",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
        "arguments": [
          "--src", "SP_LOADER_URI",
          "--dest", "SP_ENRICHED_URI/run={{nowWithFormat "2006-01-02-15-04-05"}}/",
          "--srcPattern", ".*",
          "--outputCodec", "gz"
        ]
      },
      {
        "type": "CUSTOM_JAR",
        "name": "RDB Shredder",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "command-runner.jar",
        "arguments": [
          "spark-submit",
          "--class", "com.snowplowanalytics.snowplow.shredder.Main",
          "--master", "yarn",
          "--deploy-mode", "cluster",
          "s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar",
          "--iglu-config", "{{base64File "resolver.json"}}",
          "--config", "{{base64File "config.hocon"}}"
        ]
      }
    ],
    "tags": [ ]
  }
}
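Once the templates render, the second step effectively boils down to this spark-submit (a sketch only; base64 -w0 assumes GNU coreutils, and the real encoding is done by dataflow-runner's base64File function):

spark-submit \
  --class com.snowplowanalytics.snowplow.shredder.Main \
  --master yarn \
  --deploy-mode cluster \
  s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar \
  --iglu-config "$(base64 -w0 resolver.json)" \
  --config "$(base64 -w0 config.hocon)"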
config.hocon
{
  "name": "myapp",
  "id": "123e4567-e89b-12d3-a456-426655440000",
  "region": "eu-west-1",
  "messageQueue": "messages.fifo",
  "shredder": {
    "input": "SP_ENRICHED_URI",
    "output": "SP_SHREDDED_GOOD_URI",
    "outputBad": "SP_SHREDDED_BAD_URI",
    "compression": "GZIP"
  },
  "formats": {
    "default": "TSV",
    "json": [ ],
    "tsv": [ ],
    "skip": [ ]
  },
  "storage": {
    "type": "redshift",
    "host": "redshift.amazon.com",
    "database": "snowplow",
    "port": 5439,
    "roleArn": "arn:aws:iam::123456789012:role/RedshiftLoadRole",
    "schema": "atomic",
    "username": "storage-loader",
    "password": "secret",
    "jdbc": {"ssl": true},
    "maxError": 10,
    "compRows": 100000
  },
  "steps": ["analyze"],
  "monitoring": {
    "snowplow": null,
    "sentry": null
  }
}
The environment variables are set like this:
SP_ENRICHED_URI: s3://sp-enriched-stage
SP_SHREDDED_GOOD_URI: s3://sp-shredded-stage/good/
SP_SHREDDED_BAD_URI: s3://sp-shredded-stage/bad/
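For completeness, the SP_* placeholders in playbook.json and config.hocon are filled in from those variables before the run, roughly like this (a sketch assuming a plain sed pass over template copies; the exact mechanism and file names here are illustrative):

sed -e "s|SP_ENRICHED_URI|$SP_ENRICHED_URI|g" \
    -e "s|SP_SHREDDED_GOOD_URI|$SP_SHREDDED_GOOD_URI|g" \
    -e "s|SP_SHREDDED_BAD_URI|$SP_SHREDDED_BAD_URI|g" \
    config.hocon.tmpl > config.hocon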
The shredder log did not provide any obvious information about what might have gone wrong:
21/02/11 18:09:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/02/11 18:09:00 WARN DependencyUtils: Skip remote jar s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar.
21/02/11 18:09:01 INFO RMProxy: Connecting to ResourceManager at ip-11-222-59-27.eu-west-1.compute.internal/11.222.59.27:8032
21/02/11 18:09:01 INFO Client: Requesting a new application from cluster with 1 NodeManagers
21/02/11 18:09:01 INFO Configuration: resource-types.xml not found
21/02/11 18:09:01 INFO ResourceUtils: Unable to find 'resource-types.xml'.
21/02/11 18:09:01 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container)
21/02/11 18:09:01 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
21/02/11 18:09:01 INFO Client: Setting up container launch context for our AM
21/02/11 18:09:01 INFO Client: Setting up the launch environment for our AM container
21/02/11 18:09:01 INFO Client: Preparing resources for our AM container
21/02/11 18:09:01 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
21/02/11 18:09:03 INFO Client: Uploading resource file:/mnt/tmp/spark-3459d99c-3757-4ddf-b373-f461a5090dd8/__spark_libs__5331094724240217805.zip -> hdfs://ip-11-222-59-27.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1613066684194_0002/__spark_libs__5331094724240217805.zip
21/02/11 18:09:05 INFO ClientConfigurationFactory: Set initial getObject socket timeout to 2000 ms.
21/02/11 18:09:05 INFO Client: Uploading resource s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar -> hdfs://ip-11-222-59-27.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1613066684194_0002/snowplow-rdb-shredder-0.19.0.jar
21/02/11 18:09:06 INFO S3NativeFileSystem: Opening 's3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.19.0.jar' for reading
21/02/11 18:09:09 INFO Client: Uploading resource file:/mnt/tmp/spark-3459d99c-3757-4ddf-b373-f461a5090dd8/__spark_conf__5236745401879396821.zip -> hdfs://ip-11-222-59-27.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1613066684194_0002/__spark_conf__.zip
21/02/11 18:09:09 INFO SecurityManager: Changing view acls to: hadoop
21/02/11 18:09:09 INFO SecurityManager: Changing modify acls to: hadoop
21/02/11 18:09:09 INFO SecurityManager: Changing view acls groups to:
21/02/11 18:09:09 INFO SecurityManager: Changing modify acls groups to:
21/02/11 18:09:09 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
21/02/11 18:09:09 INFO Client: Submitting application application_1613066684194_0002 to ResourceManager
21/02/11 18:09:09 INFO YarnClientImpl: Submitted application application_1613066684194_0002
21/02/11 18:09:10 INFO Client: Application report for application_1613066684194_0002 (state: ACCEPTED)
21/02/11 18:09:10 INFO Client:
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1613066949805
final status: UNDEFINED
tracking URL: http://ip-11-222-59-27.eu-west-1.compute.internal:20888/proxy/application_1613066684194_0002/
user: hadoop
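If it helps, the next thing we can pull is the full aggregated container log for that application (assuming the cluster is still up or YARN log aggregation is enabled):

yarn logs -applicationId application_1613066684194_0002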