I’m attempting to reprocess bad events following the steps outlined here, and then load them into Redshift.
I’ve successfully reprocessed the bad events, with the results sent to the bucket s3://q-snowplow-recovered/recovered. Within that bucket there are three files: _SUCCESS, part-00000, and part-00001.
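(As a sanity check, listing the prefix with the standard AWS CLI confirms those three objects; this assumes the CLI is installed and has credentials for the bucket, and isn’t part of the recovery steps themselves:)

# Sanity check only: list the recovered objects under the prefix.
aws s3 ls s3://q-snowplow-recovered/recovered/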
I’ve updated the config file for the EmrEtlRunner to the following:
raw:
  in: # Multiple in buckets are permitted
    - s3://q-snowplow-does-not-exist # IGNORED e.g. s3://my-in-bucket
  processing: s3://q-snowplow-recovered/recovered
When I then attempt to run the EmrEtlRunner as follows (hiding the full paths for brevity):
snowplow-emr-etl-runner --config config-recovery.yml --resolver iglu_resolver.json --enrichments enrichments --skip staging
I get the following errors during the “Elasticity S3DistCp Step: Raw S3 -> HDFS” step of the EMR job:
2017-08-21 18:20:14,146 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Running with args: --src s3://q-snowplow-recovered/recovered/ --dest hdfs:///local/snowplow/raw-events/ --s3Endpoint s3.amazonaws.com --groupBy .*\.([0-9]+-[0-9]+-[0-9]+)-[0-9]+\..* --targetSize 128 --outputCodec lzo
2017-08-21 18:20:16,156 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): S3DistCp args: --src s3://q-snowplow-recovered/recovered/ --dest hdfs:///local/snowplow/raw-events/ --s3Endpoint s3.amazonaws.com --groupBy .*\.([0-9]+-[0-9]+-[0-9]+)-[0-9]+\..* --targetSize 128 --outputCodec lzo
2017-08-21 18:20:33,157 INFO com.amazon.ws.emr.hadoop.fs.EmrFileSystem (main): Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
2017-08-21 18:20:33,961 INFO amazon.emr.metrics.MetricsSaver (main): MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: false maxMemoryMb: 3072 maxInstanceCount: 500 lastModified: 1503339465506
2017-08-21 18:20:33,962 INFO amazon.emr.metrics.MetricsSaver (main): Created MetricsSaver j-LIBQ62MBKWUX:i-032d6420d929c9538:RunJar:06282 period:60 /mnt/var/em/raw/i-032d6420d929c9538_20170821_RunJar_06282_raw.bin
2017-08-21 18:20:36,908 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Using output path 'hdfs:/tmp/711f7e02-88b7-4d88-8568-b06119b03d32/output'
2017-08-21 18:20:37,879 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): DefaultAWSCredentialsProviderChain is used to create AmazonS3Client. KeyId: ASIAIXI7OEIJZACG2PFQ
2017-08-21 18:20:37,879 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): AmazonS3Client setEndpoint s3.amazonaws.com
2017-08-21 18:20:38,166 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Skipping key 'recovered/' because it ends with '/'
2017-08-21 18:20:38,166 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Created 0 files to copy 0 files
2017-08-21 18:20:38,263 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Reducer number: 10
2017-08-21 18:20:38,964 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-40-121.ec2.internal/172.31.40.121:8032
2017-08-21 18:20:42,294 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): Cleaning up the staging area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1503339444086_0001
2017-08-21 18:20:42,321 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Try to recursively delete hdfs:/tmp/711f7e02-88b7-4d88-8568-b06119b03d32/tempspace
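One thing I notice in that output: S3DistCp reports “Created 0 files to copy 0 files”, and its --groupBy pattern (.*\.([0-9]+-[0-9]+-[0-9]+)-[0-9]+\..*) doesn’t appear to match file names like part-00000 at all, since they contain no dots. A quick local check with plain grep (just my assumption about how groupBy is applied to the key names) comes back empty:

# No output: part-00000 contains no '.', so the groupBy pattern cannot match it.
printf 'part-00000\n' | grep -E '.*\.([0-9]+-[0-9]+-[0-9]+)-[0-9]+\..*'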
I’ve tried running with several variants of raw.processing in the config, to no avail (a way to script these attempts is sketched after the list):
s3://q-snowplow-recovered/recovered
s3://q-snowplow-recovered/recovered/
s3://q-snowplow-recovered
s3://q-snowplow-recovered/
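To save retyping between attempts, the variants can be scripted, e.g. (a sketch only; config-recovery.template.yml with a __PROCESSING__ placeholder is a hypothetical helper I’m assuming, not part of the documented setup):

# Try each candidate prefix as raw.processing and re-run the job.
for prefix in \
  s3://q-snowplow-recovered/recovered \
  s3://q-snowplow-recovered/recovered/ \
  s3://q-snowplow-recovered \
  s3://q-snowplow-recovered/
do
  # Substitute the candidate prefix into the config, then re-run.
  sed "s|__PROCESSING__|$prefix|" config-recovery.template.yml > config-recovery.yml
  snowplow-emr-etl-runner --config config-recovery.yml --resolver iglu_resolver.json --enrichments enrichments --skip staging
done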
Thanks in advance for your help!