Hey!
I’d like to reprocess a bunch of bad rows that I collected with the Clojure collector.
I read the JSON for every bad row that I want to replay and extracted the `line` value. I modified the faulty content in the `line` value and wrote every repaired row to a new file, in another bucket.
NB: I’m working in Python.
So I have a new bucket with a reprocessing file:
s3://reprocessing/2016-05-09/random-name0123456
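Roughly, my repair script does something like this (a simplified sketch: the file paths are placeholders, and `fix_line` stands in for my actual repair logic):

```python
import json

def repair_bad_rows(bad_rows_path, output_path, fix_line):
    """Read newline-delimited bad-row JSON, extract each "line" value,
    apply a repair function, and write the repaired raw rows out."""
    with open(bad_rows_path) as src, open(output_path, "w") as dst:
        for raw in src:
            raw = raw.strip()
            if not raw:
                continue
            bad_row = json.loads(raw)
            line = bad_row["line"]      # the original raw collector payload
            repaired = fix_line(line)   # fix the faulty content
            dst.write(repaired + "\n")

# Example with a trivial "fix" that just strips trailing whitespace:
# repair_bad_rows("bad_rows.json", "repaired.log", lambda l: l.rstrip())
```

The output file is then what I uploaded to the reprocessing bucket.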
I changed the `config-repro.yml` file to have `in` pointing to my reprocessing files in s3://reprocessing/2016-05-09.
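For context, the relevant part of my `config-repro.yml` looks something like this (the processing path below is a placeholder, not my real bucket):

```yaml
aws:
  s3:
    buckets:
      raw:
        in: s3://reprocessing/2016-05-09          # my repaired rows
        processing: s3://reprocessing/processing  # placeholder path
```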
First, the staging step was silently failing. I changed the log format in `config-repro.yml` from `tomcat-clj` to `cloudfront`, and the staging step then succeeded.
But now I encounter another error, during the EMR flow:

```
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-31-17-210.ec2.internal:8020/tmp/27690cd4-f2f6-47cf-aa16-b623325f4bd3/files
```
I read about it, and most of the time this is caused by the in/processing bucket being empty. But mine is not: the staging was successful and I can see the file in the processing “folder”.
I’ve noticed that the Clojure collector’s log format (syntax) is quite different from the `line` value in the bad-row JSON. I’m wondering if this is the cause of my issues and how I should handle it. Should I try to rebuild a log file with the same syntax as the Clojure collector’s?
Otherwise, does anyone have any idea of what’s happening?