Reprocessing bad rows from the Clojure Collector


I’d like to reprocess a bunch of bad rows that I collected with the Clojure collector.

I read the JSON for every bad row that I want to replay and extracted the line value.
I fixed the faulty content in each line value and wrote every repaired row to a new file in another bucket.
NB: I’m working in Python.
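For anyone following along, here is a minimal sketch of that extract-and-repair step. It assumes each bad row is stored as one JSON object per line with the raw collector payload under the line key. The fix_line function and the "BROKEN" marker are hypothetical placeholders for whatever repair your rows actually need.

```python
import json
from typing import Callable, Iterable, Iterator


def fix_line(line: str) -> str:
    """Hypothetical repair: swap in the fix your own bad rows need."""
    return line.replace("BROKEN", "fixed")


def repaired_lines(bad_rows: Iterable[str],
                   fix: Callable[[str], str] = fix_line) -> Iterator[str]:
    """Yield the repaired raw line for every bad-row JSON record."""
    for record in bad_rows:
        record = record.strip()
        if not record:
            continue  # skip blank lines between records
        bad_row = json.loads(record)
        # Assumption: the original collector line lives under "line".
        yield fix(bad_row["line"])


def write_repaired(src_path: str, dst_path: str,
                   fix: Callable[[str], str] = fix_line) -> int:
    """Read bad rows from src_path, write repaired lines to dst_path.

    Returns the number of lines written.
    """
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in repaired_lines(src, fix):
            dst.write(line + "\n")
            count += 1
    return count
```

Calling write_repaired("bad-rows.json", "repaired.log") would produce a local file of repaired lines (both file names are made up here), which can then be uploaded to the reprocessing bucket.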

So I now have a new bucket containing a reprocessing file.

I changed the config-repro.yml file so that the in bucket points to my reprocessing files in s3://reprocessing/2016-05-09.

First, the staging was silently failing. I changed the log format in config-repro.yml from tomcat-clj to cloudfront and the staging step was successful.

But now, I encounter another error, during the EMR flow:

Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-31-17-210.ec2.internal:8020/tmp/27690cd4-f2f6-47cf-aa16-b623325f4bd3/files

I read about it and, most of the time, this is caused by the in/processing bucket being empty. But mine is not: the staging was successful and I can see the file in the processing “folder”.

I’ve noticed that the Clojure log format (syntax) is very different from the line value in the bad rows JSON. I’m wondering if this is the cause of my issues and how I should handle it. Should I try to rebuild a log file with the same syntax as Clojure’s?

Otherwise, does anyone have any idea what’s happening?

Hi @timmycarbone,

Ah - the logic for moving Clojure collector files from in to staging is rather involved; that’s probably why it was breaking when you attempted to rerun from the beginning. Your attempted fix (changing the logfile format from tomcat-clj to cloudfront) appeared to work because the logic for moving CloudFront collector files into staging is much simpler - but it then broke down in EMR because the CloudFront and Clojure collectors have fundamentally different logfile formats.

The correct fix (it’s almost impossible to know this) is to copy your extracted+fixed Clojure log lines straight into processing, and then run the pipeline with --skip staging.
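As a rough sketch of that copy step in Python: the function below takes an S3 client (for example one created with boto3.client("s3"), which is an assumption about your setup) and copies every object under a source prefix into the processing location. All bucket names and prefixes in the usage comment are hypothetical; take the real processing location from your own config file.

```python
def copy_to_processing(s3, src_bucket, src_prefix, dst_bucket, dst_prefix):
    """Copy every object under src_prefix into dst_prefix.

    `s3` is expected to behave like a boto3 S3 client, i.e. to provide
    list_objects_v2 and copy_object. Returns the destination keys.
    """
    copied = []
    kwargs = {"Bucket": src_bucket, "Prefix": src_prefix}
    while True:
        page = s3.list_objects_v2(**kwargs)
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Keep the part of the key after the source prefix.
            dst_key = dst_prefix + key[len(src_prefix):]
            s3.copy_object(
                Bucket=dst_bucket,
                Key=dst_key,
                CopySource={"Bucket": src_bucket, "Key": key},
            )
            copied.append(dst_key)
        if not page.get("IsTruncated"):
            break
        # More results remain: ask for the next page.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
    return copied


# Hypothetical usage (names are placeholders, not real buckets):
#   import boto3
#   copy_to_processing(boto3.client("s3"),
#                      "reprocessing", "2016-05-09/",
#                      "my-processing-bucket", "processing/")
```

Once the files are sitting in processing, the pipeline run with --skip staging picks them up from there.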

Let us know if that works for you!

BTW - we are working on an update to our Hadoop Bad Rows job, which will let you write a piece of arbitrary JavaScript to fix bad source lines. It should make this sort of processing a lot quicker…

Hey @alex!

It worked! Thank you for the tips and explanation! :slight_smile:

Eager to see what the Hadoop Bad Rows job will allow us to do!

Have a great day!

Glad it worked! Stay tuned for the Hadoop Bad Rows release hopefully within the fortnight…