Reprocessing bad rows from the Clojure Collector

Timmycarbone · May 9, 2016, 4:16pm

Hey!

I’d like to reprocess a bunch of bad rows that I collected with the Clojure collector.

I read the JSON for every bad row that I want to replay and extracted the line value.
I fixed the faulty content in each line value and wrote every repaired row to a new file in another bucket.
NB: I’m working in Python.
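
For readers following along, the extract-and-repair step described above might look roughly like the sketch below. It assumes each bad row is stored as one JSON object per line with a "line" key holding the raw collector line (the thread only confirms that the JSON has a line value). The function names and the repair rule in fix_line are hypothetical placeholders, not part of Snowplow.

```python
import json


def extract_lines(bad_rows_text, repair):
    """Yield repaired raw lines from newline-delimited bad-row JSON.

    Assumption: one JSON object per line, each with a "line" key.
    `repair` is a caller-supplied function; returning None drops the row.
    """
    for raw in bad_rows_text.splitlines():
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between records
        bad_row = json.loads(raw)
        fixed = repair(bad_row["line"])
        if fixed is not None:
            yield fixed


def fix_line(line):
    # Hypothetical repair rule: replace a known-bad token.
    # The real fix depends on why the row was rejected.
    return line.replace("BROKEN", "fixed")


def write_repaired(bad_rows_text, out_path, repair=fix_line):
    """Write one repaired line per row to a local file and return the count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for fixed in extract_lines(bad_rows_text, repair):
            out.write(fixed + "\n")
            count += 1
    return count
```

The resulting local file can then be uploaded to the reprocessing bucket with whichever S3 tool you prefer.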

So I have a new bucket with a reprocessing file:
s3://reprocessing/2016-05-09/random-name0123456

I changed config-repro.yml so that the in bucket points to my reprocessing files in s3://reprocessing/2016-05-09.

At first, the staging step was failing silently. After I changed the log format in config-repro.yml from tomcat-clj to cloudfront, the staging step succeeded.

But now, I encounter another error, during the EMR flow:

Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-31-17-210.ec2.internal:8020/tmp/27690cd4-f2f6-47cf-aa16-b623325f4bd3/files

From what I have read, this error is usually caused by the in/processing bucket being empty. Mine is not: the staging step succeeded and I can see the file in the processing “folder”.

I’ve noticed that the Clojure collector’s log format (syntax) is very different from the line value in the bad rows’ JSON. I’m wondering whether this is the cause of my issues and how I should handle it. Should I try to rebuild a log file with the same syntax as the Clojure collector’s?

Otherwise, does anyone have any idea what’s happening?

alex · May 9, 2016, 10:06pm

Hi @timmycarbone,

Ah - the logic for moving Clojure collector files from in to staging is rather involved; that’s probably why it was breaking when you attempted to rerun from the beginning. Your attempted fix (changing the logfile format from tomcat-clj to cloudfront) appeared to work because the logic for moving CloudFront collector files into staging is much simpler - but it then broke down in EMR because the CloudFront and Clojure collectors have fundamentally different logfile formats.

The correct fix (it’s almost impossible to know this) is to copy your extracted+fixed Clojure log lines straight into processing, and then run the pipeline with --skip staging.
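
As a rough illustration of the copy step, here is a minimal Python sketch that copies every object under a source prefix into the processing location. It assumes a boto3 S3 client is passed in (created elsewhere with boto3.client("s3")). The function name, the destination bucket and the prefixes are hypothetical; use whatever processing location your own config file defines.

```python
def copy_to_processing(s3, src_bucket, src_prefix, dst_bucket, dst_prefix):
    """Copy all objects under src_prefix into dst_prefix and return the new keys.

    `s3` is expected to behave like a boto3 S3 client: this uses its
    list_objects_v2 paginator and its copy_object call.
    """
    copied = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src_bucket, Prefix=src_prefix):
        # Pages with no matching objects have no "Contents" key.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Keep the file name, swap the source prefix for the target one.
            dst_key = dst_prefix + key[len(src_prefix):]
            s3.copy_object(
                Bucket=dst_bucket,
                Key=dst_key,
                CopySource={"Bucket": src_bucket, "Key": key},
            )
            copied.append(dst_key)
    return copied


# Example call (bucket and prefix names on the destination side are made up):
# import boto3
# copy_to_processing(boto3.client("s3"), "reprocessing", "2016-05-09/",
#                    "my-processing-bucket", "processing/")
```

Once the files are in processing, run the pipeline with --skip staging as described above.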

Let us know if that works for you!

BTW - we are working on an update to our Hadoop Bad Rows job, which will let you write a piece of arbitrary JavaScript to fix bad source lines. It should make this sort of processing a lot quicker…

Timmycarbone · May 10, 2016, 10:28am

Hey @alex!

It worked! Thank you for the tips and explanation!

Eager to see what the Hadoop Bad Rows job will allow us to do!

Have a great day!

alex · May 10, 2016, 10:45am

Glad it worked! Stay tuned for the Hadoop Bad Rows release hopefully within the fortnight…
