I’ve come across a couple of topics about missing SQS messages and “shredding_complete.json” on this Discourse, but they’re quite old and none mention this specific observation, so I thought I’d create a new one.
I’m running Batch Transform on EMR (Spark). I’m using
snowplow-transformer-batch-5.0.0. Whenever the transformation step has multiple application attempts, even if the transformation completes, the “shredding_complete.json” file isn’t created and no message is sent to SQS.
Note that when the application runs in a single attempt, everything works fine: we get the message in SQS and the “shredding_complete.json” file, so I would rule out AWS permissions.
I’ve been digging through all kinds of logs (EMR, containers, etc.) but couldn’t find any mention of SQS or “shredding_complete.json”.
If anyone has any idea or any indication of where I could look to find the problem, that would be amazing. I find it quite rough to navigate Spark and EMR logs, so even pointing me to where the logs of the “sending SQS message” part would be located would be very useful.
Thanks a lot!
Please can you explain what you mean by “the transformation step has multiple application attempts”?
I’m going to guess you mean this sequence of events:
- You run the EMR job, it produces a few output files, and then ultimately ends in a failure
- You re-run the EMR job for the same batch. The EMR job ends with a success, but there is no SQS message.
What’s going on here is… step 1 of that sequence has produced what we call “incomplete shredding” (or incomplete transformation). Meaning it produced a directory with some output files, but no “shredding_complete.json”.
When the batch transformer next runs (step 2), it is designed to find and process only completely fresh batches; it skips over any batch that is either fully complete or incompletely shredded. So in this case, step 2 appears to end in success but it hasn’t actually done anything. It did not find any fresh batch, so it did not process any events.
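Roughly, the discovery logic works like this. This is just a sketch to illustrate the behaviour, not the actual transformer code, and the function and return values are made up:

```python
def classify_batch(output_keys):
    """Sketch of how the transformer decides whether a batch is fresh.
    output_keys: list of object keys found under the batch's output prefix."""
    if not output_keys:
        # No output directory at all -> fresh batch, will be processed
        return "fresh"
    if any(k.endswith("shredding_complete.json") for k in output_keys):
        # Completion marker present -> already done, skipped
        return "complete"
    # Output exists but no completion marker -> incomplete shredding, also skipped
    return "incomplete"

print(classify_batch([]))                                          # fresh
print(classify_batch(["part-0001.txt"]))                           # incomplete
print(classify_batch(["part-0001.txt", "shredding_complete.json"]))  # complete
```

The key point is that an incomplete batch is skipped just like a complete one, which is why your second run “succeeds” without doing anything.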
The solution is… in between steps 1 and 2, you should completely delete the output directory of the incomplete batch. This way, when the transformer next runs, it discovers the batch as a fresh new batch.
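For example, assuming your output is on S3 (the bucket name and run folder below are made up, so substitute the actual path of the incomplete batch):

```shell
# Hypothetical path -- replace with your transformer's output bucket and the
# run directory of the incomplete batch. Deleting the whole run directory lets
# the transformer discover it as a fresh batch on the next run.
aws s3 rm s3://my-transformed-bucket/transformed/run=2023-05-01-00-00-00/ --recursive
```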
I hope I’ve understood your problem correctly. But if you think your problem is something else, then I’m happy to take another go at troubleshooting it!
I think what you described applies, yes.
But what I mean by “application attempt” is that within the same EMR run, you have different steps. Each step runs an application. In our case, the transform step runs the event transformation in Spark.
Now, I’m not too familiar with this, but within Spark I think there’s a retry mechanism for when an application fails. When this happens, the EMR run is marked as successful but we’re missing the “shredding_complete.json” file.
See this screenshot of the Spark History Server UI:
So I think what you’re saying applies here, but it’s happening at the Spark level, where I can’t just delete the output directory between two attempts.
Alright, so I need to make sure that Spark completes properly in the first attempt. I’m really bad with those Spark memory configurations, but I’ll try things out. I found the cheat sheet for Spark config but couldn’t land good results with it yet. Will keep trying.
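For reference, the first thing I’m going to try (just my reading of the Spark docs, not something confirmed for the transformer): the retry seems to be YARN re-running the application master, and `spark.yarn.maxAppAttempts` caps how many attempts are allowed. Setting it to 1 should make a failed first attempt fail the EMR step outright instead of silently retrying:

```shell
# Sketch: add this conf to the spark-submit command of the EMR transform step.
# With maxAppAttempts=1, a driver failure fails the step immediately, so the
# incomplete output can be cleaned up and re-run instead of getting a
# misleading "success" with no shredding_complete.json.
spark-submit \
  --conf spark.yarn.maxAppAttempts=1 \
  <your existing transformer jar and arguments>
```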
@Timmycarbone You might find this blog helpful.