We have had a similar issue, although mostly with the Hadoop Shred job - we raised it with AWS and heard back:
I’ve been looking at the clusters that you sent with another engineer
here and we believe that this is an instance-controller issue. There are
certain rare situations if there’s a long process name, it can break
instance-controller which looks like what’s happening here. A couple of
your command includes a very long argument and we believe that’s what’s
causing the instability. This exists in emr-4.3.0 and older.
We have been upgrading jobs to emr-4.5.0 and so far so good - incidence of this problem seem to have declined. Can you try the same on your side and report back?