Hi! An error occurred while processing data on an EMR cluster (RDB Transformer).
RDB Transformer - step type: Spark submit, stderr log:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2322.1 in stage 4.0 (TID 6646) can not write to output file: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://tradingview-snowplow-streaming/transformed/run=2023-08-15-23-47-38/output=good/vendor=com.snowplowanalytics.mobile/name=application/format=tsv/model=1/part-02322-d560b783-09cc-4af9-bed3-638948ee1fb6.c000.txt.gz
I am using RDB Transformer version 4.2.0: s3://snowplow-hosted-assets/4-storage/transformer-batch/snowplow-transformer-batch-4.2.0.jar
The error has already happened 3-4 times over the past few months. When it occurs, the shredding_complete.json file is also not created.
How do I get rid of the error?
My current workaround is to delete the unsuccessful transformed directory and re-run the EMR cluster on the same data (see the sketch below); after the restart, EMR processes the data successfully.
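For reference, the cleanup step looks roughly like this. It is a minimal boto3 sketch rather than the exact script I run; the bucket and run prefix are copied from the error log above, and `delete_failed_run` is just an illustrative helper name:

```python
# Minimal sketch of the manual cleanup before re-running the transformer.
# Assumes boto3; bucket and run prefix are taken from the error log above.
import boto3

BUCKET = "tradingview-snowplow-streaming"
RUN_PREFIX = "transformed/run=2023-08-15-23-47-38/"  # the failed run folder


def delete_failed_run(bucket: str, prefix: str) -> int:
    """Delete every object under the failed run prefix so the transformer
    can be re-run on the same window without FileAlreadyExistsException."""
    s3 = boto3.resource("s3")
    deleted = 0
    # The collection delete() handles pagination and batches the requests.
    for response in s3.Bucket(bucket).objects.filter(Prefix=prefix).delete():
        deleted += len(response.get("Deleted", []))
    return deleted


if __name__ == "__main__":
    count = delete_failed_run(BUCKET, RUN_PREFIX)
    print(f"Deleted {count} objects under s3://{BUCKET}/{RUN_PREFIX}")
```

Once the prefix is empty, I start the transformer on the same batch again and it completes normally.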
What could this error be related to, and is it possible to prevent it from recurring?