An error occurred while the RDB Transformer was processing data on an EMR cluster:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2684.1 in stage 4.0 (TID 7538) can not write to output file: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://bucket-name/directory_name/run=2023-05-24-13-01-43/output=good/vendor=com.snowplowanalytics.mobile/name=application/format=tsv/model=1/part-02684-62c861b9-eda4-4c36-b3d9-0d3b9d28870e.c000.txt.gz
I am using RDB-transformer version: s3://snowplow-hosted-assets/4-storage/transformer-batch/snowplow-transformer-batch-4.2.0.jar
I have seen this error only once before in several months of running Snowplow Streaming.
How do I get rid of the error?
As a workaround, I delete the output directory of the failed transformer run and rerun the EMR cluster on the same data. After that, the error does not recur.
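For reference, here is a minimal sketch of that cleanup step in Python. It is not part of Snowplow: the function names `failed_run_location` and `delete_run_folder` are hypothetical, and it assumes the failed run folder is the `run=...` directory that appears in the exception text. The S3 client is passed in as a parameter and is assumed to behave like a boto3 S3 client (`get_paginator("list_objects_v2")` and `delete_objects`).

```python
import re

# Matches the S3 URI in the exception text, up to and including the run=... folder.
RUN_PREFIX_RE = re.compile(r"s3://(?P<bucket>[^/\s]+)/(?P<prefix>\S*?run=[^/\s]+/)")


def failed_run_location(error_text):
    """Return (bucket, key_prefix) of the run folder named in the error text."""
    match = RUN_PREFIX_RE.search(error_text)
    if match is None:
        raise ValueError("no s3://.../run=.../ path found in error text")
    return match.group("bucket"), match.group("prefix")


def delete_run_folder(s3_client, bucket, prefix):
    """Delete every object under the prefix; return the number of keys deleted."""
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page lists at most 1000 keys, which is also the delete_objects limit.
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted
```

With boto3 installed, the two functions could be combined as `delete_run_folder(boto3.client("s3"), *failed_run_location(message))`, where `message` holds the exception text. It is worth printing the bucket and prefix and checking them before deleting anything.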
What could be causing the error? Is it possible to prevent it from happening again?