Hi! An error occurred while processing data on an EMR cluster (RDB Transformer).
RDB Transformer - step type: Spark submit, stderr log:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2322.1 in stage 4.0 (TID 6646) can not write to output file: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://tradingview-snowplow-streaming/transformed/run=2023-08-15-23-47-38/output=good/vendor=com.snowplowanalytics.mobile/name=application/format=tsv/model=1/part-02322-d560b783-09cc-4af9-bed3-638948ee1fb6.c000.txt.gz
I am using RDB Transformer version 4.2.0: s3://snowplow-hosted-assets/4-storage/transformer-batch/snowplow-transformer-batch-4.2.0.jar
The error has already happened 3-4 times over the past few months. When it occurs, the shredding_complete.json file is also not created.
How do I get rid of the error?
My current workaround is to delete the unsuccessful transformed directory and re-run the EMR cluster on the same data (see the sketch below); after the restart, EMR processes the data successfully.
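For reference, the cleanup step looks roughly like this. It is a minimal boto3 sketch rather than the exact script I run; the bucket and run prefix are copied from the error log above, and `delete_failed_run` is just an illustrative helper name:

```python
# Minimal sketch of the manual cleanup before re-running the transformer.
# Assumes boto3; bucket and run prefix are taken from the error log above.
import boto3

BUCKET = "tradingview-snowplow-streaming"
RUN_PREFIX = "transformed/run=2023-08-15-23-47-38/"  # the failed run folder


def delete_failed_run(bucket: str, prefix: str) -> int:
    """Delete every object under the failed run prefix so the transformer
    can be re-run on the same window without FileAlreadyExistsException."""
    s3 = boto3.resource("s3")
    deleted = 0
    # The collection delete() handles pagination and batches the requests.
    for response in s3.Bucket(bucket).objects.filter(Prefix=prefix).delete():
        deleted += len(response.get("Deleted", []))
    return deleted


if __name__ == "__main__":
    count = delete_failed_run(BUCKET, RUN_PREFIX)
    print(f"Deleted {count} objects under s3://{BUCKET}/{RUN_PREFIX}")
```

Once the prefix is empty, I start the transformer on the same batch again and it completes normally.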
What could this error be related to, and is it possible to prevent it from recurring?