Rdb_load failing at analyze step

thedstrom · August 27, 2020, 5:56pm

Some context about my setup: I am running EmrEtlRunner r119 in stream enrich mode on a persistent cluster loading into Redshift. It is running every 20 minutes with a lock to ensure only one run at a time. Runs are failing at the rdb_load step at least once or twice a week with this error:

   ERROR: Data loading error [Amazon](500310) Invalid operation: could not complete because of conflict with concurrent transaction;
    Following steps completed: [Discover,Load]

Discover and Load always complete and the transaction error occurs at the Analyze stage.

I have been unable to find the cause of the transaction error. The Redshift console reports that all queries completed successfully, and STL_TR_CONFLICT is empty. The best I could find is this AWS thread, which says that queries completing in less than a second may not be logged.

Is this an error anyone has run into before? More specifically: what exactly is happening in the analyze stage? What is being run that could have a transaction failure and is there a way for me to get more detailed logs on the progress of the rdb_load step?

ihor · August 27, 2020, 10:45pm

@thedstrom, if you cannot figure out what your pipeline clashes with, you can skip the ANALYZE step with the --skip analyze option, as per the wiki.

thedstrom · August 27, 2020, 11:34pm

@ihor Thank you for the quick response! I would like to understand the potential side effects of skipping the analyze step. Does analyze just run Redshift’s ANALYZE on each of the Snowplow tables in the output schema? Does it perform any other actions? Apologies if there is documentation on this; the best I was able to find is this high-level description.

ihor · August 28, 2020, 1:02am

@thedstrom, there is nothing extraordinary about the ANALYZE executed as part of the EER job. It is the standard ANALYZE run on the tables in your schema, as defined in your target configuration file.

You might be running your own maintenance job (or one scheduled by AWS) that does VACUUM and ANALYZE independently of your pipeline (EER execution). If so, that could be sufficient, and the ANALYZE step would be safe to skip in EER.
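
For anyone who does skip ANALYZE in EER and runs it from a separate maintenance job instead, here is a minimal sketch of what that job could look like. It is not what EER or RDB Loader runs internally. The function names, the retry count and the delay are all hypothetical, and `execute` stands in for whatever callable your Redshift driver gives you to run a single statement. The only thing taken from this thread is the error text used to detect the conflict.

```python
import time
from typing import Callable, Iterable, List

# Substring of the Redshift error quoted earlier in this thread.
CONFLICT_MARKER = "conflict with concurrent transaction"


def analyze_statements(schema: str, tables: Iterable[str]) -> List[str]:
    """Build one ANALYZE statement per table (names are assumed to be trusted)."""
    return [f"ANALYZE {schema}.{table};" for table in tables]


def run_with_retry(
    execute: Callable[[str], None],  # hypothetical: runs one SQL statement
    statement: str,
    attempts: int = 3,               # assumed default, tune to taste
    delay_seconds: float = 5.0,      # assumed default, tune to taste
) -> int:
    """Run a statement, retrying only on the concurrent-transaction conflict.

    Returns the number of attempts used. Any other error is re-raised
    immediately, and the conflict itself is re-raised once attempts run out.
    """
    for attempt in range(1, attempts + 1):
        try:
            execute(statement)
            return attempt
        except Exception as error:
            if CONFLICT_MARKER not in str(error) or attempt == attempts:
                raise
            time.sleep(delay_seconds)
    raise ValueError("attempts must be at least 1")
```

In use, you would loop over `analyze_statements("atomic", ["events"])` and pass each statement to `run_with_retry` together with a function that wraps your driver's cursor. Matching on the message text is a blunt check and assumes the driver surfaces the Redshift message unchanged.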

mike · August 28, 2020, 2:27am

Redshift will for the most part auto ANALYZE in the background - so if you are only doing incremental loads the table statistics will mostly update for you. If you are loading a significant amount of data (or making structural changes to the table) it’s sometimes advisable to force ANALYZE to get table stats up to date.

So in most circumstances you can simply --skip analyze without incurring too much downside.

thedstrom · August 28, 2020, 8:30pm

Thanks so much everyone! I will start skipping ANALYZE and see how things go
