Rdb_load failing at analyze step

Some context about my setup: I am running EmrEtlRunner r119 in stream enrich mode on a persistent cluster loading into Redshift. It is running every 20 minutes with a lock to ensure only one run at a time. Runs are failing at the rdb_load step at least once or twice a week with this error:

   ERROR: Data loading error [Amazon](500310) Invalid operation: could not complete because of conflict with concurrent transaction;
    Following steps completed: [Discover,Load]

Discover and Load always complete and the transaction error occurs at the Analyze stage.

I have been unable to find the cause of the transaction error. The Redshift console reports all queries completed successfully and STL_TR_CONFLICT is empty. The best I could find is this AWS thread that says that queries that complete in less than a second may not be logged.

Is this an error anyone has run into before? More specifically: what exactly is happening in the analyze stage? What is being run that could have a transaction failure and is there a way for me to get more detailed logs on the progress of the rdb_load step?

@thedstrom, if you cannot figure out what your pipeline clashes with, you can skip ANALYZE step with --skip analyze option as per wiki.

1 Like

@ihor Thank you for the quick response! I would like to understand potential side-effects of skipping the analyze step. Does analyze just run Redshift’s ANALYZE on each of the Snowplow tables in the output schema? Does it perform any other actions? Apologies if there is documentation on this, the best I was able to find is this high level description.

@thedstrom, nothing extraordinary about ANALYZE executed as part of EER job. It is the standard ANALYZE execution on the tables in your schema as defined in your target configuration file.

You might be running your own maintenance job (or scheduled by AWS) that does VACUUM and ANALYZE independently of your pipeline (EER execution). That could be sufficient and therefore safe to skip in EER.

2 Likes

Redshift will for the most part auto ANALYZE in the background - so if you are only doing incremental loads the table statistics will mostly update for you. If you are loading a significant amount of data (or making structural changes to the table) it’s sometimes advisable to force ANALYZE to get table stats up to date.

So in most circumstances you can simply --skip analyze without incurring too much downside.

1 Like

Thanks so much everyone! I will start skipping ANALYZE and see how things go