StorageLoader taking more time

Hi,

Over the last few days we've noticed that StorageLoader has started to take a long time to load data. Previously, loading around 2-3 million records per day took around 20-35 minutes, but we have since started tracking more data (around 7-9 million records per day).

For the last few days StorageLoader has been taking more than an hour to complete. What could be the reason for this?

Hi @rahul, it sounds like the load process is taking longer because you are loading more data. Which database are you loading your data into?

Hi @alex,

Thanks for replying.

We are loading the data into Redshift.

Right - the load process will likely speed up again if you scale your Redshift cluster up to match the increased event volumes.
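If you go the CLI route, the resize would look something like this - the cluster identifier and node count are placeholders for your own setup, and note that a classic resize puts the cluster into read-only mode while it runs:

# Hypothetical sketch: scale up a Redshift cluster with the aws CLI.
# "snowplow-redshift" and the node count are placeholders;
# the cluster stays read-only until the resize completes.
aws redshift modify-cluster \
    --cluster-identifier snowplow-redshift \
    --number-of-nodes 4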

@alex

How about using Redshift's WLM query_slot_count? We could increase the slot count for the COPY commands.
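Something like this, run in the same session as the COPY statements - the slot count of 3 and the connection details are just placeholders:

#!/bin/bash
# Hypothetical sketch: claim extra WLM slots for the session running the COPYs.
# wlm_query_slot_count is session-scoped, so the SET must happen in the same
# session as the COPY statements; host/db/user below are placeholders.
psql "host=my-cluster.example.us-east-1.redshift.amazonaws.com port=5439 dbname=snowplow user=loader" <<'SQL'
SET wlm_query_slot_count TO 3;
-- ... COPY statements for this run go here, in the same session ...
RESET wlm_query_slot_count;
SQL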

Definitely worth trying - please report back here with your findings! Thanks @rahul

@rahul I’d also appreciate your findings on changing query_slot_count!

@alex a related thought: the "move to archive" step takes quite a while to finish on our pipeline. Do you see any downsides to running the ETL with --skip archive, then running the loader and archive steps in parallel?

Hey @bernardosrulzon:

Right - this is because the Sluice-based Ruby archiving code is very slow, especially if you have lots of different types of events and contexts.

The archive_raw part of this problem should be fixed in the next release.

Within the next couple of months, the archive_enriched problem should go away too.

So rather than parallelize the steps, I would wait for R87, and then consider temporarily disabling archive_enriched and replacing it with a bash script using the aws CLI, something like:

#!/bin/bash

# Fast fail
set -e


# -----------------------------------------------------------------------------
#  FUNCTIONS & PROCEDURES
# -----------------------------------------------------------------------------

# Similar to Perl die
function die() {
    echo "$@" 1>&2 ; exit 1;
}


# -----------------------------------------------------------------------------
#  ARGUMENT PARSING
# -----------------------------------------------------------------------------

[ "$#" -eq 0 ] || die "0 arguments expected, $# provided"


# -----------------------------------------------------------------------------
#  COMMANDS
# -----------------------------------------------------------------------------

ret_val=1
retries="0"
set +e
while [[ $ret_val -ne 0 && $retries -lt 4 ]]
do
    echo "Attempting to archive enriched events (ret_val: ${ret_val}; retries: ${retries})"
    /usr/local/bin/aws --profile my-aws-profile s3 mv s3://snowplow-my-enrichment-output/main/enriched/good/ s3://snowplow-my-enrichment-archive/main/enriched/good/ --recursive --include "*.*"
    ret_val=$?
    retries=$((retries+1))
done
if [ $ret_val -ne 0 ]; then
    echo "Error archiving enriched events, exiting with return code ${ret_val}."
    exit $ret_val
fi
set -e

ret_val=1
retries="0"
set +e
while [[ $ret_val -ne 0 && $retries -lt 4 ]]
do
    echo "Attempting to archive enriched events (ret_val: ${ret_val}; retries: ${retries})"
    /usr/local/bin/aws --profile my-aws-profile s3 mv s3://snowplow-my-enrichment-output/main/shredded/good/ s3://snowplow-my-enrichment-archive/main/shredded/good/ --recursive --include "*.*"
    ret_val=$?
    retries=$[$retries+1]
done
if [ $ret_val -ne 0 ]; then
    echo "Error running archiving shredded events, exiting with return code ${ret_val}."
    exit $ret_val
fi
set -e

That makes a lot of sense - thanks!!

@alex @rahul Increasing the WLM query_slot_count didn't have a significant impact here. It seems that all COPY queries are run serially, inside one giant transaction.

I guess it makes sense to split each COPY into its own transaction, and perform the bucket moves after each table has been loaded (to avoid duplication in case of failure). Thoughts?
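Roughly this shape for each table - every table name, path and COPY option below is made up for illustration:

#!/bin/bash
# Hypothetical sketch of the proposed flow: load one shredded table in its own
# transaction, then immediately archive that table's files so a later failure
# can't cause them to be loaded twice. All names and paths are placeholders.
set -e

CONN="host=my-cluster.example.us-east-1.redshift.amazonaws.com port=5439 dbname=snowplow user=loader"

# Step 1: COPY this table inside its own transaction
psql -v ON_ERROR_STOP=1 "$CONN" <<'SQL'
BEGIN;
COPY atomic.com_acme_my_context_1
FROM 's3://snowplow-my-enrichment-output/main/shredded/good/com_acme_my_context_1/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftLoad'
JSON 'auto' GZIP;
COMMIT;
SQL

# Step 2: only after the COMMIT succeeds, move this table's files to archive
aws --profile my-aws-profile s3 mv \
    s3://snowplow-my-enrichment-output/main/shredded/good/com_acme_my_context_1/ \
    s3://snowplow-my-enrichment-archive/main/shredded/good/com_acme_my_context_1/ \
    --recursive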

I suspect the logic behind wrapping multiple COPYs in a single transaction is to avoid a partial data load. For example, in the single-transaction scenario, if a table of shredded events failed to load for some reason but atomic.events succeeded, the whole transaction would fail and nothing would be committed. With individual transactions, the atomic.events load could complete while a shredded table load fails.

@mike but that wouldn't be a problem as long as the successful transactions had their directories moved to archive in S3, right? One would just have to rerun StorageLoader after fixing whatever went wrong.

@alex @mike We can start working on a PR to split the COPY statements into individual transactions and take advantage of parallelism, if that makes sense to you. The bash script helped a lot, but StorageLoader is still taking quite a while to run.

The idea is that each COPY statement would be immediately followed by a script to move the corresponding files to archive.

Hey @bernardosrulzon - I think parallelizing the COPY commands is a really interesting idea. I don’t think file moves are a good way of keeping track of all this though - they are far too brittle, and it will be hard to reason about (and recover) when individual table loads / file archives fail.

Did you see the new load manifest in R87?

It could be interesting to extend this to track loads right down to the per-shredded-table basis; the next step would be to use this table to actually drive which table:run pairs to load next.
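For example, inspecting recent loads would look something like this - I'm assuming here that the R87 manifest lives at atomic.manifest and records a run timestamp in etl_tstamp; adjust for your own deployment:

#!/bin/bash
# Hypothetical sketch: list the ten most recent runs recorded in the R87 load
# manifest. The table name (atomic.manifest) and the etl_tstamp column are
# assumptions; connection details are placeholders.
psql "host=my-cluster.example.us-east-1.redshift.amazonaws.com port=5439 dbname=snowplow user=loader" <<'SQL'
SELECT *
FROM atomic.manifest
ORDER BY etl_tstamp DESC
LIMIT 10;
SQL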

I think the challenge with a PR right now is that there are major changes afoot in the StorageLoader which would be hard to synchronize with.

Hopefully within the next two months the StorageLoader codebase should have settled down again - and ought to be much more accessible then to boot! Perhaps it’s worth waiting till then for a PR?

Sounds good, thanks!