StorageLoader taking more time

Hi,

Over the last few days we've noticed that StorageLoader has started to take a long time to load data. Previously, loading around 2-3 million records per day took around 20-35 minutes, but we have since started tracking more data (around 7-9 million records per day).

For the last few days StorageLoader has been taking more than an hour to complete. What could be the reason for this?

Hi @rahul, it sounds like the load process is taking longer because you are loading more data. Which database are you loading your data into?

Hi @alex,

Thanks for replying.

We are loading the data into Redshift.

Right - the load process will likely speed up again if you scale your Redshift cluster up to match the increased event volumes.
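If you go the CLI route, the resize would look something like this - the cluster identifier and node count are placeholders for your own setup, and note that a classic resize puts the cluster into read-only mode while it runs:

# Hypothetical sketch: scale up a Redshift cluster with the aws CLI.
# "snowplow-redshift" and the node count are placeholders;
# the cluster stays read-only until the resize completes.
aws redshift modify-cluster \
    --cluster-identifier snowplow-redshift \
    --number-of-nodes 4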

@alex

How about using Redshift's WLM query_slot_count? We could increase the slot count for the COPY commands.
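Something like this, run in the same session as the COPY statements - the slot count of 3 and the connection details are just placeholders:

#!/bin/bash
# Hypothetical sketch: claim extra WLM slots for the session running the COPYs.
# wlm_query_slot_count is session-scoped, so the SET must happen in the same
# session as the COPY statements; host/db/user below are placeholders.
psql "host=my-cluster.example.us-east-1.redshift.amazonaws.com port=5439 dbname=snowplow user=loader" <<'SQL'
SET wlm_query_slot_count TO 3;
-- ... COPY statements for this run go here, in the same session ...
RESET wlm_query_slot_count;
SQL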

Definitely worth trying - please report back here with your findings! Thanks @rahul

@rahul I’d also appreciate your findings on changing query_slot_count!

@alex a related thought: the "move to archive" step takes quite a while to finish on our pipeline. Do you see any downsides to running the ETL with --skip archive, then running the loader and archive steps in parallel?

Hey @bernardosrulzon:

Right - this is because the Sluice-based Ruby archiving code is very slow, especially if you have lots of different types of events and contexts.

The archive_raw part of this problem should be fixed in the next release.

Within the next couple of months, the archive_enriched problem should go away too.

So rather than parallelize the steps, I would wait for R87, and then consider temporarily disabling archive_enriched and replacing it with a bash script using the aws CLI, something like:

#!/bin/bash

# Fast fail
set -e


# -----------------------------------------------------------------------------
#  FUNCTIONS & PROCEDURES
# -----------------------------------------------------------------------------

# Similar to Perl die
function die() {
    echo "$@" 1>&2 ; exit 1;
}


# -----------------------------------------------------------------------------
#  ARGUMENT PARSING
# -----------------------------------------------------------------------------

[ "$#" -eq 0 ] || die "0 arguments expected, $# provided"


# -----------------------------------------------------------------------------
#  COMMANDS
# -----------------------------------------------------------------------------

ret_val=1
retries="0"
set +e
while [[ $ret_val -ne 0 && $retries -lt 4 ]]
do
    echo "Attempting to archive enriched events (ret_val: ${ret_val}; retries: ${retries})"
    /usr/local/bin/aws --profile my-aws-profile s3 mv s3://snowplow-my-enrichment-output/main/enriched/good/ s3://snowplow-my-enrichment-archive/main/enriched/good/ --recursive --include "*.*"
    ret_val=$?
    retries=$((retries+1))
done
if [ $ret_val -ne 0 ]; then
    echo "Error archiving enriched events, exiting with return code ${ret_val}."
    exit $ret_val
fi
set -e

ret_val=1
retries="0"
set +e
while [[ $ret_val -ne 0 && $retries -lt 4 ]]
do
    echo "Attempting to archive enriched events (ret_val: ${ret_val}; retries: ${retries})"
    /usr/local/bin/aws --profile my-aws-profile s3 mv s3://snowplow-my-enrichment-output/main/shredded/good/ s3://snowplow-my-enrichment-archive/main/shredded/good/ --recursive --include "*.*"
    ret_val=$?
    retries=$[$retries+1]
done
if [ $ret_val -ne 0 ]; then
    echo "Error running archiving shredded events, exiting with return code ${ret_val}."
    exit $ret_val
fi
set -e

That makes a lot of sense - thanks!!

@alex @rahul Increasing the WLM query_slot_count didn't have a significant impact here. It seems that all COPY queries are run serially, inside one giant transaction.

I guess it makes sense to split each COPY into its own transaction, and perform the bucket moves after each table has been loaded (to avoid duplication in case of failure). Thoughts?
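Roughly this shape for each table - every table name, path and COPY option below is made up for illustration:

#!/bin/bash
# Hypothetical sketch of the proposed flow: load one shredded table in its own
# transaction, then immediately archive that table's files so a later failure
# can't cause them to be loaded twice. All names and paths are placeholders.
set -e

CONN="host=my-cluster.example.us-east-1.redshift.amazonaws.com port=5439 dbname=snowplow user=loader"

# Step 1: COPY this table inside its own transaction
psql -v ON_ERROR_STOP=1 "$CONN" <<'SQL'
BEGIN;
COPY atomic.com_acme_my_context_1
FROM 's3://snowplow-my-enrichment-output/main/shredded/good/com_acme_my_context_1/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftLoad'
JSON 'auto' GZIP;
COMMIT;
SQL

# Step 2: only after the COMMIT succeeds, move this table's files to archive
aws --profile my-aws-profile s3 mv \
    s3://snowplow-my-enrichment-output/main/shredded/good/com_acme_my_context_1/ \
    s3://snowplow-my-enrichment-archive/main/shredded/good/com_acme_my_context_1/ \
    --recursive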

I suspect the logic behind wrapping multiple COPYs in a single transaction is to avoid a partial data load. For example, in the single-transaction scenario, if a table of shredded events failed to load for some reason but atomic.events succeeded, the whole transaction would fail and nothing would be committed. With individual transactions, the atomic.events load could complete while a shredded table load fails.

@mike but that wouldn't be a problem as long as the successful transactions had their directories moved to archive in S3, right? One would just have to rerun StorageLoader after fixing whatever went wrong.

@alex @mike We can start working on a PR to split the COPY statements into individual transactions and take advantage of parallelism, if that makes sense to you. The bash script helped a lot, but StorageLoader is still taking quite a while to run.

The idea is that each COPY statement would be immediately followed by a script to move the corresponding files to archive.

Hey @bernardosrulzon - I think parallelizing the COPY commands is a really interesting idea. I don’t think file moves are a good way of keeping track of all this though - they are far too brittle, and it will be hard to reason about (and recover) when individual table loads / file archives fail.

Did you see the new load manifest in R87?

It could be interesting to extend this to track loads right down to the per-shredded-table basis; the next step would be to use this table to actually drive which table:run pairs to load next.
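For example, inspecting recent loads would look something like this - I'm assuming here that the R87 manifest lives at atomic.manifest and records a run timestamp in etl_tstamp; adjust for your own deployment:

#!/bin/bash
# Hypothetical sketch: list the ten most recent runs recorded in the R87 load
# manifest. The table name (atomic.manifest) and the etl_tstamp column are
# assumptions; connection details are placeholders.
psql "host=my-cluster.example.us-east-1.redshift.amazonaws.com port=5439 dbname=snowplow user=loader" <<'SQL'
SELECT *
FROM atomic.manifest
ORDER BY etl_tstamp DESC
LIMIT 10;
SQL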

I think the challenge with a PR right now is that there are major changes afoot in the StorageLoader which would be hard to synchronize with.

Hopefully within the next two months the StorageLoader codebase should have settled down again - and ought to be much more accessible then to boot! Perhaps it’s worth waiting till then for a PR?

Sounds good, thanks!