Performance managing S3 buckets

vceron · May 4, 2017, 4:59pm

Hi all,

We have improved our batch pipeline performance a little with the following changes.

Archive step

If the server running the EmrEtlRunner pipeline is in a different region from the one used by the “archive_enrich” step (Step 12 in the image), performance can suffer, even if the source (:good) and target (:archive) buckets are in the same region as each other.

Below is the performance we saw with two different regions and with a single region:

Different regions: peaks of 10k network packets during the archive step
One region: peaks of 30k+ network packets during the archive step

[Screenshot: Screenshot from 2017-05-04 18-57-16.png]
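
One way to catch this before a run is to compare the region of the EmrEtlRunner host with the regions of the buckets involved. This is only a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the helper names are hypothetical and are not part of EmrEtlRunner.

```python
def normalize_location(location_constraint):
    """Map the LocationConstraint returned by S3 to a region name.

    S3 reports None for us-east-1 and the legacy value "EU" for eu-west-1.
    """
    if location_constraint is None:
        return "us-east-1"
    if location_constraint == "EU":
        return "eu-west-1"
    return location_constraint


def find_region_mismatches(runner_region, bucket_regions):
    """Return {bucket: region} for every bucket outside runner_region."""
    return {
        bucket: region
        for bucket, region in bucket_regions.items()
        if region != runner_region
    }


def lookup_bucket_regions(bucket_names):
    """Ask S3 for each bucket's region (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the pure helpers above work without it

    s3 = boto3.client("s3")
    return {
        name: normalize_location(
            s3.get_bucket_location(Bucket=name)["LocationConstraint"]
        )
        for name in bucket_names
    }
```

Calling `find_region_mismatches` with the runner's region and the result of `lookup_bucket_regions` for your :good and :archive buckets returns an empty dict when everything is in the same region.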

Clojure tracker

If your collectors are in different regions and log to S3, we recommend enabling S3 Cross-Region Replication to sync the files into your :raw:in bucket.
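
As a rough illustration, replication can be enabled with boto3's `put_bucket_replication`. This is a sketch under assumptions: the bucket names, the IAM role ARN and the rule ID are placeholders, versioning must already be enabled on both buckets, and the role must allow S3 to replicate objects on your behalf. Only objects written after the rule is enabled are replicated.

```python
def build_replication_config(role_arn, destination_bucket,
                             rule_id="replicate-raw-logs"):
    """Build a replication configuration that copies every new object.

    role_arn, destination_bucket and rule_id are placeholders to adapt.
    """
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": rule_id,
                "Prefix": "",  # empty prefix: the rule applies to all objects
                "Status": "Enabled",
                "Destination": {
                    "Bucket": "arn:aws:s3:::" + destination_bucket,
                },
            }
        ],
    }


def enable_replication(source_bucket, config):
    """Apply the configuration to the collector's log bucket."""
    import boto3  # imported lazily; needs AWS credentials at call time

    s3 = boto3.client("s3")
    s3.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=config,
    )
```

You would call `enable_replication` once per collector log bucket, each time pointing at the same :raw:in bucket.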

alex · May 4, 2017, 6:04pm

Many thanks for sharing these performance tips @vceron!

Just to add one more: cross-region loading of Redshift (S3 in one region, Redshift in another) is incredibly slow as well. Try to avoid that wherever possible - even if you have to add an aws s3 mv step in-between EmrEtlRunner and StorageLoader.
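
For illustration, such an in-between step could be a small wrapper around the AWS CLI. This is a sketch only: it assumes the `aws` CLI is installed and configured, and the function names and S3 URIs are hypothetical.

```python
import subprocess


def build_s3_mv_command(source_uri, target_uri, source_region, target_region):
    """Build an `aws s3 mv` command that moves a whole prefix across regions."""
    return [
        "aws", "s3", "mv", source_uri, target_uri,
        "--recursive",
        "--source-region", source_region,  # region of the source bucket
        "--region", target_region,         # region of the target bucket
    ]


def move_to_redshift_region(source_uri, target_uri, source_region, target_region):
    """Run the move; raises CalledProcessError if the CLI exits non-zero."""
    command = build_s3_mv_command(
        source_uri, target_uri, source_region, target_region
    )
    subprocess.run(command, check=True)
```

Run it after EmrEtlRunner finishes and before StorageLoader starts, so the load reads from a bucket in the same region as the Redshift cluster.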
