How to engage EMRFS consistency when running snowplow-emr-etl-runner

When we switched to a larger node type, we got an error from the last step of shredding (Elasticity S3DistCp Step: Shredded HDFS -> S3):
Error: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID: 5A2F87935C17C792), S3 Extended Request ID: 6YcZaPRh5xyaWrQUz9KDpRyKhiGt59QcWVIXNvsOxk1oNRegZX6CgEN1974w1c0eIN35YgzTe/I=

According to AWS, this is caused by a lot of data being pushed to S3 too aggressively. The suggested mitigations are either to set “--targetSize=SIZE” to a larger size or to enable EMRFS consistent view (http://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-configure-consistent-view.html).
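For context, the linked AWS doc essentially comes down to setting the emrfs-site classification on the cluster, along these lines (the table name shown is just the AWS default):

    [
      {
        "Classification": "emrfs-site",
        "Properties": {
          "fs.s3.consistent": "true",
          "fs.s3.consistent.metadata.tableName": "EmrFSMetadata"
        }
      }
    ]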

Can we modify config.yml to implement the above suggestions, given that we are using snowplow-emr-etl-runner? What is a good way to do it?

Thanks,
Richard

I agree that one way to go about this is to adjust --targetSize, combining it with --groupBy.
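For illustration only, a standalone S3DistCp invocation using those two options could look roughly like this (the paths, regex and size are made up; in practice EmrEtlRunner builds the actual step arguments for you):

    s3-dist-cp \
      --src hdfs:///local/snowplow/shredded-events \
      --dest s3://acme-snowplow-shredded/good/run=2017-09-01-00-00-00/ \
      --groupBy '.*(part-).*' \
      --targetSize 512 \
      --outputCodec gz

--targetSize is expressed in mebibytes; files matched by the --groupBy regex are concatenated and then split into chunks of roughly that size.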

However, another way to go about it would be upstream: if you’re using the scala-stream-collector, you can produce bigger files in S3 with the s3-loader by giving it a bigger buffer (see the sketch below).
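A minimal sketch of the relevant part of the S3 loader config, assuming the buffer field names from the s3-loader example config (the numbers are illustrative; the loader flushes when the first limit is reached):

    buffer {
      byteLimit   = 134217728  # flush a file to S3 at roughly 128 MB
      recordLimit = 500000     # or after this many records
      timeLimit   = 600000     # or after 10 minutes (in ms), whichever comes first
    }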

Those bigger files would then ripple through your pipeline, after enrich and after shred, so you would end up with bigger files being moved to S3 and wouldn’t hit “SlowDown”.

This is particularly interesting because both the enrich and the shred jobs’ parallelism is dictated by the number of files: with bigger files you can better utilize your cluster.

Finally, another way to go about it would be to run EmrEtlRunner more frequently, depending, of course, on how often you’re running it now.
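Purely as a sketch, if you schedule EmrEtlRunner with cron, shortening the interval is a one-line change (the paths are hypothetical and the exact subcommand/flags depend on your release):

    # hypothetical crontab entry: run every 2 hours instead of, say, once a day
    0 */2 * * * /opt/snowplow/snowplow-emr-etl-runner run --config /opt/snowplow/config.yml --resolver /opt/snowplow/resolver.json >> /var/log/snowplow/emr-etl-runner.log 2>&1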
