Here’s our experience. We’re typically able to process 10 GB of data in an hour with four r3.8xlarge core nodes. We haven’t really tried r4s with EBS, since r3s have instance storage. I haven’t measured the performance impact of instance storage vs. EBS; it could be negligible. And there would be advantages to using a beefier box like an r4.16xlarge.
Our “master” node is an m1.medium. It does nothing but run the resource manager, so there’s no reason to make it any larger than that.
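In case it helps frame the r4.16xlarge question, here’s a rough sketch of how executor sizing might look on our r3.8xlarge core nodes (32 vCPUs, 244 GiB RAM each). The core/memory numbers are rule-of-thumb assumptions, not our measured production config:

    import org.apache.spark.sql.SparkSession

    // Illustrative sizing only. Heuristic: ~5 cores per executor, so
    // 6 executors/node uses 30 of 32 vCPUs, leaving headroom for the
    // OS and YARN daemons. 6 x (34g heap + 4g overhead) ~= 228 GiB,
    // which fits under the node's 244 GiB.
    val spark = SparkSession.builder()
      .appName("sizing-sketch")
      .config("spark.executor.cores", "5")
      .config("spark.executor.memory", "34g")
      .config("spark.yarn.executor.memoryOverhead", "4096") // MB
      .config("spark.executor.instances", "24")             // 6 per node x 4 nodes
      .getOrCreate()

A bigger box like an r4.16xlarge would just mean more executors per node under the same arithmetic.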
A single 28 GB file would definitely be a BAD idea as far as efficiency. For Spark jobs, the sweet spot is between 100 MB and 1 GB per file; we produce ~70 MB files. You want enough files to make full use of the parallelism in Spark.
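As a concrete illustration, here’s a minimal sketch of repartitioning so each output file lands in that sweet spot. The paths, the 28 GB input size, and the 128 MB target are all made-up assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("resize-files").getOrCreate()

    // Hypothetical input: one big 28 GB dataset.
    val df = spark.read.parquet("s3://bucket/big-input/")

    val totalBytes      = 28L * 1024 * 1024 * 1024  // assumed known up front
    val targetFileBytes = 128L * 1024 * 1024        // aim for ~128 MB per file
    val numFiles        = math.max(1, (totalBytes / targetFileBytes).toInt) // ~224

    // Spark writes one file per partition, so repartition to the
    // file count we want before writing.
    df.repartition(numFiles).write.parquet("s3://bucket/resized-output/")

With ~224 files instead of one, every executor core gets work instead of a single task grinding through the whole thing.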
Would be curious to hear others’ data points (particularly with r4.16xlarge instances).
Thanks
Rick