How to estimate the EBS storage size needed for an EMR process?

We are upgrading our Snowplow stack to Snowplow R89 Plain of Jars. We need to use EBS storage with the EMR instances.

How do we estimate the EBS storage size needed for our EMR process?

Hi @rahul,

Compared to Hadoop, the EBS volume_size can be decreased for Spark, since only the output datasets will be written to disk. So you can take the size of your biggest dataset and add 25-50% on top.
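To make the arithmetic concrete, here is a minimal sketch of that rule of thumb in Python. The function name `estimate_volume_size_gb` and the 40 GB figure are made up for illustration; only the "biggest dataset + 25-50%" rule comes from the advice above.

```python
import math


def estimate_volume_size_gb(largest_dataset_gb, headroom=0.5):
    """Largest dataset size plus a fractional headroom, rounded up to whole GB.

    headroom is a fraction: 0.25 means +25%, 0.5 means +50%.
    """
    if largest_dataset_gb < 0 or headroom < 0:
        raise ValueError("sizes and headroom must be non-negative")
    return math.ceil(largest_dataset_gb * (1 + headroom))


# Hypothetical example: the biggest dataset is 40 GB.
print(estimate_volume_size_gb(40, headroom=0.25))  # 50
print(estimate_volume_size_gb(40, headroom=0.5))   # 60
```

The result is only a starting value for volume_size; treat it as a lower bound and check actual disk usage after the first few runs.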

However, I’d recommend upgrading to R92, or even R97 if you use the Clojure collector, to get all the performance benefits.

@egor Which dataset should we consider: the input dataset or the output dataset?

Thanks in advance :slight_smile:

The output one, since it will be written to disk. Do note that Spark, unlike Hadoop, is memory hungry, so you should allocate enough memory for it (e.g. by using memory-optimized instances).

We believe that 6-7GB of the disk will be used, presumably by the OS and managed software. We got caught out creating an 8GB EBS volume to process 500MB of data: it filled the disk, and it appears there was only about 2GB available for Spark and HDFS to use. Unfortunately I no longer have a record of the output size of the Snowplow enriched and shredded steps. From the EMR monitoring tab, it does look like it used ~16GB of the available capacity.
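As a rough sketch of what this means for sizing, the snippet below subtracts a fixed overhead from the volume before counting usable space. The function names, the 7 GB default, and the 10 GB output figure are assumptions for illustration; the 6-7GB overhead is only what we observed on our cluster, not a documented EMR figure.

```python
import math


def usable_space_gb(volume_gb, overhead_gb=7.0):
    """Space left for Spark and HDFS after the OS and managed software."""
    return max(0.0, volume_gb - overhead_gb)


def volume_with_overhead_gb(output_gb, headroom=0.5, overhead_gb=7.0):
    """Output size plus fractional headroom plus fixed overhead, in whole GB."""
    return math.ceil(output_gb * (1 + headroom) + overhead_gb)


# Our 8GB volume, with the overhead at either end of the observed range.
print(usable_space_gb(8))                   # 1.0
print(usable_space_gb(8, overhead_gb=6.0))  # 2.0

# Hypothetical 10 GB output dataset with 50% headroom.
print(volume_with_overhead_gb(10))          # 22
```

In other words, a small volume can be almost entirely consumed by the overhead before any job data lands on it.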

Thanks for the reply @gareth and @egor.