@trung, are you using Spark enrichment or Hadoop? We have switched to Spark starting from the R89 release. Spark differs from Hadoop in that it does most of its processing in memory, while Hadoop relies on writing temporary data to HDFS. Spark is therefore typically faster, as it avoids costly writes to disk.
In other words, for Spark you would want memory-optimized instances like the r4 family, which come with far more RAM. On its own, though, a bigger instance is not enough for high-volume data processing: Spark needs explicit tuning (executor count, cores and memory) so that the node's resources are fully used rather than left idle.
It’s hard to say what the best configuration would be from the event volume alone. No two events are the same; a lot depends on the volume and complexity of your self-describing events. We do have a rough correlation between the size of the enriched files and the EMR cluster/Spark configuration, so if you tell us the typical payload size of your compressed/uncompressed files, we can suggest a better-adjusted configuration.
For now, if we simply replace your 2x c4.large with 1x r4.xlarge, the Spark configuration would look roughly like the sketch below.
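A minimal sketch, assuming the instance change goes under `aws: emr: jobflow:` in your EmrEtlRunner config.yml and that your release lets you pass Spark/YARN properties through a `configuration:` section (if it doesn't, the same properties can be supplied as EMR classifications when the cluster is created). The memory figures are assumptions for a single r4.xlarge (4 vCPUs, ~30 GiB RAM) and should be adjusted to the YARN memory ceiling EMR actually reports for that instance type:

```yaml
jobflow:
  core_instance_count: 1
  core_instance_type: r4.xlarge              # replaces the 2x c4.large core instances
configuration:
  spark:
    maximizeResourceAllocation: "false"      # size executors by hand instead
  spark-defaults:
    spark.dynamicAllocation.enabled: "false"
    spark.executor.instances: "3"            # 4 vCPUs minus 1 kept for the driver
    spark.executor.cores: "1"
    spark.executor.memory: "5G"              # ~ (YARN budget - driver share) / 3, minus overhead
    spark.yarn.executor.memoryOverhead: "1024"   # MiB of off-heap headroom per executor
    spark.driver.memory: "3G"
    spark.yarn.driver.memoryOverhead: "1024"
    spark.default.parallelism: "12"          # a few partitions per executor core
```

The rule of thumb: executor instances x cores should cover the vCPUs not reserved for the driver, and executors x (executor memory + overhead) plus the driver's share should stay just under the node's YARN memory budget, so nothing sits idle and no container gets killed for exceeding its allocation.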
Then you are probably already using Spark Enrich (available from version 1.9.0) and can try the config I provided earlier. You can also let us know the typical size of the enriched files per ETL run and we can adjust the Spark configuration accordingly. That said, I think the info above already gives you everything you need.