Hi @ihor, @egor
Yes, this was due to the total size of the files; my cluster wasn't big enough. I changed the instance type to m4.2xlarge and that fixed my issue. Thanks a lot for everyone's replies. This setup is more than 6 times faster, but I still need to check the cost.
Related posts:
We’re attempting to reprocess our raw CloudFront logs and are batching several of the old runs into one new run for efficiency. This means the input data is much larger than normal. We’re having difficulty processing this data. It’s probably related to the cluster resources, but the exception I can find in the logs isn’t helpful, nor is the failure mechanism.
The failure mechanism is that the job is still running as far as EMR is concerned and it’s just taking much longer than expected. Although…
We had a period where ETL wasn’t run and are now trying to play catch-up. I’m rerunning about 842 files, averaging 5 MB per raw file of events, so very small data files. I increased the number of core instances to 8 r3.xlarge hoping that would help get it done. It’s been running for about 15 hours now and utilization is very low across the 8 machines. I know the process prefers smaller batches, so I’ll have to cancel and write a script to only give ETL say 50 files at …
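The batching script described above could be sketched roughly like this. This is only a minimal illustration, not the actual script from the thread: the file names are made up, and the 50-file batch size is just the value suggested in the post.

```python
import itertools

def batches(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

# Hypothetical example: split 842 raw log files into batches of 50,
# to be handed to the ETL one batch at a time instead of all at once.
files = [f"raw-{i:04d}.log.gz" for i in range(842)]
for n, batch in enumerate(batches(files, 50)):
    # In a real run, each batch would be staged for the ETL here,
    # e.g. moved to a fresh input prefix before launching the job.
    print(f"batch {n}: {len(batch)} files")
```

With 842 files this produces 17 batches: 16 full batches of 50 and a final batch of 42.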