EMR ETL performance

Hi @alex

My EMR ETL batch job is trying to process 150K files on S3, and Step 2 is taking way too long: it took 20 hours to complete with the configuration below. I came across your small-file post.

Do you think that is the issue? Also, where do I insert the S3DistCp consolidation task? Just looking for a specific pointer. Thanks for your help.

Current EMR Config:

master_instance_type: r3.xlarge
core_instance_count: 2
core_instance_type: r3.xlarge
task_instance_count: 3 # Increase to use spot instances
task_instance_type: r3.xlarge
task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures

hadoop_enrich: 1.7.0 # Version of the Hadoop Enrichment process
hadoop_shred: 0.9.0 # Version of the Hadoop Shredding process
hadoop_elasticsearch: 0.1.0

What did the Step 2 logs say?

@ChocoPowwwa Hi,

Nothing stood out in the logs. I think it is the Hadoop small-file issue mentioned in the post from @alex that I linked in my original post.

The only error was:
log4j:ERROR Failed to rename [/mnt/var/log/hadoop/steps/s-1HUUYLJ0H2U66/syslog] to [/mnt/var/log/hadoop/steps/s-1HUUYLJ0H2U66/syslog.2017-01-11-01].

How many events/how many files are going into the EMR job?


@mike Hi,

47K pairs of LZO/index files.


That’s a pretty significant number of files. Is that for a large date range, or is data being sunk on a very regular basis?

I imagine the time taken just to copy 47K files from S3 to HDFS would be considerable in and of itself. I wonder if it’s worth merging some LZO files together to create larger files rather than attempting to process all 47K at once. Thoughts @alex?

Based on slide 25 of the EMR Deep Dive deck linked below: bigger files = better performance.
Make sure to compress (we use LZO), and don’t forget to increase your timeout/number of records to accommodate the file size.

47K pairs: what file size / time window / number of records are you at for syncing with S3?

You also want to avoid the “small file problem”, which can have a negative effect not only on the S3 copy but on EMR processing as well.

(BDT305) Amazon EMR Deep Dive and Best Practices from Amazon Web Services

@13scoobie @mike Thanks.

The files are from about two weeks of activity on a very low-volume site. I am assuming each LZO contains one event (about 8 KB to 900 KB each compressed), and 47K events is not that crazy (I would imagine even as a daily volume).
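Rough back-of-the-envelope arithmetic (assuming a ~450 KB average, the midpoint of the compressed sizes above; an assumption, not a measurement) suggests how far compacting into ~128 MB chunks would cut the file count:

```shell
# 47K files at an assumed ~450 KB average, compacted into 128 MB targets
files=47000
avg_kb=450                                # assumption: midpoint of 8 KB to 900 KB
total_mb=$(( files * avg_kb / 1024 ))
echo "total: ${total_mb} MB -> ~$(( total_mb / 128 )) compacted files"
# prints "total: 20654 MB -> ~161 compacted files"
```

So the same data could plausibly fit in a few hundred files instead of 47,000, which is a very different workload for the copy and map phases.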

Question: where do I add a step in the EMR job to run S3DistCp and compact the files into a few large ones, as described by @alex in his post here?


@13scoobie and @mike are right - that’s a ton of files!

To fix the problem going forward, adjust the buffer configuration for your S3 sink. To resolve the historical backlog, you can do the following:

  1. Remove the .lzo.index files
  2. Run compaction using /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar
  3. (Optional) Re-index the files using s3://snowplow-hosted-assets/third-party/twitter/hadoop-lzo-0.4.20.jar
  4. Kick off the regular Snowplow job from the EMR phase
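For anyone looking for concrete commands, here is a sketch of steps 1 to 3. The bucket names and prefixes are placeholders, and the `--groupBy` regex and `--targetSize` value are assumptions to adjust for your own data:

```shell
# Step 1: remove the .lzo.index files so only the data files are compacted
# (assumed bucket/prefix names; substitute your own)
aws s3 rm s3://my-raw-bucket/processing/ --recursive \
  --exclude "*" --include "*.lzo.index"

# Step 2: group the small .lzo files and compact them into ~128 MB chunks
hadoop jar /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar \
  --src s3://my-raw-bucket/processing/ \
  --dest s3://my-raw-bucket/compacted/ \
  --groupBy '.*(\.lzo)' \
  --targetSize 128 \
  --outputCodec lzo

# Step 3 (optional): re-index the compacted files with hadoop-lzo
hadoop jar hadoop-lzo-0.4.20.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer \
  s3://my-raw-bucket/compacted/
```

`--groupBy` collects every file whose name matches the capture group into the same output group, and `--targetSize` (in MB) splits each group into files of roughly that size.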

Don’t the files get aggregated into 128 MB chunks in the S3DistCp step? The large number of files wouldn’t explain why the EMR process took 20 hours to run, right?


Sorry @alex, I am unable to find the setting to adjust the buffer in Snowplow. It seems Kinesis Streams doesn’t allow for that, though I know Kinesis Firehose does. Am I totally off track here?


@sachinsingh10 You’ll want to look at the buffer settings in your configuration file here (under buffer).
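For reference, the buffer block in the S3 sink configuration looks roughly like this. The values below are illustrative assumptions, not recommendations:

buffer {
  byte_limit: 67108864    # flush after ~64 MB of buffered events
  record_limit: 100000    # ...or after 100K records
  time_limit: 600000      # ...or after 10 minutes (milliseconds)
}

Raising these limits means events accumulate longer before each flush, producing fewer, larger files on S3.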