EMR ETL performance

sachinsingh10 · January 11, 2017, 7:06am

My EMR ETL batch job is trying to process 150K files on S3, and Step 2 is taking way too long: it took 20 hours to complete using the configuration below. I came across your small files post.

Do you think that is the issue? Also, where do I insert the S3DistCp consolidation task? I am just looking for a specific pointer. Thanks for your help.

Current EMR Config:

jobflow:
  master_instance_type: r3.xlarge
  core_instance_count: 2
  core_instance_type: r3.xlarge
  task_instance_count: 3 # Increase to use spot instances
  task_instance_type: r3.xlarge
  task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures

versions:
  hadoop_enrich: 1.7.0 # Version of the Hadoop Enrichment process
  hadoop_shred: 0.9.0 # Version of the Hadoop Shredding process
  hadoop_elasticsearch: 0.1.0

ChocoPowwwa · January 22, 2017, 12:20pm

What did the step 2 logs say?

sachinsingh10 · January 22, 2017, 11:29pm

@ChocoPowwwa Hi,

Nothing stood out in the logs. I think it is the Hadoop small files issue described in the post from @alex that I linked in my original post.

The only error:
log4j:ERROR Failed to rename [/mnt/var/log/hadoop/steps/s-1HUUYLJ0H2U66/syslog] to [/mnt/var/log/hadoop/steps/s-1HUUYLJ0H2U66/syslog.2017-01-11-01].

mike · January 23, 2017, 2:30am

How many events/how many files are going into the EMR job?

sachinsingh10 · January 23, 2017, 4:13am

@mike Hi,

47K pairs of LZO/Index.

Regards

mike · January 23, 2017, 10:53pm

That’s a pretty significant number of files. Is that for a large date range, or is data being sunk on a very regular basis?

I imagine the time taken just to copy 47K files from S3 to HDFS would be substantial in and of itself. I wonder if it’s worth merging some LZO files together to create larger files rather than attempting to process 47K all at once. Thoughts, @alex?

13scoobie · January 23, 2017, 11:01pm

Based on this slide deck for the EMR deep dive (slide 25): bigger files = better performance.
Make sure to compress (we use LZO), and don’t forget to increase your timeout/number of records to accommodate the file size.

47K pairs -> what file size / time / # of records are you at for syncing with S3?

You also want to avoid the “small file problem”, which can hurt not only the S3 copy but EMR processing as well.

(BDT305) Amazon EMR Deep Dive and Best Practices from Amazon Web Services

sachinsingh10 · January 23, 2017, 11:38pm

@13scoobie @mike Thanks.

The files are from about 2 weeks of activity on a very low volume site. I am assuming each LZO is one event (about 8K - 900K each compressed) and 47K events per day is not that crazy (I would imagine for even a daily volume).

Question: where do I add a step in the EMR job to run S3DistCp and consolidate the files into a few large ones, as described by @alex in his post here?

Regards
SS.
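
For a rough sense of scale, here is a back-of-the-envelope sketch of how many files would remain after consolidation. It uses the 47K file count and the 8K - 900K size range quoted in this thread; the 128 MiB target size and the function name are assumptions for illustration, and recompression during consolidation would shift the real numbers.

```python
import math

KIB = 1024
MIB = 1024 * KIB


def consolidated_file_count(num_files, avg_file_bytes, target_bytes=128 * MIB):
    """Estimate how many target-sized files the same data would occupy."""
    total_bytes = num_files * avg_file_bytes
    return math.ceil(total_bytes / target_bytes)


# Lower and upper bounds from the per-file sizes quoted above.
low = consolidated_file_count(47_000, 8 * KIB)
high = consolidated_file_count(47_000, 900 * KIB)
print(low, high)  # 3 323
```

Either way, that is somewhere between a handful and a few hundred input files instead of 47K.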

alex · January 24, 2017, 12:31am

@13scoobie and @mike are right - that’s a ton of files!

To fix the problem going forwards, adjust the buffer configuration for your S3 Sink. To resolve the historical problem, you can do the following:

1. Remove the .lzo.index files
2. Run compaction using /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar
3. (Optional) Re-index the files using s3://snowplow-hosted-assets/third-party/twitter/hadoop-lzo-0.4.20.jar
4. Kick off the regular Snowplow job from the EMR phase
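
As a minimal sketch of the compaction step, the snippet below builds an EMR step definition that runs the S3DistCp jar mentioned above. The bucket paths, the step name, and the `--groupBy` regex are hypothetical placeholders, and the helper `build_compaction_step` is only for illustration; check the regex against your actual file names before running anything.

```python
S3_DIST_CP_JAR = "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar"


def build_compaction_step(src, dest, group_by, target_size_mib=128, codec="lzo"):
    """Build an EMR step definition that runs S3DistCp to merge small files."""
    args = [
        "--src", src,
        "--dest", dest,
        "--groupBy", group_by,                   # files whose capture groups match are merged
        "--targetSize", str(target_size_mib),    # approximate output file size in MiB
        "--outputCodec", codec,                  # recompress the merged output
    ]
    return {
        "Name": "Compact small LZO files",       # hypothetical step name
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {"Jar": S3_DIST_CP_JAR, "Args": args},
    }


# Hypothetical buckets and pattern: adjust to your own layout and file names.
step = build_compaction_step(
    src="s3://my-bucket/processing/",
    dest="s3://my-bucket/consolidated/",
    group_by=r".*(2017-01)-.*\.lzo",
)
print(" ".join(step["HadoopJarStep"]["Args"]))
```

The resulting dictionary has the shape that the EMR API expects for a step, so it could be submitted to a running cluster with whichever client you already use.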

bernardosrulzon · January 24, 2017, 10:32am

Don’t the files get aggregated into 128 MB chunks in the S3DistCp step? The large number of files wouldn’t explain why the EMR process took 20 hours to run, right?

sachinsingh10 · January 25, 2017, 3:07am

Sorry @alex, I am unable to find the setting to adjust the buffer in Snowplow. It seems Kinesis (Streams) doesn’t allow for that, although I know Kinesis Firehose does. Am I totally off track here?

Regards
SS

mike · January 25, 2017, 3:40am

@sachinsingh10 You’ll want to look at the buffer settings in your configuration file here (under buffer).
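
To illustrate why those buffer settings matter, here is a simplified simulation of a sink that writes a file whenever any one of three thresholds is reached. The field names `byte_limit`, `record_limit` and `time_limit_ms` are assumptions about what the buffer section contains, and the flush logic is a sketch rather than the sink's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class BufferLimits:
    byte_limit: int      # flush once this many bytes are buffered
    record_limit: int    # flush once this many records are buffered
    time_limit_ms: int   # flush once the oldest buffered record is this old


def count_files(events, limits):
    """Count files written for time-ordered (timestamp_ms, size_bytes) events."""
    files = 0
    buffered_bytes = 0
    buffered_records = 0
    first_ts = None
    for ts, size in events:
        # Time-based flush: the buffer has been open for too long.
        if buffered_records and ts - first_ts >= limits.time_limit_ms:
            files += 1
            buffered_bytes = buffered_records = 0
        if buffered_records == 0:
            first_ts = ts
        buffered_bytes += size
        buffered_records += 1
        # Size-based or count-based flush.
        if buffered_bytes >= limits.byte_limit or buffered_records >= limits.record_limit:
            files += 1
            buffered_bytes = buffered_records = 0
    # Whatever is left in the buffer is written as a final file.
    return files + (1 if buffered_records else 0)


# A low-volume site: one 10 KB event per minute for 100 minutes.
events = [(i * 60_000, 10_000) for i in range(100)]
one_minute = BufferLimits(byte_limit=100_000_000, record_limit=1_000, time_limit_ms=60_000)
ten_minutes = BufferLimits(byte_limit=100_000_000, record_limit=1_000, time_limit_ms=600_000)
print(count_files(events, one_minute), count_files(events, ten_minutes))  # 100 10
```

On a low-volume site the time limit is usually the threshold that fires, so raising it is what cuts the number of files.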
