java.nio.file.FileAlreadyExistsException: ./ip_geo

We’re attempting to reprocess our raw CloudFront logs and are batching several of the old runs into one new run for efficiency, which means the input data is much larger than normal. We’re having difficulty processing this data; it’s probably related to the cluster resources, but the exception I can find in the logs isn’t helpful, and neither is the failure mechanism.

The failure mechanism is that the job is still running as far as EMR is concerned, it’s just taking much longer than expected, although admittedly we don’t have a good expectation here. We processed one week of data and that took 40 minutes for the enrich step, while the next week was still going after 1.5 hours. The data volume is only a little larger for the failing job.

Looking through the logs, the only error I can find is:

java.nio.file.FileAlreadyExistsException: ./ip_geo

This appears in several applications and in the drivers. Does this point at anything? I’ve pasted part of the log file below; the whole thing goes over the post character limit.

Thanks
Gareth

P.S. There is a lot of grumbling I could do about attaching documents to Discourse. For a start, .txt files aren’t authorised.

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/yarn/usercache/hadoop/filecache/10/__spark_libs__2445741660547156520.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
...
18/01/10 10:49:42 INFO ApplicationMaster: 
===============================================================================
YARN executor launch context:
  env:
    CLASSPATH -> /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar<CPS>{{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOOP_HDFS_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*<CPS>/usr/lib/hadoop-lzo/lib/*<CPS>/usr/share/aws/emr/emrfs/conf<CPS>/usr/share/aws/emr/emrfs/lib/*<CPS>/usr/share/aws/emr/emrfs/auxlib/*<CPS>/usr/share/aws/emr/lib/*<CPS>/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar<CPS>/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar<CPS>/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar<CPS>/usr/lib/spark/yarn/lib/datanucleus-api-jdo.jar<CPS>/usr/lib/spark/yarn/lib/datanucleus-core.jar<CPS>/usr/lib/spark/yarn/lib/datanucleus-rdbms.jar<CPS>/usr/share/aws/emr/cloudwatch-sink/lib/*<CPS>/usr/share/aws/aws-java-sdk/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*<CPS>/usr/lib/hadoop-lzo/lib/*<CPS>/usr/share/aws/emr/emrfs/conf<CPS>/usr/share/aws/emr/emrfs/lib/*<CPS>/usr/share/aws/emr/emrfs/auxlib/*<CPS>/usr/share/aws/emr/lib/*<CPS>/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar<CPS>/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar<CPS>/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar<CPS>/usr/share/aws/emr/cloudwatch-sink/lib/*<CPS>/usr/share/aws/aws-java-sdk/*
    SPARK_YARN_STAGING_DIR -> hdfs://ip-10-163-2-164.aws.metail.org:8020/user/hadoop/.sparkStaging/application_1515576944603_0003
    SPARK_USER -> hadoop
    SPARK_YARN_MODE -> true
    SPARK_PUBLIC_DNS -> ip-10-163-2-169.aws.metail.org

  command:
    LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:$LD_LIBRARY_PATH" \ 
      {{JAVA_HOME}}/bin/java \ 
      -server \ 
      -Xmx5120m \ 
      '-verbose:gc' \ 
      '-XX:+PrintGCDetails' \ 
      '-XX:+PrintGCDateStamps' \ 
      '-XX:+UseConcMarkSweepGC' \ 
      '-XX:CMSInitiatingOccupancyFraction=70' \ 
      '-XX:MaxHeapFreeRatio=70' \ 
      '-XX:+CMSClassUnloadingEnabled' \ 
      '-XX:OnOutOfMemoryError=kill -9 %p' \ 
      -Djava.io.tmpdir={{PWD}}/tmp \ 
      '-Dspark.history.ui.port=18080' \ 
      -Dspark.yarn.app.container.log.dir=<LOG_DIR> \ 
      org.apache.spark.executor.CoarseGrainedExecutorBackend \ 
      --driver-url \ 
      spark://CoarseGrainedScheduler@10.163.2.169:45603 \ 
      --executor-id \ 
      <executorId> \ 
      --hostname \ 
      <hostname> \ 
      --cores \ 
      4 \ 
      --app-id \ 
      application_1515576944603_0003 \ 
      --user-class-path \ 
      file:$PWD/__app__.jar \ 
      1><LOG_DIR>/stdout \ 
      2><LOG_DIR>/stderr

  resources:
    __app__.jar -> resource { scheme: "hdfs" host: "ip-10-163-2-164.aws.metail.org" port: 8020 file: "/user/hadoop/.sparkStaging/application_1515576944603_0003/snowplow-spark-enrich-1.10.0.jar" } size: 61014239 timestamp: 1515577472794 type: FILE visibility: PRIVATE
    __spark_libs__ -> resource { scheme: "hdfs" host: "ip-10-163-2-164.aws.metail.org" port: 8020 file: "/user/hadoop/.sparkStaging/application_1515576944603_0003/__spark_libs__2445741660547156520.zip" } size: 204439621 timestamp: 1515577455622 type: ARCHIVE visibility: PRIVATE
    __spark_conf__ -> resource { scheme: "hdfs" host: "ip-10-163-2-164.aws.metail.org" port: 8020 file: "/user/hadoop/.sparkStaging/application_1515576944603_0003/__spark_conf__.zip" } size: 79242 timestamp: 1515577473183 type: ARCHIVE visibility: PRIVATE

===============================================================================
...
18/01/10 10:49:42 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
18/01/10 10:49:46 INFO SparkContext: Added file http://snowplow-hosted-assets.s3.amazonaws.com/third-party/maxmind/GeoLiteCity.dat at http://snowplow-hosted-assets.s3.amazonaws.com/third-party/maxmind/GeoLiteCity.dat with timestamp 1515581386285
18/01/10 10:49:46 INFO Utils: Fetching http://snowplow-hosted-assets.s3.amazonaws.com/third-party/maxmind/GeoLiteCity.dat to /mnt/yarn/usercache/hadoop/appcache/application_1515576944603_0003/spark-c1e13026-f9b2-48fd-a7ba-0fb6663a4157/userFiles-514879c0-7aa3-4e12-a063-ed8de7171b2e/fetchFileTemp5616370862811701734.tmp
...
18/01/10 10:49:51 INFO FileInputFormat: Total input paths to process : 44
...
18/01/10 10:49:51 INFO DAGScheduler: Submitting 44 missing tasks from ResultStage 0 (MapPartitionsRDD[7] at text at EnrichJob.scala:202) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
18/01/10 10:49:51 INFO YarnClusterScheduler: Adding task set 0.0 with 44 tasks
18/01/10 10:49:52 INFO YarnAllocator: Driver requested a total number of 1 executor(s).
18/01/10 10:49:52 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 1)
18/01/10 10:49:52 INFO YarnAllocator: Will request 1 executor container(s), each with 4 core(s) and 5632 MB memory (including 512 MB of overhead)
18/01/10 10:49:52 INFO YarnAllocator: Submitted container request for host ip-10-163-2-176.aws.metail.org,10.163.2.163,10.163.2.147,10.163.2.169,10.163.2.176,ip-10-163-2-147.aws.metail.org,ip-10-163-2-163.aws.metail.org.
18/01/10 10:49:53 INFO AMRMClientImpl: Received new token for : ip-10-163-2-176.aws.metail.org:8041
18/01/10 10:49:53 INFO YarnAllocator: Launching container container_1515576944603_0003_02_000002 on host ip-10-163-2-176.aws.metail.org for executor with ID 1
18/01/10 10:49:53 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
...
18/01/10 10:49:58 WARN TaskSetManager: Lost task 12.0 in stage 0.0 (TID 2, ip-10-163-2-176.aws.metail.org, executor 1): java.nio.file.FileAlreadyExistsException: ./ip_geo
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:88)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixFileSystemProvider.createSymbolicLink(UnixFileSystemProvider.java:457)
	at java.nio.file.Files.createSymbolicLink(Files.java:1043)
	at com.snowplowanalytics.snowplow.enrich.spark.EnrichJob$$anonfun$createSymbolicLinks$2.apply(EnrichJob.scala:131)
	at com.snowplowanalytics.snowplow.enrich.spark.EnrichJob$$anonfun$createSymbolicLinks$2.apply(EnrichJob.scala:128)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
	at com.snowplowanalytics.snowplow.enrich.spark.EnrichJob$.createSymbolicLinks(EnrichJob.scala:128)
	at com.snowplowanalytics.snowplow.enrich.spark.EnrichJob$$anonfun$4.apply(EnrichJob.scala:181)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

18/01/10 10:49:58 INFO TaskSetManager: Lost task 5.0 in stage 0.0 (TID 1) on ip-10-163-2-176.aws.metail.org, executor 1: java.nio.file.FileAlreadyExistsException (./ip_geo) [duplicate 1]
18/01/10 10:49:58 INFO TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on ip-10-163-2-176.aws.metail.org, executor 1: java.nio.file.FileAlreadyExistsException (./ip_geo) [duplicate 2]
18/01/10 10:49:58 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 5, ip-10-163-2-176.aws.metail.org, executor 2, partition 0, NODE_LOCAL, 4888 bytes)
18/01/10 10:49:58 INFO TaskSetManager: Starting task 5.1 in stage 0.0 (TID 6, ip-10-163-2-176.aws.metail.org, executor 1, partition 5, NODE_LOCAL, 4889 bytes)
18/01/10 10:49:58 INFO TaskSetManager: Starting task 12.1 in stage 0.0 (TID 7, ip-10-163-2-176.aws.metail.org, executor 2, partition 12, NODE_LOCAL, 4889 bytes)
18/01/10 10:49:58 INFO YarnAllocator: Launching container container_1515576944603_0003_02_000006 on host ip-10-163-2-176.aws.metail.org for executor with ID 5
18/01/10 10:49:58 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
...
18/01/10 10:50:12 INFO TaskSetManager: Starting task 36.0 in stage 0.0 (TID 41, ip-10-163-2-169.aws.metail.org, executor 11, partition 36, RACK_LOCAL, 4889 bytes)
18/01/10 10:50:12 INFO TaskSetManager: Starting task 37.0 in stage 0.0 (TID 42, ip-10-163-2-169.aws.metail.org, executor 11, partition 37, RACK_LOCAL, 4889 bytes)
18/01/10 10:50:12 WARN TaskSetManager: Lost task 33.0 in stage 0.0 (TID 38, ip-10-163-2-169.aws.metail.org, executor 11): java.nio.file.FileAlreadyExistsException: ./ip_geo
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:88)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixFileSystemProvider.createSymbolicLink(UnixFileSystemProvider.java:457)
	at java.nio.file.Files.createSymbolicLink(Files.java:1043)
	at com.snowplowanalytics.snowplow.enrich.spark.EnrichJob$$anonfun$createSymbolicLinks$2.apply(EnrichJob.scala:131)
	at com.snowplowanalytics.snowplow.enrich.spark.EnrichJob$$anonfun$createSymbolicLinks$2.apply(EnrichJob.scala:128)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
	at com.snowplowanalytics.snowplow.enrich.spark.EnrichJob$.createSymbolicLinks(EnrichJob.scala:128)
	at com.snowplowanalytics.snowplow.enrich.spark.EnrichJob$$anonfun$4.apply(EnrichJob.scala:181)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

18/01/10 10:50:12 INFO TaskSetManager: Lost task 32.0 in stage 0.0 (TID 37) on ip-10-163-2-169.aws.metail.org, executor 11: java.nio.file.FileAlreadyExistsException (./ip_geo) [duplicate 1]
18/01/10 10:50:12 INFO TaskSetManager: Starting task 32.1 in stage 0.0 (TID 43, ip-10-163-2-169.aws.metail.org, executor 11, partition 32, RACK_LOCAL, 4889 bytes)
18/01/10 10:50:12 INFO TaskSetManager: Lost task 34.0 in stage 0.0 (TID 39) on ip-10-163-2-169.aws.metail.org, executor 11: java.nio.file.FileAlreadyExistsException (./ip_geo) [duplicate 2]
18/01/10 10:50:13 INFO TaskSetManager: Starting task 34.1 in stage 0.0 (TID 44, ip-10-163-2-176.aws.metail.org, executor 8, partition 34, RACK_LOCAL, 4889 bytes)
18/01/10 10:50:13 INFO TaskSetManager: Lost task 27.0 in stage 0.0 (TID 32) on ip-10-163-2-176.aws.metail.org, executor 8: java.nio.file.FileAlreadyExistsException (./ip_geo) [duplicate 3]
18/01/10 10:50:13 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.163.2.176:53448) with ID 10
18/01/10 10:50:13 INFO TaskSetManager: Starting task 27.1 in stage 0.0 (TID 45, ip-10-163-2-176.aws.metail.org, executor 10, partition 27, RACK_LOCAL, 4888 bytes)
18/01/10 10:50:13 INFO TaskSetManager: Starting task 33.1 in stage 0.0 (TID 46, ip-10-163-2-176.aws.metail.org, executor 10, partition 33, RACK_LOCAL, 4888 bytes)
18/01/10 10:50:13 INFO TaskSetManager: Starting task 38.0 in stage 0.0 (TID 47, ip-10-163-2-176.aws.metail.org, executor 10, partition 38, RACK_LOCAL, 4889 bytes)
18/01/10 10:50:13 INFO TaskSetManager: Starting task 39.0 in stage 0.0 (TID 48, ip-10-163-2-176.aws.metail.org, executor 10, partition 39, RACK_LOCAL, 4888 bytes)
18/01/10 10:50:13 INFO ExecutorAllocationManager: New executor 10 has registered (new total is 10)
18/01/10 10:50:14 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-163-2-176.aws.metail.org:35969 with 2.8 GB RAM, BlockManagerId(10, ip-10-163-2-176.aws.metail.org, 35969, None)
18/01/10 10:50:14 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.163.2.176:53468) with ID 9
18/01/10 10:50:14 INFO TaskSetManager: Starting task 40.0 in stage 0.0 (TID 49, ip-10-163-2-176.aws.metail.org, executor 9, partition 40, RACK_LOCAL, 4889 bytes)
18/01/10 10:50:14 INFO TaskSetManager: Starting task 41.0 in stage 0.0 (TID 50, ip-10-163-2-176.aws.metail.org, executor 9, partition 41, RACK_LOCAL, 4889 bytes)
18/01/10 10:50:14 INFO TaskSetManager: Starting task 42.0 in stage 0.0 (TID 51, ip-10-163-2-176.aws.metail.org, executor 9, partition 42, RACK_LOCAL, 4889 bytes)
18/01/10 10:50:14 INFO TaskSetManager: Starting task 43.0 in stage 0.0 (TID 52, ip-10-163-2-176.aws.metail.org, executor 9, partition 43, RACK_LOCAL, 4889 bytes)
18/01/10 10:50:14 INFO ExecutorAllocationManager: New executor 9 has registered (new total is 11)
18/01/10 10:50:15 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-163-2-176.aws.metail.org:39117 with 2.8 GB RAM, BlockManagerId(9, ip-10-163-2-176.aws.metail.org, 39117, None)
18/01/10 10:50:16 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-10-163-2-176.aws.metail.org:35969 (size: 39.8 KB, free: 2.8 GB)
18/01/10 10:50:17 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-10-163-2-176.aws.metail.org:39117 (size: 39.8 KB, free: 2.8 GB)
18/01/10 10:50:18 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-10-163-2-176.aws.metail.org:35969 (size: 26.2 KB, free: 2.8 GB)
18/01/10 10:50:19 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-10-163-2-176.aws.metail.org:39117 (size: 26.2 KB, free: 2.8 GB)
18/01/10 10:51:34 INFO YarnAllocator: Yarn node state updated for host ip-10-163-2-147.aws.metail.org to RUNNING
18/01/10 10:51:40 INFO YarnAllocator: Yarn node state updated for host ip-10-163-2-163.aws.metail.org to RUNNING
18/01/10 11:01:03 INFO TaskSetManager: Finished task 32.1 in stage 0.0 (TID 43) in 651336 ms on ip-10-163-2-169.aws.metail.org (executor 11) (1/44)
18/01/10 11:01:28 INFO TaskSetManager: Finished task 36.0 in stage 0.0 (TID 41) in 675492 ms on ip-10-163-2-169.aws.metail.org (executor 11) (2/44)
18/01/10 11:01:29 INFO TaskSetManager: Finished task 35.0 in stage 0.0 (TID 40) in 679281 ms on ip-10-163-2-169.aws.metail.org (executor 11) (3/44)
18/01/10 11:02:29 INFO TaskSetManager: Finished task 37.0 in stage 0.0 (TID 42) in 737187 ms on ip-10-163-2-169.aws.metail.org (executor 11) (4/44)
18/01/10 11:02:29 INFO YarnAllocator: Driver requested a total number of 10 executor(s).
18/01/10 11:02:41 INFO BlockManagerInfo: Added rdd_3_12 in memory on ip-10-163-2-176.aws.metail.org:35641 (size: 933.0 MB, free: 1894.0 MB)
18/01/10 11:02:50 INFO TaskSetManager: Finished task 12.2 in stage 0.0 (TID 9) in 769403 ms on ip-10-163-2-176.aws.metail.org (executor 2) (5/44)
18/01/10 11:03:29 INFO ExecutorAllocationManager: Request to remove executorIds: 11
18/01/10 11:03:29 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 11
18/01/10 11:03:29 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 11
18/01/10 11:03:29 INFO YarnAllocator: Driver requested a total number of 9 executor(s).
18/01/10 11:03:29 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 11.
18/01/10 11:03:29 INFO ExecutorAllocationManager: Removing executor 11 because it has been idle for 60 seconds (new desired total will be 10)
18/01/10 11:03:34 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 11.
18/01/10 11:03:34 INFO DAGScheduler: Executor lost: 11 (epoch 0)
18/01/10 11:03:34 INFO BlockManagerMasterEndpoint: Trying to remove executor 11 from BlockManagerMaster.
18/01/10 11:03:34 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(11, ip-10-163-2-169.aws.metail.org, 44603, None)
18/01/10 11:03:34 INFO BlockManagerMaster: Removed 11 successfully in removeExecutor
18/01/10 11:03:34 INFO YarnClusterScheduler: Executor 11 on ip-10-163-2-169.aws.metail.org killed by driver.
18/01/10 11:03:34 INFO ExecutorAllocationManager: Existing executor 11 has been removed (new total is 10)
18/01/10 11:03:46 INFO TaskSetManager: Finished task 0.1 in stage 0.0 (TID 5) in 828845 ms on ip-10-163-2-176.aws.metail.org (executor 2) (6/44)
18/01/10 11:05:19 INFO BlockManagerInfo: Added rdd_3_26 in memory on ip-10-163-2-176.aws.metail.org:35641 (size: 1214.9 MB, free: 679.1 MB)
18/01/10 11:05:26 INFO TaskSetManager: Finished task 26.1 in stage 0.0 (TID 8) in 925990 ms on ip-10-163-2-176.aws.metail.org (executor 2) (7/44)
18/01/10 11:09:22 INFO TaskSetManager: Finished task 19.0 in stage 0.0 (TID 3) in 1166776 ms on ip-10-163-2-176.aws.metail.org (executor 1) (8/44)

What does your EMR config look like?

I faced the same issue while reprocessing recovery batches (a custom recovery process). To use the cluster more efficiently, I wanted to submit multiple jobs on the same cluster instead of recreating it each time. I did come across a ticket saying this issue had been fixed, but in the end I ended up spinning up a new cluster with a clean HDFS for every batch.

@mjensen It’s got a master of 1x m1.medium and a core of 2x m4.4xlarge. There’s no additional configuration customisation.

That warning is from the cluster trying to download the GeoLiteCity database locally, but the file apparently already exists. Are you cleaning up or terminating the cluster after each recovery?
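
For context, the mechanism looks roughly like this (a sketch with illustrative names, not the actual EnrichJob code): the driver ships GeoLiteCity.dat to every executor via SparkContext.addFile, and each task then symlinks the local copy to the fixed relative name ./ip_geo that Scala Common Enrich opens.

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.{SparkContext, SparkFiles}

object IpGeoSetup {
  // Driver side: ship the MaxMind database to every executor's working directory.
  def register(sc: SparkContext): Unit =
    sc.addFile("http://snowplow-hosted-assets.s3.amazonaws.com/third-party/maxmind/GeoLiteCity.dat")

  // Executor side, inside a task: resolve the fetched copy and expose it
  // under the fixed relative name that the IP lookups enrichment expects.
  def linkLocally(): Unit = {
    val localCopy = Paths.get(SparkFiles.get("GeoLiteCity.dat"))
    // Throws FileAlreadyExistsException if something already created ./ip_geo.
    Files.createSymbolicLink(Paths.get("./ip_geo"), localCopy)
  }
}
```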

Yes, this is a new cluster for each new batch/recovery. We’re using snowplow-emr-etl-runner 0.29.0 with yarn.resourcemanager.am.max-attempts: "1", so the file shouldn’t exist on start-up, and hopefully nothing is retrying the containers.

Presumably it is harmless? I think our real problem is that the ETL is very, very slow on larger batches, but that might be a bit of a mystery.

This is caused by the fact that there are multiple executors running on the same machine.
Since Scala Common Enrich relies on a file named ./ip_geo, they are all trying to download the file, possibly at the same time, which triggers the issue.
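
For illustration, something along these lines (a sketch, not the actual Scala Common Enrich code) would make the link creation race-tolerant, by treating "already exists" as success:

```scala
import java.nio.file.{FileAlreadyExistsException, Files, Path, Paths}

object SymlinkUtil {
  // Create the symlink, but tolerate a concurrent creation: another executor
  // on the same host simply won the race for the same distributed file.
  def linkIdempotently(target: Path, linkName: String): Path = {
    val link = Paths.get(linkName)
    try Files.createSymbolicLink(link, target)
    catch { case _: FileAlreadyExistsException => link }
  }
}
```

Used in place of a bare createSymbolicLink call, the first task to arrive creates ./ip_geo and later tasks on the same host just reuse it.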

The failed tasks are immediately retried, however, and start successfully since they can now detect that the file already exists. This poses no threat to pipeline integrity, given that the task is restarted idempotently and, furthermore, had not processed any data yet.

This is a shortcoming that we will be addressing in the future.

Thanks for the reply. My problem was unrelated, but it’s good to know the answer :slight_smile: