Hello,
I’m running EmrEtlRunner in rolling mode and data occasionally goes missing. The runner moves two different raw log files onto the same target key in the processing bucket, so one hour of data is lost when the first file is overwritten. Here’s the log output:
```
00:00:15.160 D, [2017-05-13T22:01:29.041000 #14038] DEBUG -- : Staging raw logs...
00:00:18.783 moving files from s3://elasticbeanstalk-eu-central-1-890672996299/resources/environments/logs/publish/e-3vymgtnbsi/ to s3://my-snowplow-etl/processing/
00:00:28.155 (t0) MOVE elasticbeanstalk-eu-central-1-890672996299/resources/environments/logs/publish/e-3vymgtnbsi/i-06f6975f15b3ae8e6/_var_log_tomcat8_rotated_localhost_access_log.txt1494698465.gz -> my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:00:28.155 (t2) MOVE elasticbeanstalk-eu-central-1-890672996299/resources/environments/logs/publish/e-3vymgtnbsi/i-06f6975f15b3ae8e6/_var_log_tomcat8_rotated_localhost_access_log.txt1494702066.gz -> my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:00:32.927 +-> my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:00:33.255 x elasticbeanstalk-eu-central-1-890672996299/resources/environments/logs/publish/e-3vymgtnbsi/i-06f6975f15b3ae8e6/_var_log_tomcat8_rotated_localhost_access_log.txt1494698465.gz
00:00:35.485 +-> my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:00:35.647 x elasticbeanstalk-eu-central-1-890672996299/resources/environments/logs/publish/e-3vymgtnbsi/i-06f6975f15b3ae8e6/_var_log_tomcat8_rotated_localhost_access_log.txt1494702066.gz
00:00:35.648 D, [2017-05-13T22:01:49.540000 #14038] DEBUG -- : Waiting a minute to allow S3 to settle (eventual consistency)
00:01:35.658 D, [2017-05-13T22:02:49.549000 #14038] DEBUG -- : Initializing EMR jobflow
00:01:37.230 D, [2017-05-13T22:02:51.121000 #14038] DEBUG -- : EMR jobflow j-2LPBL697K2LYT started, waiting for jobflow to complete...
00:31:38.801 D, [2017-05-13T22:32:52.692000 #14038] DEBUG -- : EMR jobflow j-2LPBL697K2LYT completed successfully.
00:31:38.804 D, [2017-05-13T22:32:52.696000 #14038] DEBUG -- : Archiving CloudFront logs...
00:31:39.064 moving files from s3://my-snowplow-etl/processing/ to s3://my-snowplow-archive/raw/
00:31:39.512 (t0) MOVE my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz -> my-snowplow-archive/raw/2017-05-13/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:31:44.699 +-> my-snowplow-archive/raw/2017-05-13/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:31:44.837 x my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
```
Note how both raw logs, txt1494698465.gz and txt1494702066.gz, were moved to the same processing destination, var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz. As a result, the logfile for hour 2017-05-13-19 is missing. I initially thought the files might be concatenated somehow, but that’s not the case: one hour really is missing from the archives, and it isn’t in the bad bucket either. This happens sporadically, and I haven’t been able to find any pattern to it. My EmrEtlRunner is scheduled to run every 2 hours, and the EMR jobflow takes approximately 30 minutes. Any ideas?
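To make the collision concrete, here is a small Ruby sketch. The two epochs come straight from the log above; everything else (the `target_key` helper and the guess that the hour label is derived from something coarser than the rotation epoch, e.g. the object’s S3 last-modified time) is a hypothetical illustration, not EmrEtlRunner’s actual code:

```ruby
# The two Beanstalk rotation epochs embedded in the raw filenames above.
epochs = [1494698465, 1494702066]

# The embedded rotation timestamps fall in DIFFERENT UTC hours...
epochs.each do |e|
  puts "#{e} -> #{Time.at(e).utc.strftime('%Y-%m-%d-%H')} UTC"
end
# 1494698465 -> 2017-05-13-18 UTC
# 1494702066 -> 2017-05-13-19 UTC
# ...yet both files were renamed to the same "...2017-05-13-19..." key, which
# suggests the hour label is taken from something coarser than the unique
# rotation epoch (my guess: the object's last-modified time on S3).

# Hypothetical target-key builder: if the key is built only from a shared
# hour label, the second MOVE silently overwrites the first.
def target_key(hour_label, region, instance)
  "processing/var_log_tomcat8_rotated_localhost_access_log." \
    "#{hour_label}.#{region}.#{instance}.txt.gz"
end

hour_label = '2017-05-13-19' # the label both files received
keys = epochs.map { target_key(hour_label, 'eu-central-1', 'i-06f6975f15b3ae8e6') }
puts "collision: #{keys.uniq.size == 1}"           # collision: true

# Defensive variant: folding the unique rotation epoch into the key means
# two rotations published in the same hour can never collide.
safe_keys = epochs.map do |e|
  target_key("#{hour_label}.#{e}", 'eu-central-1', 'i-06f6975f15b3ae8e6')
end
puts "distinct after suffixing: #{safe_keys.uniq.size == epochs.size}" # true
```

So two distinct rotations appear to have been published close enough together that they received the same hourly label during staging, and the overwrite follows from there.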
snowplow-emr-etl-runner: 0.22.0
versions:
  hadoop_enrich: 1.8.0
  hadoop_shred: 0.10.0
  hadoop_elasticsearch: 0.1.0