ETL runner overwriting processing logs

Hello,

I’m running EmrEtlRunner in rolling mode and I’m losing data from time to time. The runner sometimes moves two different raw log files onto the same target key in the processing bucket, so the second move overwrites the first and an hour of data is lost. Here’s the log output:

00:00:15.160 D, [2017-05-13T22:01:29.041000 #14038] DEBUG -- : Staging raw logs...
00:00:18.783   moving files from s3://elasticbeanstalk-eu-central-1-890672996299/resources/environments/logs/publish/e-3vymgtnbsi/ to s3://my-snowplow-etl/processing/
00:00:28.155 (t0)    MOVE elasticbeanstalk-eu-central-1-890672996299/resources/environments/logs/publish/e-3vymgtnbsi/i-06f6975f15b3ae8e6/_var_log_tomcat8_rotated_localhost_access_log.txt1494698465.gz -> my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:00:28.155 (t2)    MOVE elasticbeanstalk-eu-central-1-890672996299/resources/environments/logs/publish/e-3vymgtnbsi/i-06f6975f15b3ae8e6/_var_log_tomcat8_rotated_localhost_access_log.txt1494702066.gz -> my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:00:32.927       +-> my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:00:33.255       x elasticbeanstalk-eu-central-1-890672996299/resources/environments/logs/publish/e-3vymgtnbsi/i-06f6975f15b3ae8e6/_var_log_tomcat8_rotated_localhost_access_log.txt1494698465.gz
00:00:35.485       +-> my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:00:35.647       x elasticbeanstalk-eu-central-1-890672996299/resources/environments/logs/publish/e-3vymgtnbsi/i-06f6975f15b3ae8e6/_var_log_tomcat8_rotated_localhost_access_log.txt1494702066.gz
00:00:35.648 D, [2017-05-13T22:01:49.540000 #14038] DEBUG -- : Waiting a minute to allow S3 to settle (eventual consistency)
00:01:35.658 D, [2017-05-13T22:02:49.549000 #14038] DEBUG -- : Initializing EMR jobflow
00:01:37.230 D, [2017-05-13T22:02:51.121000 #14038] DEBUG -- : EMR jobflow j-2LPBL697K2LYT started, waiting for jobflow to complete...
00:31:38.801 D, [2017-05-13T22:32:52.692000 #14038] DEBUG -- : EMR jobflow j-2LPBL697K2LYT completed successfully.
00:31:38.804 D, [2017-05-13T22:32:52.696000 #14038] DEBUG -- : Archiving CloudFront logs...
00:31:39.064   moving files from s3://my-snowplow-etl/processing/ to s3://my-snowplow-archive/raw/
00:31:39.512 (t0)    MOVE my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz -> my-snowplow-archive/raw/2017-05-13/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:31:44.699       +-> my-snowplow-archive/raw/2017-05-13/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:31:44.837       x my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz

Note how both raw logs, txt1494698465.gz and txt1494702066.gz, got moved to the same destination file for processing, named var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz. As a result, an hour of data (the 2017-05-13-19 logfile) is missing. I initially thought the files might be concatenated somehow, but that’s not the case: one hour really is missing from the archive, and it isn’t in the bad bucket either. This happens sporadically and I haven’t been able to find any sort of pattern to it. My EmrEtlRunner is scheduled to run every 2 hours and the EMR jobflow takes approximately 30 minutes. Any ideas?
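
For what it’s worth, here is a rough sketch of how one could scan the raw archive for days with missing hourly files (just plain boto3, not part of EmrEtlRunner; the bucket and prefix are taken from the log above):

    from collections import defaultdict
    import re
    import boto3

    s3 = boto3.client("s3")
    hours_seen = defaultdict(set)

    # Archived names embed the hour, e.g. ...log.2017-05-13-19.eu-central-1.i-xxx.txt.gz
    pattern = re.compile(r"\.(\d{4}-\d{2}-\d{2})-(\d{2})\.")

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="my-snowplow-archive", Prefix="raw/"):
        for obj in page.get("Contents", []):
            m = pattern.search(obj["Key"])
            if m:
                hours_seen[m.group(1)].add(int(m.group(2)))

    # Report any day with fewer than 24 hourly files (the first and last day
    # of collection will naturally be incomplete, so ignore those).
    for day, hours in sorted(hours_seen.items()):
        missing = sorted(set(range(24)) - hours)
        if missing:
            print(day, "missing hours:", missing)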

snowplow-emr-etl-runner 0.22.0
 
versions:
    hadoop_enrich: 1.8.0
    hadoop_shred: 0.10.0
    hadoop_elasticsearch: 0.1.0

Hi @kazgurs1,

I wouldn’t expect that to be an issue. The i-06f6975f15b3ae8e6 part reflects the collector instance ID the logs were rotated from. I assume your collector consists of only one instance.

The missing data is likely caused by something else. From the logs it looks like not much data may have been processed in that run. However, you did mention you couldn’t find the logs in the bad bucket either.

What are the buckets you are using? Could you provide the s3 section of your config.yml?
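
For reference, the s3 section for this generation of EmrEtlRunner looks roughly like the (abridged) sketch below; the raw buckets are filled in from your log output, the rest are placeholders, and the exact keys can vary between releases:

    aws:
      s3:
        region: eu-central-1
        buckets:
          assets: s3://snowplow-hosted-assets
          log: s3://my-snowplow-etl/logs            # placeholder
          raw:
            in:
              - s3://elasticbeanstalk-eu-central-1-890672996299/resources/environments/logs/publish/e-3vymgtnbsi
            processing: s3://my-snowplow-etl/processing
            archive: s3://my-snowplow-archive/raw
          enriched:
            good: s3://my-snowplow-data/enriched/good      # placeholder
            bad: s3://my-snowplow-data/enriched/bad        # placeholder
            errors: s3://my-snowplow-data/enriched/errors  # placeholder
          shredded:
            good: s3://my-snowplow-data/shredded/good      # placeholder
            bad: s3://my-snowplow-data/shredded/bad        # placeholder
            errors: s3://my-snowplow-data/shredded/errors  # placeholder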

It does appear from the logs that the files are indeed being overwritten. It might have something to do with your server’s time zone, as 1494698465 is actually hour 18 UTC (not 19). Does that make sense?
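
Just to spell out the arithmetic, here is a minimal sketch (assuming the trailing number in each raw file name is a Unix epoch in seconds, which is what it looks like):

    from datetime import datetime, timezone, timedelta

    CEST = timezone(timedelta(hours=2))  # the runner box's local offset in summer

    # Rotation timestamps embedded in the two raw file names above
    for ts in (1494698465, 1494702066):
        utc = datetime.fromtimestamp(ts, tz=timezone.utc)
        local = datetime.fromtimestamp(ts, tz=CEST)
        print(f"{ts} -> {utc:%Y-%m-%d-%H} UTC / {local:%Y-%m-%d-%H} CEST")

    # 1494698465 -> 2017-05-13-18 UTC / 2017-05-13-20 CEST
    # 1494702066 -> 2017-05-13-19 UTC / 2017-05-13-21 CEST

Only the second file actually falls in hour 19 UTC, yet both were renamed to the 2017-05-13-19 target, which is why a timezone or clock mismatch in the renaming step looks suspicious to me.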

I can see that the collector’s timezone is UTC, but the server I’m running EmrEtlRunner on is on CEST. Could that be it?

Changing the server timezone does not solve this issue or change anything apparent. Any other ideas?