Hello,
I’m running EmrEtlRunner in rolling mode and data occasionally goes missing. The runner moves two different raw log files onto the same target key in the processing bucket, so one hour of data is lost when the first file is overwritten. Here’s the log output:
```
00:00:15.160 D, [2017-05-13T22:01:29.041000 #14038] DEBUG -- : Staging raw logs...
00:00:18.783 moving files from s3://elasticbeanstalk-eu-central-1-890672996299/resources/environments/logs/publish/e-3vymgtnbsi/ to s3://my-snowplow-etl/processing/
00:00:28.155 (t0) MOVE elasticbeanstalk-eu-central-1-890672996299/resources/environments/logs/publish/e-3vymgtnbsi/i-06f6975f15b3ae8e6/_var_log_tomcat8_rotated_localhost_access_log.txt1494698465.gz -> my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:00:28.155 (t2) MOVE elasticbeanstalk-eu-central-1-890672996299/resources/environments/logs/publish/e-3vymgtnbsi/i-06f6975f15b3ae8e6/_var_log_tomcat8_rotated_localhost_access_log.txt1494702066.gz -> my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:00:32.927 +-> my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:00:33.255 x elasticbeanstalk-eu-central-1-890672996299/resources/environments/logs/publish/e-3vymgtnbsi/i-06f6975f15b3ae8e6/_var_log_tomcat8_rotated_localhost_access_log.txt1494698465.gz
00:00:35.485 +-> my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:00:35.647 x elasticbeanstalk-eu-central-1-890672996299/resources/environments/logs/publish/e-3vymgtnbsi/i-06f6975f15b3ae8e6/_var_log_tomcat8_rotated_localhost_access_log.txt1494702066.gz
00:00:35.648 D, [2017-05-13T22:01:49.540000 #14038] DEBUG -- : Waiting a minute to allow S3 to settle (eventual consistency)
00:01:35.658 D, [2017-05-13T22:02:49.549000 #14038] DEBUG -- : Initializing EMR jobflow
00:01:37.230 D, [2017-05-13T22:02:51.121000 #14038] DEBUG -- : EMR jobflow j-2LPBL697K2LYT started, waiting for jobflow to complete...
00:31:38.801 D, [2017-05-13T22:32:52.692000 #14038] DEBUG -- : EMR jobflow j-2LPBL697K2LYT completed successfully.
00:31:38.804 D, [2017-05-13T22:32:52.696000 #14038] DEBUG -- : Archiving CloudFront logs...
00:31:39.064 moving files from s3://my-snowplow-etl/processing/ to s3://my-snowplow-archive/raw/
00:31:39.512 (t0) MOVE my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz -> my-snowplow-archive/raw/2017-05-13/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:31:44.699 +-> my-snowplow-archive/raw/2017-05-13/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
00:31:44.837 x my-snowplow-etl/processing/var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz
```
Note how both raw logs, txt1494698465.gz and txt1494702066.gz, were moved to the same processing destination, var_log_tomcat8_rotated_localhost_access_log.2017-05-13-19.eu-central-1.i-06f6975f15b3ae8e6.txt.gz. As a result, the logfile for hour 2017-05-13-19 is missing. I initially thought the files might be concatenated somehow, but that’s not the case: one hour really is missing from the archives, and it isn’t in the bad bucket either. This happens sporadically, and I haven’t been able to find any pattern to it. My EmrEtlRunner is scheduled to run every 2 hours, and the EMR jobflow takes approximately 30 minutes. Any ideas?
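To make the collision concrete, here is a small Ruby sketch. The two epochs come straight from the log above; everything else (the `target_key` helper and the guess that the hour label is derived from something coarser than the rotation epoch, e.g. the object’s S3 last-modified time) is a hypothetical illustration, not EmrEtlRunner’s actual code:

```ruby
# The two Beanstalk rotation epochs embedded in the raw filenames above.
epochs = [1494698465, 1494702066]

# The embedded rotation timestamps fall in DIFFERENT UTC hours...
epochs.each do |e|
  puts "#{e} -> #{Time.at(e).utc.strftime('%Y-%m-%d-%H')} UTC"
end
# 1494698465 -> 2017-05-13-18 UTC
# 1494702066 -> 2017-05-13-19 UTC
# ...yet both files were renamed to the same "...2017-05-13-19..." key, which
# suggests the hour label is taken from something coarser than the unique
# rotation epoch (my guess: the object's last-modified time on S3).

# Hypothetical target-key builder: if the key is built only from a shared
# hour label, the second MOVE silently overwrites the first.
def target_key(hour_label, region, instance)
  "processing/var_log_tomcat8_rotated_localhost_access_log." \
    "#{hour_label}.#{region}.#{instance}.txt.gz"
end

hour_label = '2017-05-13-19' # the label both files received
keys = epochs.map { target_key(hour_label, 'eu-central-1', 'i-06f6975f15b3ae8e6') }
puts "collision: #{keys.uniq.size == 1}"           # collision: true

# Defensive variant: folding the unique rotation epoch into the key means
# two rotations published in the same hour can never collide.
safe_keys = epochs.map do |e|
  target_key("#{hour_label}.#{e}", 'eu-central-1', 'i-06f6975f15b3ae8e6')
end
puts "distinct after suffixing: #{safe_keys.uniq.size == epochs.size}" # true
```

So two distinct rotations appear to have been published close enough together that they received the same hourly label during staging, and the overwrite follows from there.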
snowplow-emr-etl-runner: 0.22.0
versions:
  hadoop_enrich: 1.8.0
  hadoop_shred: 0.10.0
  hadoop_elasticsearch: 0.1.0