I have been trying for the past 2 days, without success, to get my EmrEtlRunner batch to resume work after a crash. I hope someone can shed light on the error I have been getting.
When running EmrEtlRunner, at a seemingly arbitrary point, I would see this error and the process would stop:
Once it happened right during the staging of the files from the log bucket to the etl/processing bucket. The last time, it happened while the processing files were being moved AFTER the EMR process completed successfully:
+-> ih-stats-prod-archive/raw/2017-09-14-49573367394269636484214968494006254435221012695805329410-49573367394269636484214968517991342696375263768313069570.lzo
x ih-stats-prod-etl/processing/2017-09-14-49573367394269636484214968569011639061571475733429092354-49573367394269636484214968595931999212750046373403951106.lzo.index
(t0) MOVE ih-stats-prod-etl/processing/2017-09-14-49573367394269636484214968595980356245534631609111674882-49573367394269636484214968621678492393082812290897543170.lzo -> ih-stats-prod-archive/raw/2017-09-14-49573367394269636484214968595980356245534631609111674882-49573367394269636484214968621678492393082812290897543170.lzo
x ih-stats-prod-etl/processing/2017-09-14-49573367394269636484214968494006254435221012695805329410-49573367394269636484214968517991342696375263768313069570.lzo
(t3) MOVE ih-stats-prod-etl/processing/2017-09-14-49573367394269636484214968595980356245534631609111674882-49573367394269636484214968621678492393082812290897543170.lzo.index -> ih-stats-prod-archive/raw/2017-09-14-49573367394269636484214968595980356245534631609111674882-49573367394269636484214968621678492393082812290897543170.lzo.index
Problem copying processing/2017-09-14-49573367394269636484214968494006254435221012695805329410-49573367394269636484214968517991342696375263768313069570.lzo.index. Retrying.
Problem copying processing/2017-09-14-49573367394269636484214968518667132229539841820571205634-49573367394269636484214968543292951175089846286953742338.lzo.index. Retrying.
Problem copying processing/2017-09-14-49573367394269636484214968469657279502362758310430507010-49573367394269636484214968493616980321305101964110987266.lzo. Retrying.
Problem copying processing/2017-09-14-49573367394269636484214968569011639061571475733429092354-49573367394269636484214968595931999212750046373403951106.lzo. Retrying.
Problem copying processing/2017-09-14-49573367394269636484214968518667132229539841820571205634-49573367394269636484214968543292951175089846286953742338.lzo. Retrying.
Problem copying processing/2017-09-14-49573367394269636484214968352265746714503765354478567426-49573367394269636484214968375920798226903222662291521538.lzo. Retrying.
Problem copying processing/2017-09-14-49573367394269636484214968543492423935326260169499738114-49573367394269636484214968568758973565272018167196024834.lzo.index. Retrying.
+-> ih-stats-prod-archive/raw/2017-09-14-49573367394269636484214968595980356245534631609111674882-49573367394269636484214968621678492393082812290897543170.lzo
+-> ih-stats-prod-archive/raw/2017-09-14-49573367394269636484214968595980356245534631609111674882-49573367394269636484214968621678492393082812290897543170.lzo.index
+-> ih-stats-prod-archive/raw/2017-09-14-49573367394269636484214968543492423935326260169499738114-49573367394269636484214968568758973565272018167196024834.lzo
x ih-stats-prod-etl/processing/2017-09-14-49573367394269636484214968595980356245534631609111674882-49573367394269636484214968621678492393082812290897543170.lzo
(t0) MOVE ih-stats-prod-etl/processing/2017-09-14-49573367394269636484214968621706297686933948830635261954-49573367394269636484214968646628303458289537513540288514.lzo -> ih-stats-prod-archive/raw/2017-09-14-49573367394269636484214968621706297686933948830635261954-49573367394269636484214968646628303458289537513540288514.lzo
x ih-stats-prod-etl/processing/2017-09-14-49573367394269636484214968595980356245534631609111674882-49573367394269636484214968621678492393082812290897543170.lzo.index
(t3) MOVE ih-stats-prod-etl/processing/2017-09-14-49573367394269636484214968621706297686933948830635261954-49573367394269636484214968646628303458289537513540288514.lzo.index -> ih-stats-prod-archive/raw/2017-09-14-49573367394269636484214968621706297686933948830635261954-49573367394269636484214968646628303458289537513540288514.lzo.index
x ih-stats-prod-etl/processing/2017-09-14-49573367394269636484214968543492423935326260169499738114-49573367394269636484214968568758973565272018167196024834.lzo
(t7) MOVE ih-stats-prod-etl/processing/2017-09-14-49573367394269636484214968646824149441067107439842689026-49573367394269636484214968669317423240816906110763008002.lzo -> ih-stats-prod-archive/raw/2017-09-14-49573367394269636484214968646824149441067107439842689026-49573367394269636484214968669317423240816906110763008002.lzo
F, [2017-09-20T08:06:55.639000 #30738] FATAL -- :
I hope someone can shed some light on this issue, as I am no longer able to resume my ETL after weeks of successful runs, with no changes to my config or AWS setup.
It has happened the last three times I tried running the EMR runner, but at different stages (once while staging files and once while moving to archive, as above).
Seeing that the EMR runner had already processed the shredded/good and enriched/good files (step 7 from the Batch Pipeline Steps Matrix), I thought I could complete the remaining tasks manually. So I manually moved the files from processing to archive/raw (step 10) and ran the StorageLoader (steps 11 and 12). This seemed to work, and all enriched and shredded data started getting moved to their proper locations. BUT nothing got sent to my Redshift db!
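For reference, the manual move was essentially the following (a sketch only, using the AWS CLI; the prefixes are taken from the log output above):
# Step 10 by hand: move the raw files out of processing and into the archive
aws s3 mv s3://ih-stats-prod-etl/processing/ s3://ih-stats-prod-archive/raw/ --recursive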
So I must have missed some step here. Please help as I am currently missing almost 2 days of data from my db…
Perhaps I needed to run EmrEtlRunner with the --skip staging,emr,elasticsearch option (as per the error at step 10 from the Batch Pipeline Steps Matrix)? Would that have performed the missing steps and loaded the structured and unstructured event data into the RDB? However, that page does say that versions earlier than R92 do not load unstructured events and contexts.
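To be concrete, the command I have in mind is something like this (a sketch; the config and resolver paths are placeholders for my install):
# Resume from step 10 onwards, skipping the stages that already completed
./snowplow-emr-etl-runner \
  --config /path/to/config.yml \
  --resolver /path/to/iglu_resolver.json \
  --skip staging,emr,elasticsearch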
I had tried to update the version earlier, but that resulted in breakage and conflicts, so I kept this version. The platform has been stable for the past 5-6 months, until this happened.
UPDATE: I found that this error has been logged in the forums earlier as well. That error stemmed from the buckets and the EMR runner being in different regions, but I can confirm that all my buckets, the RDB and the runners are in the same region. Again, it seems to happen mostly when moving large amounts of data in step 10 of the pipeline (moving from processing to raw:archive). Assuming I can temporarily resume manually, what would be the command I would need to run? As per my message above, I tried it manually but no data got loaded into the RDB!
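In case it helps, this is a quick way to double-check the bucket regions (a sketch assuming the AWS CLI is available; only the two buckets visible in the logs above are listed, other pipeline buckets would be checked the same way):
# An empty/"None" LocationConstraint means us-east-1; every bucket should
# match the region configured for EmrEtlRunner and the EMR cluster
for b in ih-stats-prod-etl ih-stats-prod-archive; do
  echo -n "$b: "
  aws s3api get-bucket-location --bucket "$b" --query LocationConstraint --output text
done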
UPDATE 2: I found that many users of S3 have been seeing this message during large transfers, especially on that day (9/14). Here’s one forum thread describing it. Maybe it was a temporary glitch? If so, how can I resume from where the runner left off?
Since the previous run stopped at step 10, I was able to continue manually by running emr-etl-runner with the --skip staging,emr,elasticsearch option, followed by the StorageLoader script. This worked fine.
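Concretely, that was the EmrEtlRunner invocation sketched above, followed by the loader (the config path is a placeholder):
# Steps 11-12: load the shredded/enriched data into Redshift
./snowplow-storage-loader --config /path/to/config.yml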
Then, assuming all looked good and the S3 issue had been resolved, I let the cron job run on its regular schedule. The script stopped within a few minutes (in the Staging raw logs step) with a similar error:
Before this error, there were some messages about copying files over, e.g.:
Problem copying 2017-09-20-49573367394269636484215052110206816154814634871373168642-49573367394269636484215052127038690341309122093487489026.lzo. Retrying.
Problem copying 2017-09-20-49573367394269636484215052149009708186985399105508868098-49573367394269636484215052170030510338444577242613809154.lzo.index. Retrying.
Now I seem to be back to square one!
Any suggestions and guidance will be greatly appreciated. Thanks!
This is indeed extremely odd. Is there anything special about the box where you’re running EmrEtlRunner? What’s the output of java -version? Did you compile EmrEtlRunner yourself, by any chance?
Besides double-checking that everything (including the orchestration server) is within the same region, I can only recommend once again that you try to update to R92. In the latest versions we’ve replaced all of the Sluice data-moving logic with S3DistCp, and the exception you’re seeing happens exactly in Sluice, so I believe this should fix the problem. But I would really like to know what exactly is happening with your pipeline.
I do not have Java installed on the server (I am using an Elastic Beanstalk appliance to simplify my stack deployments). Again, this has all been running quite smoothly. I am using cronic to run the batch script snowplow-runner-and-loader.sh.
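The cron entry is essentially this (a sketch; the schedule and paths are placeholders for my actual setup):
# Nightly batch run; cronic only surfaces output (and triggers a cron email) when the job fails
0 3 * * * /usr/local/bin/cronic /opt/snowplow/snowplow-runner-and-loader.sh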
But currently I have been running emr-etl-runner manually to help debug the issue, and the error happens even then:
Will switching to R93 require me to update any other parts of the pipeline, such as the tracker, collectors, enrichers, Iglu, the kinesis-s3 scripts, configuration files, Redshift setup, etc.? I am not seeing much documentation about the upgrade process, which is slightly worrisome, especially given the major changes to the config and the removal of StorageLoader. I am worried this may break something else.