Multi region in buckets and EmrEtlRunner error

T_P · September 22, 2017, 4:51pm

Hi,

Because of latency issues, we needed to try a secondary collector in a different region.
Primary runs on eu-west-1, the secondary runs on ap-southeast-1.

s3:
  region: eu-west-1
  buckets:
    raw:
      in:
        - "s3n://elasticbeanstalk-eu-west-1-602XXX746X/resources/environments/logs/publish/e-bxvXXXd84p"
        - "s3n://elasticbeanstalk-ap-southeast-1-602XXX746X/resources/environments/logs/publish/e-vbmXXX2rwb"

The first IN bucket is fetched without any problem, as always. However, since the second bucket was added, running EmrEtlRunner returns this error:

The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.

I’ve tried using the indicated endpoint, along with some small changes, but without success.

Could anyone help?

TIA,

alex · September 22, 2017, 9:23pm

Yes, that won’t work - you are specifying the s3:region as eu-west-1 and then your second bucket is in a different region.
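
To make the mismatch concrete, here is a small Python sketch (not part of EmrEtlRunner) that compares the configured s3:region with the region embedded in each default Elastic Beanstalk bucket name. The function name and the regular expression are assumptions for illustration only; buckets that do not follow the elasticbeanstalk-<region>-<account> naming would need a real region lookup instead.

```python
import re
from urllib.parse import urlparse

# Assumption: default Elastic Beanstalk buckets are named
# elasticbeanstalk-<region>-<account-id>, so the region can be read from the name.
EB_BUCKET = re.compile(r"^elasticbeanstalk-([a-z]+(?:-[a-z]+)+-\d+)-")

def mismatched_buckets(configured_region, in_buckets):
    """Return (bucket, region) pairs whose region differs from the configured one."""
    mismatches = []
    for uri in in_buckets:
        bucket = urlparse(uri).netloc  # "s3n://bucket/path" -> "bucket"
        match = EB_BUCKET.match(bucket)
        if match and match.group(1) != configured_region:
            mismatches.append((bucket, match.group(1)))
    return mismatches

IN_BUCKETS = [
    "s3n://elasticbeanstalk-eu-west-1-602XXX746X/resources/environments/logs/publish/e-bxvXXXd84p",
    "s3n://elasticbeanstalk-ap-southeast-1-602XXX746X/resources/environments/logs/publish/e-vbmXXX2rwb",
]

print(mismatched_buckets("eu-west-1", IN_BUCKETS))
# [('elasticbeanstalk-ap-southeast-1-602XXX746X', 'ap-southeast-1')]
```

Only the second bucket is reported, which matches the behaviour described above: the first bucket is read normally and the request to the second one is rejected.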

T_P · September 25, 2017, 9:09am

@alex
Is there a workaround that avoids creating a new EMR instance in the secondary region?

alex · September 25, 2017, 9:36am

Right - you can set up a manual move of the files using $ aws s3 mv to get the files all into the same region.
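
As a rough sketch of what such a manual move could look like, the Python snippet below only builds the aws s3 mv command line so it can be reviewed before being run. The destination bucket (example-eu-west-1-staging) and the helper name are hypothetical; --recursive, --source-region and --region are standard AWS CLI options, with --region referring to the destination bucket.

```python
import shlex

def build_s3_mv_command(src_uri, dst_uri, src_region, dst_region):
    """Build the argument list for a recursive, cross-region `aws s3 mv`."""
    return [
        "aws", "s3", "mv", src_uri, dst_uri,
        "--recursive",                  # move every object under the prefix
        "--source-region", src_region,  # region of the bucket being read
        "--region", dst_region,         # region of the destination bucket
    ]

cmd = build_s3_mv_command(
    "s3://elasticbeanstalk-ap-southeast-1-602XXX746X/resources/environments/logs/publish/e-vbmXXX2rwb/",
    "s3://example-eu-west-1-staging/e-vbmXXX2rwb/",  # hypothetical destination bucket
    "ap-southeast-1",
    "eu-west-1",
)

# Print the command for review; pass `cmd` to subprocess.run to actually execute it.
print(shlex.join(cmd))
```

Note that the AWS CLI expects s3:// URIs rather than the s3n:// scheme used in the EmrEtlRunner configuration.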

T_P · September 27, 2017, 10:39am

@alex

I’ve gathered the logs from both regions into an intermediate bucket in the same region as the EMR.
During that process I had to rename the files (see below) so they don’t get overwritten, because the log files from the ap-southeast-1 and eu-west-1 regions have exactly the same names.

resources/environments/logs/publish/e-smgk4gppuv/i-1a5c92f8e77605a3d _var_log_tomcat8_rotated_localhost_access_log.txt1506423661 -> _var_log_tomcat8_rotated_localhost_access_log.2017-09-26-12.eu-west-1.i-1a5c92f8e77605a3d.txt.gz

Then I set the IN bucket to that intermediate location and started EmrEtlRunner, which appended the region and bucket folder to the file name:

MOVE tracking-snapshots/events/snowplow-raw/_var_log_tomcat8_rotated_localhost_access_log.2017-09-26-12.eu-west-1.i-1a5c92f8e77605a3d.txt.gz -> snowplow-bucket-data/processing/_var_log_tomcat8_rotated_localhost_access_log.2017-09-26-12.eu-west-1.i-1a5c92f8e77605a3d.txt.eu-west-1.snowplow-raw.gz

Which probably caused this error:

D, [2017-09-26T17:32:18.692000 #9405] DEBUG -- : EMR jobflow j-3QMDJ1Q1LCRTR started, waiting for jobflow to complete...
F, [2017-09-26T17:42:20.277000 #9405] FATAL -- : 

Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-3QMDJ1Q1LCRTR failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
Snowplow ETL: TERMINATING [STEP_FAILURE] ~ elapsed time n/a [2017-09-26 17:38:51 UTC - ]
 - 1. Elasticity Scalding Step: Enrich Raw Events: COMPLETED ~ 00:01:57 [2017-09-26 17:38:56 UTC - 2017-09-26 17:40:53 UTC]
 - 2. Elasticity S3DistCp Step: Enriched HDFS -> S3: FAILED ~ 00:00:14 [2017-09-26 17:40:53 UTC - 2017-09-26 17:41:08 UTC]
 - 3. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
 - 4. Elasticity Scalding Step: Shred Enriched Events: CANCELLED ~ elapsed time n/a [ - ]
 - 5. Elasticity S3DistCp Step: Enriched HDFS _SUCCESS -> S3: CANCELLED ~ elapsed time n/a [ - ]):

Edit: the EMR logs provide more information:

Input path does not exist: hdfs://ip-172-31-28-180.eu-west-1.compute.internal:8020/tmp/5a58078b-a2b2-4f4d-a0b8-8b90bf70e3bc/files

Any suggestion to bypass this issue?
How can files from different regions with the same timestamps live together?

TIA

BenFradet · September 27, 2017, 12:05pm

Can you give us the configured in buckets and their structure from your last message?

It seems to me like you have only one bucket?

Also, starting from R91, EmrEtlRunner doesn’t do any renaming of the Clojure log files, so as long as you have different filenames initially you should be fine. I’m assuming you’re using an earlier version, since the timestamps in the filenames seem to have changed format between raw and processing. I’d advise updating to R92 directly.

T_P · September 27, 2017, 1:29pm

Hi @BenFradet

Currently only one bucket:

  raw:
    in:
       - "s3n://tracking-snapshots/events/snowplow-raw"
 #     - "s3n://elasticbeanstalk-eu-west-1-602232737466/resources/environments/logs/publish/e-smgk4gppuv"
 #     - "s3n://elasticbeanstalk-ap-southeast-1-602232737466/resources/environments/logs/publish/e-vbm84x2rwb"

Until now, I’ve only used the eu-west-1 bucket and everything worked fine.
But since we need a few more collectors, I tried to use IN buckets from multiple regions but, as @alex said, that won’t work. So I created an intermediate bucket where I put all the logs for processing (tracking-snapshots/events/snowplow-raw), gathered from the locations commented out above.

I’ve renamed the log files since I have _var_log_tomcat8_rotated_localhost_access_log.txt1506423661.gz for both locations ap-southeast-1 and eu-west-1.

Thank you for your help!

BenFradet · September 27, 2017, 1:45pm

Is upgrading to R92 not an option for you?

If you did upgrade you could just copy the e-vbm84x2rwb ap-southeast-1 directory to s3n://elasticbeanstalk-eu-west-1-602232737466/resources/environments/logs/publish/ and have:

raw:
    in:
       - s3n://elasticbeanstalk-eu-west-1-602232737466/resources/environments/logs/publish/e-smgk4gppuv
       - s3n://elasticbeanstalk-eu-west-1-602232737466/resources/environments/logs/publish/e-vbm84x2rwb

as in buckets.

Additionally you wouldn’t need to do any renaming.

T_P · September 27, 2017, 2:01pm

So you think the problem is the renamed files?

I’ll have to consider an upgrade but, not knowing what the problem is, I’m a little afraid that it may not solve everything.

BenFradet · September 27, 2017, 2:26pm

Yes, R92 will not rename nor flatten the directory structure of your logs, so there won’t be any overwrite.
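
As a toy illustration of why keeping the directory structure avoids overwrites (this is not EmrEtlRunner code, and the instance IDs i-aaaa and i-bbbb are made up), compare flattened file names with full keys for two logs that share a name:

```python
import posixpath

def flattened_name(key):
    """Keep only the last path component, as a flattening move would."""
    return posixpath.basename(key)

# Two logs with the same file name, produced in different environments.
# The instance IDs are hypothetical placeholders.
keys = [
    "resources/environments/logs/publish/e-smgk4gppuv/i-aaaa/"
    "_var_log_tomcat8_rotated_localhost_access_log.txt1506423661.gz",
    "resources/environments/logs/publish/e-vbm84x2rwb/i-bbbb/"
    "_var_log_tomcat8_rotated_localhost_access_log.txt1506423661.gz",
]

flat = {flattened_name(k) for k in keys}  # names collide once flattened
full = set(keys)                          # full keys stay distinct

print(len(flat), len(full))  # 1 2
```

With the environment and instance directories kept in the key, both files can sit in the same bucket without any renaming.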

We encourage people to upgrade because of:

T_P · September 27, 2017, 4:35pm

It seems we’re stuck in the past (R77)
Looking forward to the most recent version.

Thank you!
