Multi region in buckets and EmrEtlRunner error

Hi,

Because of latency issues, we needed to try a secondary collector in a different region.
The primary runs in eu-west-1 and the secondary in ap-southeast-1.

s3:
  region: eu-west-1
  buckets:
    raw:
      in:
        - "s3n://elasticbeanstalk-eu-west-1-602XXX746X/resources/environments/logs/publish/e-bxvXXXd84p"
        - "s3n://elasticbeanstalk-ap-southeast-1-602XXX746X/resources/environments/logs/publish/e-vbmXXX2rwb"

The first in bucket is fetched without any problem, as always. Since the second bucket was added, running EmrEtlRunner returns this error:

The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.

I’ve tried using the indicated endpoint and made some small changes, but without success.

Would anyone be kind enough to help?

TIA,

Yes, that won’t work - you are specifying the s3:region as eu-west-1 and then your second bucket is in a different region.

@alex
Is there any workaround that can be tried without going through the creation of a new EMR instance in the secondary region?

Right - you can set up a manual move of the files using $ aws s3 mv to get the files all into the same region.

@alex

I’ve gathered the logs from both regions into an intermediate bucket in the same region as the EMR cluster.
During that process I had to rename the files (see below) so they don’t get overwritten, because the log files from ap-southeast-1 and eu-west-1 have exactly the same names.

resources/environments/logs/publish/e-smgk4gppuv/i-1a5c92f8e77605a3d _var_log_tomcat8_rotated_localhost_access_log.txt1506423661 -> _var_log_tomcat8_rotated_localhost_access_log.2017-09-26-12.eu-west-1.i-1a5c92f8e77605a3d.txt.gz
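A rename like this can be scripted rather than done by hand. The sketch below is only an illustration: the function name `region_tagged_name` is hypothetical, and it keeps the original epoch suffix instead of converting it to the date-hour form shown above, so the output format is an assumption and not something EmrEtlRunner prescribes.

```python
import re

# Matches rotated Tomcat access logs such as
# "_var_log_tomcat8_rotated_localhost_access_log.txt1506423661.gz"
_ROTATED = re.compile(r"^(?P<stem>.+)\.txt(?P<epoch>\d+)(?P<ext>\.gz)?$")


def region_tagged_name(filename, region, instance_id):
    """Return a file name that embeds the region and instance ID.

    Hypothetical naming scheme: <stem>.<epoch>.<region>.<instance>.txt[.gz]
    """
    match = _ROTATED.match(filename)
    if match is None:
        raise ValueError(f"unexpected log file name: {filename}")
    ext = match.group("ext") or ""
    return (
        f"{match.group('stem')}.{match.group('epoch')}"
        f".{region}.{instance_id}.txt{ext}"
    )
```

Because the region and instance ID become part of the name, two files that were rotated at the same second in different regions no longer collide in the intermediate bucket.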

Then I set the in bucket to that intermediate location and started EmrEtlRunner, which appended the region and the bucket folder to the file name:

MOVE tracking-snapshots/events/snowplow-raw/_var_log_tomcat8_rotated_localhost_access_log.2017-09-26-12.eu-west-1.i-1a5c92f8e77605a3d.txt.gz -> snowplow-bucket-data/processing/_var_log_tomcat8_rotated_localhost_access_log.2017-09-26-12.eu-west-1.i-1a5c92f8e77605a3d.txt.eu-west-1.snowplow-raw.gz

Which probably caused this error:

D, [2017-09-26T17:32:18.692000 #9405] DEBUG -- : EMR jobflow j-3QMDJ1Q1LCRTR started, waiting for jobflow to complete...
F, [2017-09-26T17:42:20.277000 #9405] FATAL -- : 

Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-3QMDJ1Q1LCRTR failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
Snowplow ETL: TERMINATING [STEP_FAILURE] ~ elapsed time n/a [2017-09-26 17:38:51 UTC - ]
 - 1. Elasticity Scalding Step: Enrich Raw Events: COMPLETED ~ 00:01:57 [2017-09-26 17:38:56 UTC - 2017-09-26 17:40:53 UTC]
 - 2. Elasticity S3DistCp Step: Enriched HDFS -> S3: FAILED ~ 00:00:14 [2017-09-26 17:40:53 UTC - 2017-09-26 17:41:08 UTC]
 - 3. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
 - 4. Elasticity Scalding Step: Shred Enriched Events: CANCELLED ~ elapsed time n/a [ - ]
 - 5. Elasticity S3DistCp Step: Enriched HDFS _SUCCESS -> S3: CANCELLED ~ elapsed time n/a [ - ]):

Edit: the EMR logs provide more info:

Input path does not exist: hdfs://ip-172-31-28-180.eu-west-1.compute.internal:8020/tmp/5a58078b-a2b2-4f4d-a0b8-8b90bf70e3bc/files

Any suggestions for getting around this issue?
How can files from different regions with the same timestamps live together?

TIA

Can you give us the configured in buckets and their structure from your last message?

It seems to me like you have only one bucket?

Also, starting from R91, EmrEtlRunner doesn’t do any renaming of the Clojure collector log files, so as long as the filenames are different to begin with, you should be fine. I’m assuming you’re using an earlier version, since the timestamps in the filenames seem to have changed format between raw and processing. I’d advise updating to R92 directly.

Hi @BenFradet

Currently only one bucket:

  raw:
    in:
       - "s3n://tracking-snapshots/events/snowplow-raw"
 #     - "s3n://elasticbeanstalk-eu-west-1-602232737466/resources/environments/logs/publish/e-smgk4gppuv"
 #     - "s3n://elasticbeanstalk-ap-southeast-1-602232737466/resources/environments/logs/publish/e-vbm84x2rwb" 

Until now, I’ve only used the eu-west-1 bucket and everything worked fine.
But since we need a few more collectors, I tried to use in buckets from multiple regions, which, as @alex said, won’t work. So I’ve created an intermediate bucket (tracking-snapshots/events/snowplow-raw) where I put all the logs for processing, gathered from the locations commented out above.

I’ve renamed the log files since I have _var_log_tomcat8_rotated_localhost_access_log.txt1506423661.gz for both locations ap-southeast-1 and eu-west-1.

Thank you for your help!

Is upgrading to R92 not an option for you?

If you did upgrade you could just copy the e-vbm84x2rwb ap-southeast-1 directory to s3n://elasticbeanstalk-eu-west-1-602232737466/resources/environments/logs/publish/ and have:

raw:
    in:
       - s3n://elasticbeanstalk-eu-west-1-602232737466/resources/environments/logs/publish/e-smgk4gppuv
       - s3n://elasticbeanstalk-eu-west-1-602232737466/resources/environments/logs/publish/e-vbm84x2rwb

as in buckets.

Additionally you wouldn’t need to do any renaming.
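A minimal sketch of such a copy, assuming it is done with boto3 and not the aws CLI. The function name `copy_prefix` and its parameters are hypothetical; the clients are passed in so that the source one can point at ap-southeast-1 and the destination one at eu-west-1. Relative paths (including the per-instance directories) are preserved, which is what keeps the file names from clashing.

```python
def copy_prefix(src_client, dst_client, src_bucket, src_prefix,
                dst_bucket, dst_prefix):
    """Copy every object under src_prefix into dst_bucket/dst_prefix.

    The path below src_prefix is kept as-is, so nothing is renamed
    or flattened. Returns the list of destination keys.
    """
    copied = []
    paginator = src_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src_bucket, Prefix=src_prefix):
        # A page with no matching objects has no "Contents" entry
        for obj in page.get("Contents", []):
            relative = obj["Key"][len(src_prefix):].lstrip("/")
            dst_key = dst_prefix.rstrip("/") + "/" + relative
            # SourceClient lets the destination-region client read
            # object metadata from the source region
            dst_client.copy(
                {"Bucket": src_bucket, "Key": obj["Key"]},
                dst_bucket,
                dst_key,
                SourceClient=src_client,
            )
            copied.append(dst_key)
    return copied
```

With boto3 installed, the two clients could be created with `boto3.client("s3", region_name="ap-southeast-1")` and `boto3.client("s3", region_name="eu-west-1")`, and the prefixes would be the `resources/environments/logs/publish/e-vbm84x2rwb` paths from the configuration above.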

So you think the problem is the renamed files?

I’ll have to consider an upgrade, but without knowing what the problem is, I’m a little afraid that it may not solve everything.

Yes, R92 will neither rename your logs nor flatten their directory structure, so nothing will be overwritten.

We encourage people to upgrade because of:

It seems we’re stuck in the past (R77) :sweat_smile:
Looking forward to the most recent version.

Thank you!