Snowplow & Snowflake in different AWS regions

Happy weekend everyone!

I have reached the last stage of setting up Snowplow for the first time and got the EMR cluster with Snowflake transformer & loader running via dataflow-runner. However, the transformer job failed after less than half a minute complaining about a bucket being in the wrong region:

User class threw exception: 
shadeaws.services.s3.model.AmazonS3Exception: 
The bucket is in this region: us-west-2. 
Please use this region to retry the request (Service: Amazon S3; Status Code: 301; 
Error Code: PermanentRedirect; Request ID: ED792079BBB7BBF2; 
S3 Extended Request ID: Er/uKjOOYopiHZ0eoY4n8XCU7gPm4Ww1QVgyKVfkinKnJFcZhP17KbNlLMQotUbB+eiNj23ExC4=), S3 Extended Request ID: Er/uKjOOYopiHZ0eoY4n8XCU7gPm4Ww1QVgyKVfkinKnJFcZhP17KbNlLMQotUbB+eiNj23ExC4=

I suspect that this could be about our Snowflake warehouse being located in us-west-2 and the Snowplow pipeline being in us-west-1. But I don’t understand which bucket it’s trying to access in us-west-2, especially not in the transformer step. The enriched bucket, the bucket for transformer output, and the bucket for ETL logs are all in us-west-1. The only other bucket I suspect could be involved is the Snowplow hosted assets one, from which the transformer and loader JARs are fetched.

Short of setting up the pipeline again in us-west-2 (which I might want to do anyway, just to have everything in the same region), I don’t really know what to do here. Has anyone seen this before?

@boba, the error seems to indicate that the bucket location (us-west-2) is different from the region specified in your Snowflake Loader configuration file (the value of awsRegion). The S3 bucket location and snowflakeRegion do not have to be the same region, but the configuration file has to reflect the actual regions where your resources are deployed.
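For reference, here is a rough sketch of where the two region settings sit in the Snowflake Loader configuration (a self-describing JSON). Only awsRegion and snowflakeRegion matter for this point; the schema version and all other field names and values below are placeholders from memory and may not match your loader version exactly:

{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/snowflake_config/jsonschema/1-0-2",
  "data": {
    "name": "Snowflake target",
    "awsRegion": "us-west-1",
    "snowflakeRegion": "us-west-2",
    "manifest": "snowplow-snowflake-manifest",
    "input": "s3://my-snowplow-enriched/",
    "stageUrl": "s3://my-snowplow-transformed/",
    "database": "snowplow_db",
    "warehouse": "snowplow_wh",
    "schema": "atomic",
    "account": "acme",
    "username": "loader",
    "password": "redacted",
    "purpose": "ENRICHED_EVENTS"
  }
}

Here awsRegion must be the region of the EMR cluster and the S3 buckets it reads and writes, while snowflakeRegion is the region of the Snowflake account and can legitimately be a different one.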

Hi @ihor, thanks for the reply! I just double-checked that all the buckets I created are in us-west-1 (which is where I’m running the EMR cluster). The only bucket I see in the configs which could be located in a different region is the snowplow-hosted-assets one:

s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-transformer-0.5.0.jar

I don’t have any snowplow-related buckets in us-west-2, so I’m a bit lost where this error comes from.

@boba, I missed that you are talking about the Snowflake Transformer, not the Loader. I think you need to check the locations of both application JARs in your Snowflake playbook. If you run the job from us-west-1, then update the hosted assets bucket to s3://snowplow-hosted-assets-us-west-1/... as well.
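To make that concrete, here is a sketch of what the transformer step in a dataflow-runner playbook could look like once it points at the region-specific assets bucket. The step structure, class name, and argument handling are illustrative and from memory, so only take the bucket name as the actual point:

{
  "type": "CUSTOM_JAR",
  "name": "Snowflake Transformer",
  "actionOnFailure": "CANCEL_AND_WAIT",
  "jar": "command-runner.jar",
  "arguments": [
    "spark-submit",
    "--deploy-mode", "cluster",
    "--class", "com.snowplowanalytics.snowflake.transformer.Main",
    "s3://snowplow-hosted-assets-us-west-1/4-storage/snowflake-loader/snowplow-snowflake-transformer-0.5.0.jar",
    "--config", "<base64-encoded loader config>",
    "--resolver", "<base64-encoded Iglu resolver>"
  ]
}

The same applies to the loader step: any s3://snowplow-hosted-assets/... path in the playbook should become s3://snowplow-hosted-assets-us-west-1/... when the cluster runs in us-west-1.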


Hi @ihor, I was not aware that there are region-specific buckets for the hosted assets; this seems to have fixed it! Thanks for the support :)

I am not sure if this is still related to my original problem, but the transformer application is still failing after about 40 seconds. Unfortunately, the error in stderr doesn’t tell me much about what’s going on:

20/02/25 00:44:01 INFO Client: Application report for application_1582591242040_0001 (state: RUNNING)
20/02/25 00:44:02 INFO Client: Application report for application_1582591242040_0001 (state: FINISHED)
20/02/25 00:44:02 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 172.31.11.112
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1582591421312
	 final status: FAILED
	 tracking URL: http://ip-172-31-11-215.us-west-1.compute.internal:20888/proxy/application_1582591242040_0001/
	 user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1582591242040_0001 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/02/25 00:44:02 INFO ShutdownHookManager: Shutdown hook called
20/02/25 00:44:02 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-46e533e4-5cb6-476f-ab96-568bcca0e4ee
Command exiting with ret '1'

I’ve found similar errors in many threads in this forum, but none of the solutions seemed applicable to me. The controller log tells me even less:

2020-02-25T00:43:27.026Z INFO HadoopJarStepRunner.Runner: startRun() called for s-Z68L2AITEFM8 Child Pid: 8649
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 1 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 40 seconds
2020-02-25T00:44:05.296Z INFO Step created jobs: 
2020-02-25T00:44:05.297Z WARN Step failed with exitCode 1 and took 40 seconds

Happy to provide more details (playbook, cluster, config), but I’m not sure what would be most helpful for understanding this.

One thing I noticed is that the snowplow-snowflake-manifest table in DynamoDB is still empty.

Hi @boba,

The original error will be somewhere in the YARN container logs under $EMRLOGS/j-$JOBID/containers/application_1582591242040_0001.
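If it helps, one way to pull those container logs down and search them for the real exception (using $EMRLOGS and $JOBID as placeholders for your EMR log URI and cluster id, as above):

# download the YARN container logs for the failed application
aws s3 sync $EMRLOGS/j-$JOBID/containers/application_1582591242040_0001/ ./app-logs/

# EMR gzips the logs; the actual stack trace usually sits in a container's stderr
find ./app-logs -name 'stderr.gz' | while read f; do echo "== $f"; zcat "$f" | grep -i -A 5 "exception"; done

The driver container’s stderr is normally where the user-class exception from the transformer shows up.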