OK… checking the logs from the first time this happened, I see a more detailed Excon error:
Excon::Error::ServiceUnavailable (Expected(200) <=> Actual(503 Service Unavailable)
excon.error.response
:body => "<Error><Code>SlowDown</Code><Message>Please reduce your request rate.</Message><RequestId>3F7E4F0929844761</RequestId><HostId>r4/0CYAG09ZaxxiSyXZJhAAEWGygJzwMGY30TPFj+pH4+LeIbkfXZKDTtsfxwbcpbS4PHu7DaP8=</HostId></Error>"
:cookies => [
]
:headers => {
"Connection" => "close"
"Content-Type" => "application/xml"
"Date" => "Thu, 14 Sep 2017 19:05:43 GMT"
"Server" => "AmazonS3"
"x-amz-id-2" => "r4/0CYAG09ZaxxiSyXZJhAAEWGygJzwMGY30TPFj+pH4+LeIbkfXZKDTtsfxwbcpbS4PHu7DaP8="
"x-amz-request-id" => "3F7E4F0929844761"
}
:host => "in-stats-prod-archive.s3.amazonaws.com"
:local_address => "<internal-ip-of-emr-runner>"
:local_port => 44096
:path => "/raw/2017-09-14-49573367394269636484214968494006254435221012695805329410-49573367394269636484214968517991342696375263768313069570.lzo.index"
:port => 443
:reason_phrase => "Slow Down"
:remote_ip => "52.216.230.139"
:status => 503
:status_line => "HTTP/1.1 503 Slow Down\r\n"
):
uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/expects.rb:7:in `response_call'
uri:classloader:/gems/excon-0.52.0/lib/excon/middlewares/response_parser.rb:9:in `response_call'
uri:classloader:/gems/excon-0.52.0/lib/excon/connection.rb:388:in `response'
uri:classloader:/gems/excon-0.52.0/lib/excon/connection.rb:252:in `request'
uri:classloader:/gems/fog-xml-0.1.2/lib/fog/xml/sax_parser_connection.rb:35:in `request'
uri:classloader:/gems/fog-xml-0.1.2/lib/fog/xml/connection.rb:7:in `request'
uri:classloader:/gems/fog-1.25.0/lib/fog/aws/storage.rb:521:in `_request'
uri:classloader:/gems/fog-1.25.0/lib/fog/aws/storage.rb:516:in `request'
uri:classloader:/gems/fog-1.25.0/lib/fog/aws/requests/storage/copy_object.rb:32:in `copy_object'
uri:classloader:/gems/fog-1.25.0/lib/fog/aws/models/storage/file.rb:92:in `copy'
uri:classloader:/gems/sluice-0.4.0/lib/sluice/storage/s3/s3.rb:622:in `block in retry_x'
org/jruby/ext/timeout/Timeout.java:117:in `timeout'
uri:classloader:/gems/sluice-0.4.0/lib/sluice/storage/s3/s3.rb:621:in `retry_x'
uri:classloader:/gems/sluice-0.4.0/lib/sluice/storage/s3/s3.rb:548:in `block in process_files'
org/jruby/RubyKernel.java:1295:in `loop'
uri:classloader:/gems/sluice-0.4.0/lib/sluice/storage/s3/s3.rb:412:in `block in process_files'
Problem destroying processing/2017-09-14-49573367394269636484214968352265746714503765354478567426-49573367394269636484214968375920798226903222662291521538.lzo.index. Retrying.
Problem destroying processing/2017-09-14-49573367394269636484214968518667132229539841820571205634-49573367394269636484214968543292951175089846286953742338.lzo. Retrying.
Error running EmrEtlRunner, exiting with return code 1. StorageLoader not run
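For context on what is failing there: the trace shows Sluice's retry_x wrapping fog's copy_object, and S3 answering with 503 SlowDown, i.e. it is throttling the request rate on that prefix rather than failing outright. The usual remedy for SlowDown is exponential backoff with jitter around the S3 call; here is a rough sketch of that idea only (it is not the actual Sluice/fog code above, and with_s3_backoff is just a made-up helper name):

require 'excon'

# Retry a block when S3 throttles it with 503 SlowDown, backing off
# exponentially (2s, 4s, 8s, ...) plus a little random jitter each time.
def with_s3_backoff(max_attempts: 6)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue Excon::Error::ServiceUnavailable
    raise if attempts >= max_attempts
    sleep((2 ** attempts) + rand)
    retry
  end
end

# Hypothetical usage: wrap the copy that fails in the trace above, e.g.
#   with_s3_backoff { file.copy(target_bucket, file.key) }

I mention it only to show the shape of the problem; if Sluice's own retries are being exhausted here, the backoff window is presumably just too short for the volume of files being moved.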
So it does seem related to the AWS setup in some way? I hope this helps in understanding the issue…
The EmrEtlRunner version I am currently using is r83 (Bald Eagle):
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_emr_r83_bald_eagle.zip
I had tried to update to a newer version earlier, but that resulted in breakages and conflicts, so I kept this version. The platform had been stable for the past 5-6 months until this happened.
UPDATE: I found that this error has been reported on the forums earlier as well. That error stemmed from the buckets and the EMR runner being in different regions, but I can confirm that all my buckets, rdb and runners are in the same region. Again, it seems to happen mostly when moving large amounts of data in step 10 of the pipeline (moving from processing to raw:archive). Assuming I can temporarily resume manually, what would be the command I would need to run? As per my message above, I tried it manually but no data got loaded into the rdb!
UPDATE 2: I found that many S3 users have been seeing this message during large transfers, especially on that day (9/14). Here's one forum thread describing it. Maybe it was a temporary glitch? If so, how can I resume from where the runner left off?
Thanks again!