StorageLoader has been failing consistently in our production Snowplow pipeline since we moved the processing to a different set of instances. Here’s what it looks like:
The following have changed:
- Pipeline now uses Airflow instead of Luigi for orchestration (should not matter)
- StorageLoader runs in a separate VPC and talks to Redshift through VPC peering
- EC2 instances running the Docker image use a different AMI
After a lot of trial and error and searching the interwebs, I stumbled upon the following issue which is my best bet right now: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-drop-issues.html
Loading works flawlessly when executed from my MacBook so it’s definitely an environmental / configuration issue. I’ll keep trying but wanted to create a topic so that I can a) hear from anyone else who faced the same issue b) document the solution once I find it.
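For anyone wanting to rule out the MTU angle from that page quickly, here is a minimal sketch that reads an interface's MTU from sysfs on a Linux worker and flags jumbo frames. The interface name `eth0`, the helper names and the 1500-byte threshold are assumptions for illustration, not something taken from our setup; check the linked AWS page for the value it actually recommends.

```python
from pathlib import Path

STANDARD_MTU = 1500  # assumed threshold: standard Ethernet MTU


def read_mtu(iface="eth0", sys_dir="/sys/class/net"):
    """Return the MTU of a network interface as reported by Linux sysfs."""
    return int((Path(sys_dir) / iface / "mtu").read_text().strip())


def exceeds_standard_mtu(mtu, limit=STANDARD_MTU):
    """True if the interface sends frames larger than the assumed limit."""
    return mtu > limit


if __name__ == "__main__":
    mtu = read_mtu("eth0")  # hypothetical interface name
    if exceeds_standard_mtu(mtu):
        print(f"eth0 MTU is {mtu}; packets above {STANDARD_MTU} bytes rely on "
              "path MTU discovery, which needs ICMP to get through")
    else:
        print(f"eth0 MTU is {mtu}")
```

Run it on the worker host and again inside the container, since Docker networking can present a different MTU than the host interface.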
Hi @rgabo - yes, I suspect that is your problem. We have had MTU problems in the past before we correctly configured those VPCs…
Thanks for confirming that you had MTU problems in the past, @alex. Were you able to fix those by allowing ICMP traffic to flow properly, or did you reconfigure the MTU on the worker machine(s)? Any guidance is greatly appreciated. I was not able to fix the issue by allowing ICMP traffic, so I’m still in trial-and-error mode.
Hey @rgabo - we just fixed those issues by letting ICMP traffic flow around the networks properly. In our experience, problems around database connectivity have always traced back to either network or (database software) driver configuration, very rarely box-related settings.
It seems that it wasn’t an MTU issue but rather a TCP keepalive issue. Our Redshift cluster is not in the same VPC as the worker instances, and somewhere in the weird network topology, which includes Kubernetes (kubenet), Docker networking and a NAT gateway, the connection was dropped, which resulted in the terminated COPY queries.
Since the TCP keepalive settings were configured on the worker hosts according to Amazon’s guidance, the load seems to be progressing well. I do not know if reverting the MTU to 9001 on the host would work, but I would guess yes, because we’ve seen the path MTU (pmtu) negotiated properly when running tracepath <your redshift host here>, although that probably needed the firewall tweak to let inbound ICMP traffic flow.
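To make the keepalive reasoning concrete: while a long COPY runs, the client connection carries no traffic, so any hop with an idle timeout can silently drop it unless the kernel sends keepalive probes first. Below is a rough sketch that reads the Linux keepalive settings and checks whether the first probe would go out before an assumed idle timeout. The 350-second figure is the idle timeout AWS documents for NAT gateways and is used here only as an assumption, since we never established which hop dropped the connection; the helper names are made up for this example.

```python
from pathlib import Path

KEEPALIVE_KEYS = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")

# Assumption: idle timeout of the strictest hop on the path, in seconds.
ASSUMED_IDLE_TIMEOUT_S = 350


def read_keepalive(proc_dir="/proc/sys/net/ipv4"):
    """Return the kernel's current TCP keepalive settings as integers."""
    base = Path(proc_dir)
    return {key: int((base / key).read_text().strip()) for key in KEEPALIVE_KEYS}


def survives_idle_timeout(settings, idle_timeout_s):
    """True if the first keepalive probe is sent before the idle timeout expires."""
    return settings["tcp_keepalive_time"] < idle_timeout_s


if __name__ == "__main__":
    current = read_keepalive()
    if survives_idle_timeout(current, ASSUMED_IDLE_TIMEOUT_S):
        print(f"OK: first probe after {current['tcp_keepalive_time']}s of idle time")
    else:
        print(f"Risky: first probe only after {current['tcp_keepalive_time']}s, "
              f"later than the assumed {ASSUMED_IDLE_TIMEOUT_S}s idle timeout")
```

With the stock Linux default of 7200 seconds for tcp_keepalive_time, the check reports the risky case. Keep in mind these sysctls only matter for sockets that enable SO_KEEPALIVE, and that containers may have their own network namespace, so the values seen inside a container can differ from those on the host.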
At any rate, this is quite an annoying network issue that one shouldn’t have to deal with. I hope the thread helps others.