StorageLoader started consistently failing in our production Snowplow pipeline after we moved the processing to a different set of instances. Here’s what it looks like:
Loading works flawlessly when executed from my MacBook, so it’s definitely an environmental / configuration issue. I’ll keep digging, but I wanted to create a topic so that I can a) hear from anyone else who has faced the same issue and b) document the solution once I find it.
Thanks for confirming that you had MTU problems in the past, @alex. Were you able to fix those by allowing ICMP traffic to flow properly, or did you reconfigure the MTU on the worker machine(s)? Any guidance is greatly appreciated; allowing ICMP traffic did not fix the issue for me, so I’m still in trial-and-error mode.
Hey @rgabo - we just fixed those issues by letting ICMP traffic flow properly between the networks. In our experience, database connectivity problems have almost always traced back to either network or (database software) driver configuration, and very rarely to box-level settings.
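For anyone else reading along: on AWS, “letting ICMP flow” usually comes down to opening up ICMP in the security group in front of the affected instances so that Path MTU Discovery messages (ICMP type 3, “fragmentation needed”) can get back to them. The group ID and CIDR below are placeholders and the rule can be scoped down to type 3 only, but a sketch of the change looks roughly like this:

```
# Placeholder security group ID and CIDR; allow inbound ICMP to the worker
# instances so PMTU discovery ("fragmentation needed") messages get through.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --ip-permissions 'IpProtocol=icmp,FromPort=-1,ToPort=-1,IpRanges=[{CidrIp=10.0.0.0/8}]'
```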
It turns out it wasn’t an MTU issue but rather a TCP keepalive issue. Our Redshift cluster is not in the same VPC as the worker instances, and somewhere along the convoluted network path (Kubernetes with kubenet, Docker networking, a NAT gateway) the idle connection was dropped, which resulted in the terminated COPY queries.
Since we configured the TCP keepalive settings on the worker hosts according to Amazon’s guidance, loads have been progressing well. I don’t know whether reverting the MTU to 9001 on the host would also work, but I would guess yes, because we’ve seen the PMTU negotiated properly when running tracepath <your redshift host here>. That probably required the firewall tweak to let inbound ICMP traffic flow, though.
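For reference, the keepalive change was just kernel-level tuning on the worker hosts so that idle COPY connections are kept alive through the NAT/firewall hops. The values below are the ones I recall from AWS’s Redshift connection troubleshooting docs, so double-check them against the current documentation before applying:

```
# Tighten TCP keepalive so long-running, otherwise-idle Redshift connections
# are not silently dropped by intermediate NAT gateways or firewalls.
sudo sysctl -w net.ipv4.tcp_keepalive_time=200
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=200
sudo sysctl -w net.ipv4.tcp_keepalive_probes=5

# To persist across reboots, add the same keys to /etc/sysctl.conf:
#   net.ipv4.tcp_keepalive_time = 200
#   net.ipv4.tcp_keepalive_intvl = 200
#   net.ipv4.tcp_keepalive_probes = 5
```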
At any rate, it’s an annoying network issue that one shouldn’t have to deal with. I hope this thread helps others.