Load into Redshift fails from EC2

rgabo · May 15, 2017, 7:04am

Hello Snowplowers,

StorageLoader has been failing consistently in our production Snowplow pipeline since we moved the processing to a different set of instances. Here’s what it looks like:

The following have changed:

Pipeline now uses Airflow instead of Luigi for orchestration (should not matter)
StorageLoader runs in a separate VPC and talks to Redshift through VPC peering
EC2 instances running the Docker image use a different AMI

After a lot of trial and error and searching the interwebs, I stumbled upon the following issue which is my best bet right now: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-drop-issues.html

Loading works flawlessly when executed from my MacBook so it’s definitely an environmental / configuration issue. I’ll keep trying but wanted to create a topic so that I can a) hear from anyone else who faced the same issue b) document the solution once I find it.

alex · May 15, 2017, 8:31am

Hi @rgabo - yes, I suspect that is your problem. We have had MTU problems in the past before we correctly configured those VPCs…

rgabo · May 15, 2017, 12:01pm

Thanks for confirming that you had MTU problems in the past, @alex. Were you able to fix those by allowing ICMP traffic to flow properly, or did you reconfigure the MTU on the worker machine(s)? Any guidance is greatly appreciated. I was not able to fix the issue by allowing ICMP traffic, so I’m still in trial and error mode.

alex · May 15, 2017, 12:16pm

Hey @rgabo - we just fixed those issues by letting ICMP traffic flow around the networks properly. In our experience, problems around database connectivity have always traced back to either network or (database software) driver configuration, very rarely box-related settings.

rgabo · May 15, 2017, 3:34pm

tl;dr: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-firewall-guidance.html

It seems that it wasn’t an MTU issue but rather a TCP keepalive issue. Our Redshift cluster is not in the same VPC as the worker instances, and somewhere in the weird network topology, which includes Kubernetes (kubenet), Docker networking and a NAT gateway, the connection was dropped, which resulted in the terminated COPY queries.

Since the TCP keepalive settings were configured on the worker hosts according to Amazon’s guidance, the load seems to be progressing well. I do not know if reverting the MTU to 9001 on the host would work, but I would guess yes, because we’ve seen the path MTU negotiated properly when running tracepath <your redshift host here>, although that probably needed the firewall tweak to let inbound ICMP traffic flow.
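For anyone who wants to check a worker host before changing anything, here is a rough Python sketch that compares the kernel's current TCP keepalive settings with a set of target values. The targets (200 / 200 / 5) are what I recall from the Linux section of Amazon's guidance linked above, so treat them as an assumption and verify them against that page. The helper names `read_keepalive` and `find_mismatches` are made up for this example, and the script assumes a Linux host with the settings exposed under /proc/sys/net/ipv4.

```python
from pathlib import Path

# Assumed targets, recalled from Amazon's Linux guidance; verify before relying on them.
RECOMMENDED = {
    "tcp_keepalive_time": 200,    # idle seconds before the first keepalive probe
    "tcp_keepalive_intvl": 200,   # seconds between probes
    "tcp_keepalive_probes": 5,    # unanswered probes before the connection is dropped
}


def read_keepalive(proc_dir="/proc/sys/net/ipv4"):
    """Read the current keepalive settings; a value is None if it cannot be read."""
    current = {}
    for name in RECOMMENDED:
        try:
            current[name] = int((Path(proc_dir) / name).read_text().strip())
        except (OSError, ValueError):
            current[name] = None
    return current


def find_mismatches(current, recommended=RECOMMENDED):
    """Return {setting: (current_value, target_value)} for every setting that differs."""
    return {
        name: (current.get(name), target)
        for name, target in recommended.items()
        if current.get(name) != target
    }


if __name__ == "__main__":
    mismatches = find_mismatches(read_keepalive())
    if not mismatches:
        print("keepalive settings match the targets")
    for name, (have, want) in mismatches.items():
        print(f"{name}: current={have} target={want}")
```

Run it on the host itself (or in a container that shares the host's network namespace), since those are the values that apply to connections opened from there. Anything it reports can then be changed with sysctl or by writing to the corresponding file under /proc/sys/net/ipv4.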

At any rate, it’s quite an annoying network issue that one shouldn’t have to deal with. I hope this thread will help others.
