Today I had the strangest of problems. In a VMware-based testbed with a mix of systems (F5 virtual appliances, a Linux host, and three Solaris servers), I was facing severe connectivity issues with the Solaris hosts. Specifically, with all systems connected on VLAN 162 (L3 addressing: 172.16.2.0/24), anything TCP-related from the Solaris hosts failed. The F5 and Linux virtual machines had no problem whatsoever.
I quickly fired up my trusty tcpdump to figure out what’s wrong. Then I sent a simple ICMP ping from a Solaris host to the load balancer to see what happens:
    solaris-1# ping 172.16.2.1
    172.16.2.1 is alive

    [root@loadbalancer-1:Active] config # tcpdump -q -n -i ingress-lb not arp
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on ingress-lb, link-type EN10MB (Ethernet), capture size 108 bytes
    19:43:05.048167 IP 172.16.2.21 > 172.16.2.1: ICMP echo request, id 12766, seq 0, length 64
    19:43:05.048215 IP 172.16.2.1 > 172.16.2.21: ICMP echo reply, id 12766, seq 0, length 64
Nice. ICMP works, and everything looks fine in the packet capture. Now let’s try some TCP traffic for a change:
    solaris-1:/root# telnet 172.16.2.1 22
    Trying 172.16.2.1...

    [root@loadbalancer-1:Active] config # tcpdump -q -n -i ingress-lb not arp
    19:44:06.816663 IP 172.16.233.49.windb > 172.16.2.1.ssh: tcp 0
    19:44:07.949006 IP 172.16.233.48.windb > 172.16.2.1.ssh: tcp 0
    19:44:10.215576 IP 172.16.233.47.windb > 172.16.2.1.ssh: tcp 0
    19:44:14.730324 IP 172.16.233.46.windb > 172.16.2.1.ssh: tcp 0
    19:44:23.739898 IP 172.16.85.195.windb > 172.16.2.1.ssh: tcp 0
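With five packets the odd source addresses are easy to eyeball, but in a longer capture a small filter helps. A minimal sketch of such a filter could look like this; `flag_foreign_sources` is a hypothetical helper name, and it assumes the `tcpdump -q -n` line layout shown above and a /24 subnet passed in as its first three octets.

```shell
#!/bin/sh
# Hypothetical helper (not part of tcpdump): reads `tcpdump -q -n` lines on
# stdin and prints every IPv4 packet whose source lies outside the given /24.
# $1 is the first three octets of the expected subnet, e.g. 172.16.2
flag_foreign_sources() {
    prefix="$1"
    awk -v prefix="$prefix" '
        $2 == "IP" {
            # Field 3 is "a.b.c.d" for ICMP or "a.b.c.d.port" for TCP/UDP.
            n = split($3, octet, "[.]")
            if (n < 4) next
            net = octet[1] "." octet[2] "." octet[3]
            if (net != prefix)
                print "unexpected source " net "." octet[4] " at " $1
        }
    '
}
```

Something like `tcpdump -l -q -n -i ingress-lb not arp | flag_foreign_sources 172.16.2` should then print only the packets whose source falls outside 172.16.2.0/24 (`-l` makes tcpdump line-buffer its output so the pipe sees packets as they arrive).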
D’oh. The packet reaches the load balancer all right, but the source IP is corrupted: instead of solaris-1’s 172.16.2.21, the connection attempts arrive from addresses like 172.16.233.49, a different one on every attempt. Googling didn’t really help; other people had run into this or similar issues, but nobody had posted a solution. Pinging my skilled colleague Leonidas didn’t help either; he was as baffled as I was. And then it hit me: checksum offload.
    solaris-1# echo "Clickety-click; disabling checksum offload" && echo "set ip:dohwcksum=0" >> /etc/system
    Clickety-click; disabling checksum offload

    solaris-1:/root# telnet 172.16.2.1 22
    Trying 172.16.2.1...
    Connected to 172.16.2.1.
    Escape character is '^]'.
    SSH-2.0-OpenSSH_4.3
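One caveat with the one-liner above: `>>` appends blindly, so running it twice leaves two copies of the line in /etc/system. A minimal sketch of a more careful variant follows; `ensure_system_setting` is a hypothetical helper name, and it assumes a POSIX grep that understands `-F` and `-x` (on Solaris that may mean /usr/xpg4/bin/grep rather than the default one). Also keep in mind that /etc/system is read at boot, so a reboot is presumably needed before the setting takes effect; the transcript above does not show that step.

```shell
#!/bin/sh
# Hypothetical helper: append a setting line to a system file only if that
# exact line is not already present.
# $1 = file to edit (e.g. /etc/system), $2 = the full line to ensure
ensure_system_setting() {
    file="$1"
    setting="$2"
    # -F: treat the setting as a fixed string, -x: match whole lines only
    if grep -Fx -- "$setting" "$file" >/dev/null 2>&1; then
        echo "already present: $setting"
    else
        echo "$setting" >> "$file"
        echo "added: $setting"
    fi
}

# Example (run as root on the Solaris host):
#   ensure_system_setting /etc/system "set ip:dohwcksum=0"
```

Taking the file as a parameter makes it easy to try the function on a scratch copy before touching the real /etc/system.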
Uh! The joy! Too bad that two minutes after I figured this out, Leonidas had signed off for the day, so I can only brag about it on my blog 🙂