Posts Tagged ‘dayjob’

Solaris, VMware & VGT mode

February 16, 2012

Today I had the strangest of problems. In a VMware-based testbed with a bunch of mixed systems (F5 virtual appliances, a Linux host, 3 Solaris servers) I was facing severe connectivity issues with the Solaris hosts. Specifically, with all systems connected on VLAN 162 (L3 addressing: 172.16.2.0/24), anything TCP-related from the Solaris hosts failed. The F5 and Linux virtual machines had no problem whatsoever.

I quickly fired up my trusty tcpdump to figure out what was wrong. Then I sent a simple ICMP ping from a Solaris host to the load balancer to see what would happen:

solaris-1# ping 172.16.2.1
172.16.2.1 is alive


[root@loadbalancer-1:Active] config # tcpdump -q -n -i ingress-lb not arp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ingress-lb, link-type EN10MB (Ethernet), capture size 108 bytes
19:43:05.048167 IP 172.16.2.21 > 172.16.2.1: ICMP echo request, id 12766, seq 0, length 64
19:43:05.048215 IP 172.16.2.1 > 172.16.2.21: ICMP echo reply, id 12766, seq 0, length 64

Nice. ICMP works. Everything looks fine in the packet capture. Now let’s try some TCP traffic for a change:

solaris-1:/root# telnet 172.16.2.1 22
Trying 172.16.2.1...


[root@loadbalancer-1:Active] config # tcpdump -q -n -i ingress-lb not arp
19:44:06.816663 IP 172.16.233.49.windb > 172.16.2.1.ssh: tcp 0
19:44:07.949006 IP 172.16.233.48.windb > 172.16.2.1.ssh: tcp 0
19:44:10.215576 IP 172.16.233.47.windb > 172.16.2.1.ssh: tcp 0
19:44:14.730324 IP 172.16.233.46.windb > 172.16.2.1.ssh: tcp 0
19:44:23.739898 IP 172.16.85.195.windb > 172.16.2.1.ssh: tcp 0

D’oh. The packets reach the load balancer all right, but the source IP is corrupted. Googling didn’t really help: other people have run into this or similar issues, but nobody had posted a solution. Pinging my skilled colleague Leonidas didn’t help either; he was as baffled by what was happening as I was. And then it hit me.

solaris-1# echo "Clickety-click; disabling checksum offload" && echo "set ip:dohwcksum=0" >> /etc/system
Clickety-click; disabling checksum offload

solaris-1:/root# telnet 172.16.2.1 22
Trying 172.16.2.1...
Connected to 172.16.2.1.
Escape character is '^]'.
SSH-2.0-OpenSSH_4.3

Uh! The joy! Too bad that, two minutes after I figured this out, Leonidas had already signed off for the day and I can only brag about this on my blog 🙂
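A side note on the one-liner above: it appends to /etc/system every time it is run, so repeating it leaves duplicate lines behind. Here is a minimal sketch of a more careful variant, assuming a POSIX shell and a grep that supports -q, -x and -F (on Solaris that may mean the XPG4 grep rather than the default one). The function name add_system_setting is made up for illustration, and the demo works on a scratch file, not on the real /etc/system.

```shell
# Hypothetical helper (not from the original post): append a setting
# line to a file only if that exact line is not already present.
add_system_setting() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demo on a scratch file so nothing on the system is touched.
scratch=$(mktemp)
add_system_setting "$scratch" "set ip:dohwcksum=0"
add_system_setting "$scratch" "set ip:dohwcksum=0"   # second call is a no-op
cat "$scratch"
rm -f "$scratch"
```

On a real host the first argument would be /etc/system. As far as I know that file is only read at boot, so the setting should not be expected to apply before a restart.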


Going live

March 18, 2010

A go-live for a new product is always a rewarding experience. Months of development and QA culminate in the first customer adopting the product and its subscribers starting to use it. A “switch” of some sort in a load balancer is flipped and traffic starts flowing through your newly deployed solution. Engineers are happy that months of hard work are finally getting some real mileage. Sales is happy that revenue will be recognized soon. Professional Services are making travel reservations for the location of the next customer. And then …

And then all hell breaks loose. Stability issues that the QA process had missed are identified. Corner-case scenarios that no one could think of are run into and expose holes in the design or implementation. Customer issues are hard to reproduce in your own lab for investigation and fixing, requiring you to hack up tests in a language you haven’t touched in ages. You pound at code with undocumented memory allocator settings. You find yourself reaching for the ancient scriptures, trying to troubleshoot and nail down an issue. You’re analyzing data from the now-live systems in real time, trying to gauge the magnitude of the problem and identify whether it’s a software or an integration issue. You find yourself working past midnight and still waking up at 06:00 am. You try to figure out workarounds until the hard-working developers manage to fix the underlying problems. You come up with aggressive acceptance test plans that balance the need for quality against quick turnaround. You operate under pressure, with the customer requesting an immediate workaround so as not to roll back to the previously installed solution.

We have very good news. Both actions against the problem work as planned. This means that the highest priority issue of the customer is now solved.

You grab a cold beer. You can feel the lack of sleep creeping into your aging bones and mind, yet you can’t really sleep. You remember a similar crisis a couple of years ago, with another product and another customer. You smile and think: I’m still not too old for this shit.

[Dedicated to Angie, Nikos, Petros and everyone else hard at work during the last 3 days; couldn’t have done it without you, team]