Solaris + xenserver + ovswitch

February 28, 2013

This has been troubling me for quite some time; hopefully someone else can save a few hours by bumping into this post.

For some reason my Solaris 10 Virtual Machines on Xenserver failed when the Distributed Virtual Switch Controller was also running. I didn’t really troubleshoot the issue until recently since I could live without cross-server private networks. This no longer being the case I decided to look into it again.

Fast forward a couple of hours: after losing quite some time trying various tricks on the VM (disabling NIC checksum offload, lowering the MTU, etc.) to no avail, I concluded that it must be a hypervisor issue. Digging into the openvswitch tools revealed something interesting.

[root@xenserver ~]# ovs-vsctl list-ports xapi25

Specifically, for my Linux VMs only a vifX.Y interface was being added to the bridge, while for my Solaris ones both a tapX.Y and a vifX.Y were added. Clickety-click.

[root@xenserver ~]# ovs-vsctl del-port xapi25 tap47.0

Et voila! Network connectivity to the Solaris VM works like a charm. Now to make this change permanent:

[root@xenserver ~]# diff /etc/xensource/scripts/vif.orig /etc/xensource/scripts/vif
> if [[ $dev != tap* ]]; then
> $vsctl --timeout=30 -- --if-exists del-port $dev -- add-port $bridge $dev $vif_details
> else
> echo Skipping command $vsctl --timeout=30 -- --if-exists del-port $dev -- add-port $bridge $dev $vif_details
> fi

I am not really certain of the ugly side-effects that this may have. But it does the trick for me.

Update 2013/03/10: A better workaround is to have the above behavior apply only to Solaris VMs. For example, assuming that these are based on the “Solaris 10 (experimental)” template, the following snippet skips the offending command only for the Solaris VMs:

if [[ $dev != tap* ]]; then
    $vsctl --timeout=30 -- --if-exists del-port $dev -- add-port $bridge $dev $vif_details
else
    # tap device: add it to the bridge unless the VM is a Solaris 10 guest
    xe vm-list dom-id=$DOMID params=name-description | grep 'Solaris 10' >/dev/null 2>&1 || \
        $vsctl --timeout=30 -- --if-exists del-port $dev -- add-port $bridge $dev $vif_details
fi
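The decision described in the update can also be isolated in a small function, which makes it easy to check away from a live host. This is only a sketch: should_add_port is a hypothetical name, and it assumes the VM description contains the string "Solaris 10", as it does for VMs created from the template mentioned above.

```shell
# Hypothetical helper: decide whether a device should be added to the bridge.
# $1 = device name (e.g. vif47.0 or tap47.0), $2 = VM name-description
# Returns 0 (add the port) or 1 (skip it).
should_add_port() {
    case "$1" in
        tap*)
            # tap devices are skipped only for Solaris 10 guests
            case "$2" in
                *"Solaris 10"*) return 1 ;;
                *) return 0 ;;
            esac
            ;;
        *)
            # vif (and any other) devices are always added
            return 0
            ;;
    esac
}

# Example: prints "skip" for the tap device of a Solaris guest
if should_add_port tap47.0 "Solaris 10 (experimental)"; then
    echo add
else
    echo skip
fi
```

In the vif script the second argument would come from the xe vm-list call shown above.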

A Linux based firewall sandwich

March 7, 2012

While Linux has well documented server based load balancing features for both Layer-7 and Layer-4, there is little documentation on how one can make a firewall sandwich. Somehow it seems that this is a hard enough problem that even long time LVS contributor Roberto Nibali has been quoted as saying on the topic “Buy a commercial load balancer and be done with it. Spend the spare time with your wife and kids or go to the pub with your buddies”. Problem is I don’t have a wife, and my girlfriend is too far away, so …

The 40,000ft. view

The following diagram captures the 40,000 feet view of the testbed in use. In short we have:

  1. a home-brew load generator simulating thousands of HTTP clients and servers (courtesy of my colleagues Alex and Leonidas)
  2. a couple of Solaris based firewalls
  3. a Linux load balancer running LVS in direct routing mode
  4. a Linux router

The firewalls setup

While the Solaris boxes run a proprietary dayjob product, they can be treated as regular firewalls. That is, assuming an HTTP request goes through firewall-1, the HTTP response should also come back through firewall-1. In the event the response returns through firewall-2, firewall-2 will drop it altogether. Hence one should be able to reproduce the setup by replacing them with an arbitrary Linux firewall that:

  1. accepts new connections on the ingress interface (VLAN 162)
  2. rejects packets on the egress interface (VLAN 165) that do not belong to an existing connection

Putting aside the above, the networking setup of the firewalls is pretty straightforward: configure the interfaces and add the routes to the HTTP client and server subnets.

firewall-1# more /etc/hostname.e1000g16*
firewall-1# cat /etc/inet/static_routes
# File generated by route(1M) - do not edit.

The load balancer

Linux Virtual Server was chosen for the load balancer setup. I installed Ubuntu 11.10 and then a couple of extra packages:

lvs# apt-get -y install vlan ipvsadm

Once done I configured the ingress interface to communicate with the HTTP clients:

lvs# tail -10 /etc/network/interfaces
iface eth1 inet manual
up ifconfig eth1 up

auto eth1.162
iface eth1.162 inet static
vlan_raw_device eth1

Then I proceeded to the load balancer setup. I want my firewalls to handle all web traffic (rather than traffic to a specific VIP), so I had to use the firewall mark approach. I mark all HTTP related packets:

lvs# iptables -t mangle -N DIVERT
lvs# iptables -t mangle -A PREROUTING -i eth1.162 -p tcp --dport 80 -j DIVERT
lvs# iptables -t mangle -A DIVERT -j MARK --set-mark 1
lvs# iptables -t mangle -A DIVERT -j ACCEPT

then load balance them to my firewalls:

lvs# ipvsadm -A -f 1 -s sh
lvs# ipvsadm -a -f 1 -r -g -w 100
lvs# ipvsadm -a -f 1 -r -g -w 100

then make sure that my packets do get delivered to IPVS, even if their destination IP is not on a local interface (this caused a never-ending frustration till I found it):

lvs# ip rule add fwmark 1 lookup 100
lvs# ip route add local dev lo table 100

then add a route back to the HTTP clients via the appropriate interface (beats me why it’s needed, perhaps a Linux networking expert can explain why):

lvs# ip route add via

That’s it. Firing up my load generator I get a bunch of packets on eth1.162 which get load balanced to my two firewalls. The firewalls propagate the packet to the egress router, the egress router gets a response …
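The load balancer steps above can be collected into one place. The following is a dry-run sketch that only prints the commands rather than running them. The 192.0.2.x addresses are documentation placeholders and not the testbed's real values, lvs_setup_commands is a hypothetical name, and the 0.0.0.0/0 prefix in the local route is my assumption about the usual form of this policy-routing trick.

```shell
# Dry-run sketch: print the LVS fwmark setup commands instead of running them.
# $1 = ingress interface, $2 = firewall mark, remaining args = firewall IPs.
# All addresses are placeholders (assumptions), not the testbed's values.
lvs_setup_commands() {
    in_if=$1
    mark=$2
    shift 2
    # mark all HTTP packets arriving on the ingress interface
    echo "iptables -t mangle -N DIVERT"
    echo "iptables -t mangle -A PREROUTING -i $in_if -p tcp --dport 80 -j DIVERT"
    echo "iptables -t mangle -A DIVERT -j MARK --set-mark $mark"
    echo "iptables -t mangle -A DIVERT -j ACCEPT"
    # one fwmark virtual service, source hashing, direct routing to each firewall
    echo "ipvsadm -A -f $mark -s sh"
    for rip in "$@"; do
        echo "ipvsadm -a -f $mark -r $rip -g -w 100"
    done
    # deliver marked packets locally so that IPVS gets to see them
    echo "ip rule add fwmark $mark lookup 100"
    echo "ip route add local 0.0.0.0/0 dev lo table 100"
}

lvs_setup_commands eth1.162 1 192.0.2.11 192.0.2.12
```

Piping the output to sh as root would apply it, but reviewing the printed commands first is the whole point of the dry run.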

The egress router

And all it has to do is return the response via the originating firewall. It’s pretty much impossible to determine the originating firewall through L4 criteria; sure enough, if you use source-IP hashing on ingress you can do destination-IP hashing on egress. But what if you’re doing round robin distribution?

Towards this end the simplest way to solve this issue is via MAC persistence. If a request came through MAC address “11:22:33:44:55:66” then return the response through the same MAC. So all I need to do is write down the MAC addresses of the firewalls’ egress interfaces. Damn, I love iptables & iproute2:

egress-router# iptables -A PREROUTING -t mangle -j CONNMARK --restore-mark
egress-router# iptables -A PREROUTING -t mangle -m mark ! --mark 0 -j ACCEPT
egress-router# iptables -t mangle -A PREROUTING -i eth2.165 -m mac --mac-source 00:0c:29:ef:a3:29 -j MARK --set-mark 1
egress-router# iptables -t mangle -A PREROUTING -i eth2.165 -m mac --mac-source 00:0c:29:c7:93:96 -j MARK --set-mark 2
egress-router# iptables -A POSTROUTING -t mangle -j CONNMARK --save-mark
egress-router# echo "101 firewall1" >> /etc/iproute2/rt_tables
egress-router# echo "102 firewall2" >> /etc/iproute2/rt_tables
egress-router# ip rule add fwmark 1 table firewall1
egress-router# ip rule add fwmark 2 table firewall2
egress-router# ip route add default via table firewall1
egress-router# ip route add via table firewall1
egress-router# ip route add default via table firewall2
egress-router# ip route add via table firewall2

What does the above snippet do?

  1. it uses a separate connection mark for traffic coming from each firewall; mark-1 for firewall-1, mark-2 for firewall-2
  2. it adds a couple of iproute2 routing tables, one for each firewall
  3. it adds a couple of iproute2 rules to look up the appropriate routing table, depending on the connection mark
  4. it adds all appropriate routes to the firewall specific routing tables
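The four steps above repeat the same pattern per firewall, so they can be generated in a loop. This is a dry-run sketch that prints the commands only. egress_commands is a hypothetical name, the gateway addresses are placeholders, and the firewallN table names are assumed to be registered in /etc/iproute2/rt_tables as shown earlier.

```shell
# Dry-run sketch: print per-firewall marking and routing commands.
# $1 = interface facing the firewalls, then triples: mark MAC gateway
# Gateway addresses are placeholders (assumptions), not the testbed's.
egress_commands() {
    dev=$1
    shift
    # reuse the mark of an already-tracked connection and stop there
    echo "iptables -t mangle -A PREROUTING -j CONNMARK --restore-mark"
    echo "iptables -t mangle -A PREROUTING -m mark ! --mark 0 -j ACCEPT"
    while [ $# -ge 3 ]; do
        mark=$1
        mac=$2
        gw=$3
        shift 3
        # new connection: mark it after the firewall it arrived through
        echo "iptables -t mangle -A PREROUTING -i $dev -m mac --mac-source $mac -j MARK --set-mark $mark"
        # route marked traffic back through that same firewall
        echo "ip rule add fwmark $mark table firewall$mark"
        echo "ip route add default via $gw table firewall$mark"
    done
    # remember the packet mark on the connection for later packets
    echo "iptables -t mangle -A POSTROUTING -j CONNMARK --save-mark"
}

egress_commands eth2.165 \
    1 00:0c:29:ef:a3:29 192.0.2.1 \
    2 00:0c:29:c7:93:96 192.0.2.2
```

Adding a third firewall then only means appending one more mark, MAC and gateway triple.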

That’s it!

The above is enough to make the firewall sandwich work. Sure enough my load generator reports no errors, my firewall logs report traffic in both firewalls with no errors, and ipvsadm reports requests being load balanced:

lvs# ipvsadm -l --stats
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Conns InPkts OutPkts InBytes OutBytes
-> RemoteAddress:Port
FWM 1 674 9579 0 609892 0
-> 367 3853 0 261303 0
-> 307 5726 0 348589 0

Everything looks great!

What’s missing

Actually a lot. This was just a prototype to make a point to a C-level exec. Notably missing are server health checking and load balancer failover. But these are well-understood problems with robust solutions one can easily google for.

Solaris, VMWare & VGT mode

February 16, 2012

Today I had the strangest of problems. In a VMWare based testbed with a bunch of mixed systems (F5 Virtual appliances, a Linux host, 3 Solaris servers) I was facing severe connectivity issues with the Solaris hosts. Specifically, with all systems connected on VLAN 162 (L3 addressing: anything TCP related from the Solaris hosts failed. F5 and linux Virtual machines had no problem whatsoever.

I quickly fired up my trusted tcpdump tool to figure out what’s wrong. Then I issued a simple ICMP ping from a Solaris host to the load balancer to see what happens:

solaris-1# ping is alive

[root@loadbalancer-1:Active] config # tcpdump -q -n -i ingress-lb not arp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ingress-lb, link-type EN10MB (Ethernet), capture size 108 bytes
19:43:05.048167 IP > ICMP echo request, id 12766, seq 0, length 64
19:43:05.048215 IP > ICMP echo reply, id 12766, seq 0, length 64

Nice. ICMP works. Everything looks nice in the packet capture. Now let’s try some TCP traffic for a change:

solaris-1:/root# telnet 22

[root@loadbalancer-1:Active] config # tcpdump -q -n -i ingress-lb not arp
19:44:06.816663 IP > tcp 0
19:44:07.949006 IP > tcp 0
19:44:10.215576 IP > tcp 0
19:44:14.730324 IP > tcp 0
19:44:23.739898 IP > tcp 0

Du-oh. The packet reaches the load balancer alright but the source IP is corrupted. Googling didn’t really help; other people have run into this or similar issues but found no solution. Pinging my skilled colleague Leonidas didn’t help either; he was as baffled as I was. And then it hit me.

solaris-1# echo "Clickety-click; disabling checksum offload" && echo "set ip:dohwcksum=0" >> /etc/system
Clickety-click; disabling checksum offload

solaris-1:/root# telnet 22
Connected to
Escape character is '^]'.

Uh! The joy! Too bad that 2′ after I figured this out Leonidas had signed off for the day and I can only brag about this in my blog 🙂
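One caveat with the echo ... >> /etc/system one-liner above is that running it twice leaves the setting in the file twice. A minimal sketch of an idempotent variant follows. ensure_line is a hypothetical helper, and it assumes a POSIX grep with -q, -x and -F (on Solaris 10 that is, as far as I know, /usr/xpg4/bin/grep rather than /usr/bin/grep).

```shell
# Append a line to a file only if that exact line is not already present.
# $1 = file, $2 = line to ensure
ensure_line() {
    # -x matches whole lines only, -F treats the text as a fixed string
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Intended use (as root):
#   ensure_line /etc/system "set ip:dohwcksum=0"
```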

Apparmor (synonyms: selinux, crap)

February 8, 2012

Today’s fun was with apparmor. What was a simple MySQL statement to load a bunch of data from a file to a database:

mysql> LOAD DATA INFILE '/var/tmp/some_log_file'
-> INTO TABLE entries
ERROR 29 (HY000): File '/var/tmp/' not found (Errcode: 13)

… was constantly failing for no good reason. It took something like 30′ of pointless online searching until it hit me:

# tail -0f /var/log/syslog
Feb 8 19:11:44 hs21-a kernel: [15359.215686] type=1400 audit(1328721104.742:113): apparmor="DENIED" operation="open" parent=1 profile="/usr/sbin/mysqld" name="/var/tmp/" pid=15623 comm="mysqld" requested_mask="r" denied_mask="r" fsuid=105 ouid=0

Well I guess it’s just like SELinux. There is a parallel universe out there where apparmor just works. Just not this one.
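For what it's worth, the denial line already says which profile blocked which path, so pulling those fields out is a quick way to see what has to be allowed (or where the file should be moved instead). A small sketch, with apparmor_field being a hypothetical helper name:

```shell
# Extract one key="value" field from an AppArmor audit line read on stdin.
# $1 = field name, e.g. profile, name or denied_mask
apparmor_field() {
    # the leading space keeps denied_mask from matching inside other keys
    sed -n "s/.* $1=\"\([^\"]*\)\".*/\1/p"
}

# Example: list the denied paths seen so far
#   grep 'apparmor="DENIED"' /var/log/syslog | apparmor_field name
```

Run against the log line above, the profile field gives /usr/sbin/mysqld and the name field gives /var/tmp/.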



UPS Greece, you suck

December 27, 2011

Dear UPS,

I am not really certain if your Greek subsidiary is a partner that just carries your trademark or a full subsidiary … but it just outright sucks. And you may want to look into it.

I could waste a number of keystrokes on the matter, but a picture is worth a thousand words:

2 working days (20 & 21 December) for a package to travel from the US all the way to Greece. Hopefully 4 … FOUR! working days (22, 23, 27, 28 December) for the package to travel another ~440km from Athens to Katerini (1 hour away from Thessaloniki, the 2nd largest city of Greece).

On corporate responsibility and a shitty thomson TG585v8 DSL modem

December 27, 2011

Last night I ran into the strangest of problems. Having finally set up my HTPC in my living room I tried out XBMC and Constellation to conveniently control it from my iPad. Clickety-click … and fail!

After struggling for something like 15′ trying in vain to figure out what idiotic mistake I had made I pulled out my laptop. After another 30′ or so, being unable to contact my HTPC through my laptop too, I found out that not even ARP was working. Afraid of a rootkit I started installing Wireshark on the HTPC. And 5′ later I found out, to my surprise, that ARP broadcast requests were not even reaching the HTPC (?!?!).

Some googling revealed that other people are facing the same problem: ARP simply fails with this DSL modem. And there is little info on whether this is a persistent problem. I can only tell that the problem was temporarily fixed by changing the encryption to WPA2 (vs. WPA+WPA2).

Who is to blame here? I will stand by my initial reaction. OTE, the largest ISP in Greece. True, they don’t build the firmware but they have selected the hardware, are shipping it and are getting paid for it [*]. And if anyone still thinks that it’s not OTE to blame …

… I rest my case.

[*] One may argue that you get this specific CPE for free. Which is as free as a “free mobile phone with a two year contract”. Not free at all.

Selinux & POLA

July 21, 2011

Selinux is crap.Sorry redhat fun boys but its true.Not even in redhat’s documentation doesnt have enough info.

via E.Balaskas

My own experience with SELinux today? A Virtual Machine with a forgotten root password. OK, that’s easy, boot in single user mode, type passwd(1), enter the new root password, reboot. I mean the process is documented in a shitload of pages (example) and has been working like that since … I don’t know 1996? Should be a piece of cake, right?


You see this is SELinux. There are procedures to follow, “passwd root” just won’t work in single user mode and will exit immediately without a prompt. A well-defined procedure that has been working for ages is now broken. Oh well …

# echo 0 >/selinux/enforce
# passwd root
Changing password for user root.
New password:

Oh-well I am fairly certain that there is one out of more than a billion parallel universes where SELinux just works. Just one though.

References: POLA

Oracle VM server and RHEL-6 paravirtualized domU

July 14, 2011

This cost me something like 10′ of google search and 15′ troubleshooting. Writing it down so that it can cost the next person just 2′ of google search 🙂

Setting up Red Hat Enterprise Linux 6 (hereafter RHEL6) as a paravirtualized guest is well documented. However, the virt-install command generates a 404 error when run on an Oracle VM server. I used tcpdump(8) to promptly discover that virt-install attempts to retrieve /images/xen/vmlinuz instead of the proper /isolinux/vmlinuz. Clickety-click:

# pwd

# diff
                 kernel = grabber.urlopen("%s/images/xen/vmlinuz"
                 initrd = grabber.urlopen("%s/images/xen/initrd.img"
<                 kernel = open("%s/isolinux/vmlinuz" %(nfsmntdir,), "r")
                 kernel = open("%s/images/xen/vmlinuz" %(nfsmntdir,), "r")
>                 initrd = open("%s/images/xen/initrd.img" %(nfsmntdir,), "r")

Then firing up virt-install again did the trick (remember to choose a suitable mirror):

# virt-install -n centos6 -r 2048 -f /OVS/publish_pool/centos6.disk.0 \
  --os-type=linux --vnc -p -l \ -b br0 -d

Extra notes: [1] [2]. I only used ext2 for the /boot filesystem but YMMV.

vpnc & windows 7: sleep a little bit

February 18, 2011

For quite some time I’ve been using vpnc within Cygwin to connect to the aging Cisco VPN 3000 Series Concentrator at dayjob (I originally thanked Cisco here for not supporting 64-bit users; as Ilias points out in the comments, Cisco has finally added partial support for Windows 7 64-bit). However, I’ve been running into an erratic problem where my split tunnels were created incorrectly and didn’t work. Specifically, once a VPN connection got created, route print indicated routes similar to the following:

#route print
Network Destination        Netmask          Gateway       Interface  Metric     31

instead of the proper one:

Network Destination        Netmask          Gateway       Interface  Metric         On-link     31

I’ve conveniently ignored the problem for some time, using a custom script to tear down and re-create the problematic routing entries, till today. Some well placed “echos” in /etc/vpnc/vpnc-script-win.js indicated that vpnc properly constructed the required route add commands, yet the routing table entries were still wrong. Clickety-click:

$ diff /etc/vpnc/vpnc-script-win.js /etc/vpnc/vpnc-script-win-BEDC.js
$ diff -U 1 /etc/vpnc/vpnc-script-win.js /etc/vpnc/vpnc-script-win-BEDC.js
--- /etc/vpnc/vpnc-script-win.js        2010-09-18 13:13:25.778339100 +0300
+++ /etc/vpnc/vpnc-script-win-BEDC.js   2011-02-18 21:35:53.279264500 +0200
@@ -80,2 +80,4 @@
         if (env("CISCO_SPLIT_INC")) {
+               echo("sleeping a little bit; don't ask why but this is needed");
+               run("sleep 5");
                for (var i = 0 ; i < parseInt(env("CISCO_SPLIT_INC")); i++) {

Seems that a timing issue of some sort causes these route add commands to run prematurely, before the TAP tunnel interface is properly configured, resulting in a problematic configuration. Holding them back for just 5 seconds consistently does the trick for me.
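A fixed 5 second sleep works here, but the same idea can be expressed as "wait until something is ready, with an upper bound". A minimal sketch of such a helper in shell follows. wait_until is a hypothetical name, and which check actually indicates that the TAP interface is ready is not something I have established, so the probe command is left as a parameter.

```shell
# Retry a command once per second until it succeeds or the timeout expires.
# $1 = timeout in seconds, remaining args = the probe command to run
# Returns 0 as soon as the probe succeeds, 1 if it never did.
wait_until() {
    remaining=$1
    shift
    while ! "$@" >/dev/null 2>&1; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Example: give some (hypothetical) readiness check up to 10 seconds
#   wait_until 10 some_readiness_check || echo "interface still not ready"
```

The trade-off is that a good probe returns as soon as the interface is up, whereas the plain sleep always costs the full delay.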

Update: if generally interested in configuring VPNC with Windows, check out Alessio Molteni’s detailed post.
Update 2: Corrected status of the official Cisco VPN client.

Opennebula: dhcpd contextualization magic

February 17, 2011

One of the most frequent questions on the Opennebula lists relates to network contextualization of Virtual Machines (VMs). Specifically, contrary to Eucalyptus or Nimbus, Opennebula does not directly manage a DHCP server. Instead Opennebula:

  • suggests using a simple rule for extracting the IPv4 address from the MAC address within the VM
  • manages just MAC addresses

This moves the burden of IPv4 configuration to the VM operating system, which has to dynamically calculate the IPv4 address details based on each interface MAC address. While Opennebula provides a relevant sample VM template and script to do this, it comes up a little bit short. Specifically, the script is Linux-specific and will probably not work with other Unix O/S like Solaris or FreeBSD, let alone Windows. In addition, extra work is required in order to configure additional but required network parameters, like a default router or a DNS server.
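To make the MAC-to-IPv4 rule concrete, here is a minimal sketch. It assumes the convention that the last four octets of the MAC address are the IPv4 octets in hexadecimal, preceded by a two-octet prefix. mac_to_ipv4 is a hypothetical name and the example MAC is made up.

```shell
# Derive an IPv4 address from a MAC whose last four octets encode it.
# $1 = MAC address such as 02:00:c0:a8:01:0a
mac_to_ipv4() {
    echo "$1" | {
        # split on ':'; the first two octets are the prefix and are ignored
        IFS=: read -r _p1 _p2 a b c d
        # printf converts each hexadecimal octet to decimal
        printf '%d.%d.%d.%d\n' "0x$a" "0x$b" "0x$c" "0x$d"
    }
}

mac_to_ipv4 02:00:c0:a8:01:0a   # prints 192.168.1.10
```

Note that this covers the address only; the netmask, default router and DNS server still have to come from somewhere else.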