Xen guests isolation and ebtables concurrency

Concurrency is evil

When you use bridging for Xen networking and your guest machines (domUs in Xen parlance) are fully managed by third parties, some sort of isolation is especially needed. A rogue admin can change the IP and/or MAC address(es) assigned to their domU and potentially cause an IP address conflict.

Xen provides a script called vif-bridge that takes care of adding the domU’s virtual interfaces to dom0’s bridge, bringing them up and adding iptables rules that allow datagrams whose source is one of the assigned IP address(es) to come in through the domU’s virtual interfaces.

Those iptables rules might not be enough. They don’t enforce usage of the assigned MAC addresses and could interfere with the currently deployed firewall. Another point, in my opinion, is that these address policies belong to the Link Layer (bridge decision) instead of the Network Layer (see PacketFlow), so I prefer to have them enforced with ebtables.

I then picked one of the existing vif-bridge scripts that use ebtables and adapted it to allow only traffic for the assigned IP/MAC pairs and ARP requests/replies.
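To give an idea of what such a policy looks like, here is a minimal dry-run sketch, not the actual adapted script. It only prints the ebtables commands for one virtual interface instead of running them. The function name print_isolation_rules and the sample interface, MAC and IP values are made up for illustration, and the rules go straight into FORWARD rather than into per-interface chains.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) ebtables rules that restrict one
# domU virtual interface to its assigned IP/MAC pair plus ARP.
# print_isolation_rules and the sample values below are hypothetical.
print_isolation_rules() {
    vif=$1 mac=$2 ip=$3
    # From the domU: IPv4 frames must carry the assigned MAC and source IP.
    echo "ebtables -A FORWARD -i $vif -s $mac -p ip4 --ip-src $ip -j ACCEPT"
    # From the domU: ARP must announce the assigned IP/MAC pair.
    echo "ebtables -A FORWARD -i $vif -s $mac -p arp --arp-ip-src $ip --arp-mac-src $mac -j ACCEPT"
    # To the domU: IPv4 addressed to its MAC and IP, and ARP asking for its IP.
    echo "ebtables -A FORWARD -o $vif -d $mac -p ip4 --ip-dst $ip -j ACCEPT"
    echo "ebtables -A FORWARD -o $vif -p arp --arp-ip-dst $ip -j ACCEPT"
    # Anything else crossing this interface is dropped.
    echo "ebtables -A FORWARD -i $vif -j DROP"
    echo "ebtables -A FORWARD -o $vif -j DROP"
}

print_isolation_rules vif1.0 00:16:3e:00:00:01 192.0.2.10
```

Printing the commands first makes it easy to review the policy before letting a hotplug script apply it.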

After deploying the adapted vif-bridge, domU creation began to fail randomly. Some debug code added at the beginning of the script revealed some bizarre errors:

+ ebtables -F veth2250_IN
ebtables v2.0.9-2:communication.c:388:--BUG--:
Couldn't update kernel counters
++ sigerr

+ ebtables -N veth639a_IN
+ ebtables -P veth639a_IN DROP
Chain 'veth639a_IN' doesn't exist.
++ sigerr

+ ebtables -A veth639_OUT -p arp --arp-ip-dst -j ACCEPT
+ ebtables -A veth639_OUT -p arp --arp-ip-dst -j ACCEPT
The kernel doesn't support a certain ebtables extension, consider recompiling your kernel or insmod the extension.
++ sigerr

As you can see, those ebtables errors are triggered by correct, trivial calls. To make it worse, the chain, interface and rule names varied from one error to the next. Searching for help on “Couldn’t update kernel counters” or “communication.c:388:--BUG--:” didn’t turn up anything useful.

While debugging, I learned that Xen runs one instance of vif-bridge for each defined network interface and that they all run in parallel. All my domUs have two virtual network interfaces defined.

At that point I had no clue about the cause of the problem. I decided to upgrade ebtables to rule out the usual “make sure you’re running the latest version” support advice (squeeze ships the v2.0.9-2 seen in the errors above; upstream is 2.0.10). With the new version I began to see this new error in the logs:

+ ebtables -A FORWARD -o veth2450 -p ip4 -d 00:16:3d:1c:26:4a --ip-dst -j ACCEPT
Unable to update the kernel. Two possible causes:
1. Multiple ebtables programs were executing simultaneously. The ebtables
   userspace tool doesn't by default support multiple ebtables programs running
   concurrently. The ebtables option --concurrent or a tool like flock can be
   used to support concurrent scripts that update the ebtables kernel tables.
2. The kernel doesn't support a certain ebtables extension, consider
   recompiling your kernel or insmod the extension.

After reading this I immediately understood what was happening. That error description couldn’t be clearer, and I thank the upstream author for it. I had never considered a concurrency problem in ebtables, not even after seeing random, illogical errors generated by trivial rules.

--concurrent is only available in 2.0.10, so I took the flock way; the fixed script is here.
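The flock approach can be as small as the following sketch. It is not the fixed script itself, just one way to serialize every call, assuming flock(1) from util-linux is installed. The variable names EBTABLES_LOCK and EBTABLES_BIN and their default paths are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: serialize every ebtables call through one lock file using flock(1).
# EBTABLES_LOCK and EBTABLES_BIN are hypothetical names chosen for this example;
# adjust the default paths to your system.
EBTABLES_LOCK=${EBTABLES_LOCK:-/var/lock/ebtables.lock}
EBTABLES_BIN=${EBTABLES_BIN:-/sbin/ebtables}

# Shadow the command name so existing "ebtables ..." lines need no changes.
ebtables() {
    # flock takes an exclusive lock on the file, runs the command while
    # holding it, and exits with the command's own exit status.
    flock "$EBTABLES_LOCK" "$EBTABLES_BIN" "$@"
}
```

With the function defined near the top of the script, a line such as `ebtables -N veth639a_IN` waits for any other instance holding the lock before touching the kernel tables. All scripts that modify the tables have to use the same lock file for this to work.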

Later I found the problem description in ebtables’ basic examples page:

Updating the ebtables kernel tables is a two-phase process. First, the userspace program sends the new table to the kernel and receives the packet counters for the rules in the old table. In a second phase, the userspace program uses these counter values to determine the initial counter values of the new table, which is already active in the kernel. These values are sent to the kernel which adds these values to the kernel’s counter values. Due to this two-phase process, it is possible to confuse the ebtables userspace tool when more than one instance is run concurrently. Note that even in a one-phase process it would be possible to confuse the tool.
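The following toy model shows how this kind of confusion can arise. It is not real ebtables code: a plain file stands in for the kernel table, and it assumes each invocation reads the table, edits a private copy and writes the whole thing back. The chain names are made up. The interleaving is forced by hand so the result is deterministic.

```shell
#!/bin/sh
# Toy model of the race, NOT real ebtables code: a file plays the kernel table,
# and each "instance" works on a private copy before writing the whole table back.
table=$(mktemp)      # stand-in for the kernel table
copy_a=$(mktemp)     # private copy held by instance A
copy_b=$(mktemp)     # private copy held by instance B
echo "chain FORWARD" > "$table"

# Both instances read the table before either one writes.
cp "$table" "$copy_a"
cp "$table" "$copy_b"

# A adds a chain to its copy and commits the whole table.
echo "chain vifA_IN" >> "$copy_a"
cp "$copy_a" "$table"

# B does the same with its stale copy, silently discarding A's chain.
echo "chain vifB_IN" >> "$copy_b"
cp "$copy_b" "$table"

cat "$table"
rm -f "$copy_a" "$copy_b"
```

The final table contains vifB_IN but not vifA_IN. If instance A now tried to set a policy on its chain, it would find the chain missing, which resembles the “Chain 'veth639a_IN' doesn't exist.” error shown earlier.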

It might be very difficult to reproduce the errors shown above unless your domUs have more than one network interface and your vif-bridge script has more than a few ebtables rules.

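Summarizing: if your scripts call ebtables and you see random, illogical errors like the ones above coming from trivial, correct rules, chances are high that those scripts are running ebtables concurrently. Just fix them.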