Description of problem:
On a host with multiple physical network interfaces connected to multiple switches, STP must run on the bridge interface to prevent network loops. Virtual guests connecting to that bridge then encounter a forwarding delay when their virtual interface is created. That delay is likely to be longer than the PXE/DHCP wait time in the gPXE BIOS, so guests fail when they attempt to boot from the network. We need either a configurable startup delay between interface creation and guest startup, or the code which creates the guest's virtual network interface should query the bridge for the STP forwarding delay and automatically wait until the interface enters the forwarding state before allowing guest startup to continue. Alternatively, we need a configurable delay in the gPXE DHCP client so it can wait long enough to get an address after the new interface enters the forwarding state.

Version-Release number of selected component (if applicable):
All of RHEL6 appears to be affected. Tested on 6.2 with libvirt-0.9.4-23.el6_2.4.x86_64, kernel-2.6.32-220.4.1.el6.x86_64.

How reproducible:
100%

Steps to Reproduce:
1. Connect a host to two network switches via two physical interfaces, and configure one bridge interface connecting those two physical interfaces. Do NOT disable STP on the bridge. Use 'brctl showstp <bridgename>' to observe the forwarding delay and other parameters learned from the root bridge.
2. Create a virtual guest connecting via the host bridge interface.
3. Start the guest and observe the DHCP timeout in the guest. /var/log/messages will show the 'vnet' interface going through the 'listening', 'learning', and 'forwarding' states. The transition to 'forwarding' happens after the guest's DHCP timeout.

Actual results:
Guest fails gPXE DHCP.

Expected results:
Guest should get a DHCP response and boot.
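The relationship between the bridge's forward delay and the total wait a new port suffers can be sketched in shell. This is a hedged sketch, not anything libvirt does today: it assumes the sysfs bridge interface (where time values are reported in centiseconds / USER_HZ ticks), and the bridge name intbr0 is taken from this report.

```shell
#!/bin/sh
# Sketch only: estimate how long a newly attached bridge port will stay
# non-forwarding. Assumes sysfs reports forward_delay in centiseconds.

stp_wait_seconds() {
    # A new port spends forward_delay in "listening" and another
    # forward_delay in "learning", so the total wait is 2 * forward_delay.
    fd_cs=$1    # forward_delay in centiseconds
    echo $(( (2 * fd_cs) / 100 ))
}

# With a live bridge this would be read with something like:
#   fd_cs=$(cat /sys/class/net/intbr0/bridge/forward_delay)
stp_wait_seconds 1500    # the 15 s forward delay seen below -> ~30 s wait
```

That ~30 second figure is why the gPXE DHCP attempt times out before the port ever forwards traffic.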
Additional info:

# brctl show
bridge name	bridge id		STP enabled	interfaces
intbr0		8000.441ea103e078	yes		eth0
							eth2
							vnet3

# brctl showstp intbr0
intbr0
 bridge id		8000.441ea103e078
 designated root	60d8.7081058bdac0
 root port		   1			path cost		   4
 max age		  19.99			bridge max age		  19.99
 hello time		   1.99			bridge hello time	   1.99
 forward delay		  14.99			bridge forward delay	  14.99
 ageing time		 299.95
 hello timer		   0.00			tcn timer		   0.00
 topology change timer	   0.00			gc timer		   9.68
 hash elasticity	   4			hash max		 512
 mc last member count	   2			mc init query count	   2
 mc router		   1			mc snooping		   1
 mc last member timer	   0.99			mc membership timer	 259.96
 mc querier timer	 254.96			mc query interval	 124.98
 mc response interval	   9.99			mc init query interval	  31.24
 flags

# /var/log/messages after "virsh create Guest"
Feb  9 13:35:30 dmzsrv4 kernel: device vnet3 entered promiscuous mode
Feb  9 13:35:30 dmzsrv4 kernel: intbr0: port 4(vnet3) entering listening state
Feb  9 13:35:45 dmzsrv4 kernel: intbr0: port 4(vnet3) entering learning state
Feb  9 13:36:00 dmzsrv4 kernel: intbr0: topology change detected, sending tcn bpdu
Feb  9 13:36:00 dmzsrv4 kernel: intbr0: port 4(vnet3) entering forwarding state

# interface config from the Guest.xml file:
<interface type='bridge'>
  <mac address='52:54:00:84:94:04'/>
  <source bridge='intbr0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>
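The "wait until the interface enters forwarding state" idea from the description could be prototyped by polling the port's STP state in sysfs. A hedged sketch, not libvirt code: the numeric state codes (0=disabled, 1=listening, 2=learning, 3=forwarding, 4=blocking) are the kernel bridge's values, and the state-file path is a parameter so the loop can be exercised without a real bridge port.

```shell
#!/bin/sh
# Sketch: block until a bridge port reaches STP "forwarding" state (3)
# by polling its sysfs state file, with a timeout in seconds.

wait_for_forwarding() {
    state_file=$1
    timeout=${2:-60}    # seconds to wait before giving up
    while [ "$timeout" -gt 0 ]; do
        [ "$(cat "$state_file" 2>/dev/null)" = "3" ] && return 0
        sleep 1
        timeout=$((timeout - 1))
    done
    return 1
}

# Real use, for the vnet3 port in the log above, would be roughly:
#   wait_for_forwarding /sys/class/net/vnet3/brport/state 60
```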
See also Fedora bug 586324
As I understand it, setting the STP delay to 0 isn't acceptable in your case, because the bridge is connected to multiple physical devices and so might create a loop. It really is unfortunate that the STP delay is only configurable for the bridge as a whole and can't be set independently for each interface attached to it (and beyond that, as you point out in bug 586324, the delay setting from the root bridge on the network will eventually override any local config anyway). If individual interfaces could be attached to the bridge with stp=no or delay=0, our problem would be solved.

A bit of brainstorming before we resign ourselves to building a delay into guest startup:

Since a virt guest will obviously not create a loop (unless the guest itself also contains a bridge device that is in turn connected to multiple physical networks), there would be no problem with a delay of 0 when a guest's interface is attached - the time a non-0 delay is needed is when the original physical interfaces are attached to the bridge. Since a delay of up to 30 seconds would be fairly common, and a very long time to wait for a new guest to start, maybe we should instead look for a solution that attaches some other bridge-like device to the bridge just once at libvirtd startup (suffering the delay then), and connects the guests to that device rather than directly to the bridge. Currently Linux host bridges cannot be directly connected to each other, but possibly that could be remedied.

Or, a more efficient solution would be for the bridge device to allow turning STP off on a port-by-port basis (as mentioned above); the vnet interfaces could then be attached sans STP and immediately begin talking.

Yet another possibility: with the <network> types added in 0.9.4 which allow describing a libvirt network based on a host bridge, we now have the possibility of setting up a pool of pre-allocated tap devices for use by guests.
These tap devices could be created and attached to the bridge when the libvirt network is started; then, as each guest (configured with <interface type='network'>) started, it would request a tap device from the network's pool of pre-created taps rather than creating a new one. Since that tap would have already been connected to the bridge for some time, there would be no delay before it was usable.

==

In the end, while it's definitely a possibility if no other option shows itself as usable, implementing a policy of waiting to start the guest until $stp_delay seconds have elapsed since attaching the virtual interface will not only make guest startup painfully slow, it will also end up defeating STP in those rare cases where it would have actually been useful (i.e. a guest that has its own internal bridge, connected in some manner to multiple physical network segments). STP's attempts to learn the new network topology during the waiting period will be thwarted by the fact that no traffic is flowing across the new network; then the waiting period will end, the guest will boot, and it will immediately start forwarding traffic on its internal bridge.
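The pre-created tap pool idea above could be prototyped with ordinary tooling. A rough sketch under stated assumptions: `ip` and `brctl` are available, and the pool device names (pooltapN), bridge name (intbr0), and pool size are purely illustrative. DRY_RUN defaults to on so the commands are printed rather than executed; clearing it (DRY_RUN=) would require root and a real bridge.

```shell
#!/bin/sh
# Sketch: pre-create a pool of tap devices and attach them to the bridge
# once at network startup, so the STP forwarding delay is paid then
# rather than at each guest start. Names and pool size are illustrative.
BRIDGE=intbr0
POOL_SIZE=4
: "${DRY_RUN:=1}"    # default: print commands; set DRY_RUN= for real use

run() {
    # Execute the command, or just echo it when dry-running.
    if [ -n "$DRY_RUN" ]; then echo "$@"; else "$@"; fi
}

i=0
while [ "$i" -lt "$POOL_SIZE" ]; do
    run ip tuntap add dev "pooltap$i" mode tap
    run brctl addif "$BRIDGE" "pooltap$i"
    run ip link set "pooltap$i" up
    i=$((i + 1))
done
```

A guest configured with <interface type='network'> would then be handed one of these already-forwarding taps instead of a freshly created vnet device.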
Yeah, I agree the long delay is unfortunate, but only in the case of a true bridge, so people using the system in that mode will probably understand the need. If nothing else, if the DHCP timeout in the PXE BIOS were longer, or if it looped trying to find boot devices like most server BIOSes do, then it would eventually get its DHCP address and get on with booting.
Jeff, openvswitch support was added to the latest release of Fedora (17), so it's going to end up landing in RHEL. openvswitch supports disabling/enabling STP on a per-port basis. Given the same level of support in libvirt for openvswitch as currently exists for standard Linux host bridges, along with the ability to individually enable/disable STP for attached guest interfaces (from libvirt's XML config), would that be an acceptable solution? In other words, would you be willing to switch from standard host bridges to openvswitch in order to solve this problem? The alternative would seem to be convincing the kernel / bridge-utils people to add per-port configuration of STP to the standard host bridge, which would likely be a longer process.

(I've thought more about the idea of adding a 'start delay' as discussed above, and it has the additional problem that it doesn't help when migrating from one host to another - the already-running guest would suffer a network outage until the bridge on the destination host had gotten past the STP delay timeout; very definitely a non-starter.)
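For reference, the per-port control being proposed would look roughly like this with openvswitch's CLI. This is a hedged configuration sketch based on the ovs-vsctl / ovs-vswitchd.conf.db documentation of the Bridge stp_enable column and the Port other_config:stp-enable key; the bridge name is from this report and vnet3 stands in for a guest's tap device.

```shell
# Enable STP on the bridge as a whole (protects the physical uplinks)...
ovs-vsctl set Bridge intbr0 stp_enable=true

# ...but opt an individual guest port out of STP so it forwards
# immediately, with no listening/learning delay.
ovs-vsctl set Port vnet3 other_config:stp-enable=false
```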
I have no problem with switching to openvswitch or any other alternative as long as it's stable and performs correctly.

However, I'm not sure I agree that the migration issue is a non-starter. There is some downtime during migration anyway, and the bridge delay may even make more sense in this case, since you are migrating a MAC address and potentially causing a topology change as a result. I am 100% OK with the 30-second (or whatever delay is set in the root bridge) outage. I haven't looked in detail at how the migration process works step by step, but if the guest is created on the new host, including its new bridge interface, and then the RAM image is copied, the bridge delay may actually be hidden during the RAM copy time and there would be little if any effect from it.

At the end of the day, it seems like there is at least a simple workaround, if not a complete fix: simply make gPXE loop on its attempt to PXE-boot if there is no other boot device, as most servers do. I think there should be a limit on the number of loops so that the host doesn't end up burning CPU time on a worthless guest, but that limit could be set to something like 10 or 100 to give the network plenty of time to converge without impacting the host.
Laine, any thoughts on the current state of this? Is the answer just 'use openvswitch'?
I don't think anything has changed in this area, and likely the best (least bad?) answer isn't even "use openvswitch"; is the guest side of the PXE boot code in seabios? If so, maybe we could get a patch into that to make the timeout configurable (or at least longer).
Thank you for reporting this issue to the libvirt project. Unfortunately we have been unable to resolve this issue due to insufficient maintainer capacity and it will now be closed. This is not a reflection on the possible validity of the issue, merely the lack of resources to investigate and address it, for which we apologise.

If you nonetheless feel the issue is still important, you may choose to report it again at the new project issue tracker: https://gitlab.com/libvirt/libvirt/-/issues

The project also welcomes contributions from anyone who believes they can provide a solution.