Bug 789151 - qemu guests with bridge interfaces fail pxe/dhcp due to STP forwarding delay
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Laine Stump
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-02-09 23:21 UTC by Jeff Thomas
Modified: 2020-11-03 16:34 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-03 16:34:25 UTC



Description Jeff Thomas 2012-02-09 23:21:44 UTC
Description of problem:

On a host with multiple physical network interfaces connecting to multiple switches, STP must run on a bridge interface to prevent network loops. Virtual guests connecting to that bridge then encounter a forwarding delay when their virtual interface is created. That delay is likely to be longer than the PXE/DHCP wait time in the gPXE BIOS, so guests fail when they attempt to boot from the network.

We need one of the following: a configurable startup delay between interface creation and guest startup; code in the guest virtual network interface setup that queries the bridge for the STP forwarding delay and automatically waits until the interface enters the forwarding state before allowing guest startup to continue; or a configurable delay in the gPXE DHCP client so that it waits long enough to get an address after the new interface enters the forwarding state.
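The second option above (query the bridge, then wait for the forwarding state) could be sketched roughly as follows in Python. This is only an illustration, not libvirt code: the function names are hypothetical, and it assumes the host kernel exposes the bridge forward delay in hundredths of a second under /sys/class/net/<bridge>/bridge/forward_delay and the port state under /sys/class/net/<port>/brport/state, with 3 meaning forwarding.

```python
import time
from pathlib import Path

FORWARDING = 3  # assumed kernel bridge port state value for "forwarding"


def forward_delay_seconds(bridge, sysfs=Path("/sys/class/net")):
    """Return the bridge forward delay in seconds (sysfs value assumed to be 1/100 s)."""
    raw = (sysfs / bridge / "bridge" / "forward_delay").read_text()
    return int(raw) / 100.0


def wait_until_forwarding(port, bridge, sysfs=Path("/sys/class/net"),
                          poll=0.5, sleep=time.sleep, clock=time.monotonic):
    """Poll the port state until it is forwarding.

    Gives up after twice the forward delay (listening + learning) plus
    5 seconds of slack, and returns False in that case.
    """
    deadline = clock() + 2 * forward_delay_seconds(bridge, sysfs) + 5
    state_file = sysfs / port / "brport" / "state"
    while clock() < deadline:
        if int(state_file.read_text()) == FORWARDING:
            return True
        sleep(poll)
    return False
```

A caller would invoke something like wait_until_forwarding("vnet3", "intbr0") after attaching the tap device and before starting the guest CPUs. The sysfs root, sleep function, and clock are parameters only so the logic can be exercised without a real bridge.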


Version-Release number of selected component (if applicable):

All of RHEL 6 appears to be affected. Tested on 6.2 with libvirt-0.9.4-23.el6_2.4.x86_64, kernel-2.6.32-220.4.1.el6.x86_64


How reproducible:

100%

Steps to Reproduce:
1. Connect a host to 2 network switches via 2 physical interfaces, configure one bridge interface to connect those 2 physical interfaces. Do NOT disable STP on the bridge interface.  Use 'brctl showstp <bridgename>' to observe the forwarding delay and other parameters learned from the root bridge.

2. Create a virtual guest connecting via the host bridge interface

3. Start the guest, observe DHCP timeout in guest. /var/log/messages will show the 'vnet' interface going through 'listening', 'learning', and 'forwarding' states. The transition to 'forwarding' state happens after the guest DHCP timeout.
  
Actual results:

Guest fails gPXE DHCP

Expected results:

Guest should get a DHCP response and boot.

Additional info:

# brctl show
bridge name	bridge id		STP enabled	interfaces
intbr0		8000.441ea103e078	yes		eth0
							eth2
							vnet3

# brctl showstp intbr0
intbr0
 bridge id		8000.441ea103e078
 designated root	60d8.7081058bdac0
 root port		   1			path cost		   4
 max age		  19.99			bridge max age		  19.99
 hello time		   1.99			bridge hello time	   1.99
 forward delay		  14.99			bridge forward delay	  14.99
 ageing time		 299.95
 hello timer		   0.00			tcn timer		   0.00
 topology change timer	   0.00			gc timer		   9.68
 hash elasticity	   4			hash max		 512
 mc last member count	   2			mc init query count	   2
 mc router		   1			mc snooping		   1
 mc last member timer	   0.99			mc membership timer	 259.96
 mc querier timer	 254.96			mc query interval	 124.98
 mc response interval	   9.99			mc init query interval	  31.24
 flags			


# /var/log/messages after "virsh create Guest"
Feb  9 13:35:30 dmzsrv4 kernel: device vnet3 entered promiscuous mode
Feb  9 13:35:30 dmzsrv4 kernel: intbr0: port 4(vnet3) entering listening state
Feb  9 13:35:45 dmzsrv4 kernel: intbr0: port 4(vnet3) entering learning state
Feb  9 13:36:00 dmzsrv4 kernel: intbr0: topology change detected, sending tcn bpdu
Feb  9 13:36:00 dmzsrv4 kernel: intbr0: port 4(vnet3) entering forwarding state
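
The log above shows the port spending one forward delay (about 15 seconds) in listening and another in learning, so the guest has no connectivity for roughly twice the forward delay. A small Python sketch that derives this from the three state-change lines (the helper name state_times is made up for illustration):

```python
from datetime import datetime

# The three port state transitions copied from the log above.
LOG = """\
Feb  9 13:35:30 dmzsrv4 kernel: intbr0: port 4(vnet3) entering listening state
Feb  9 13:35:45 dmzsrv4 kernel: intbr0: port 4(vnet3) entering learning state
Feb  9 13:36:00 dmzsrv4 kernel: intbr0: port 4(vnet3) entering forwarding state
"""


def state_times(log, port):
    """Map each STP state name to the time of day the port entered it."""
    times = {}
    for line in log.splitlines():
        if f"({port}) entering" not in line:
            continue
        fields = line.split()
        # fields[2] is the HH:MM:SS stamp, fields[-2] is the state name.
        times[fields[-2]] = datetime.strptime(fields[2], "%H:%M:%S")
    return times


times = state_times(LOG, "vnet3")
blackout = (times["forwarding"] - times["listening"]).total_seconds()
print(blackout)  # 30.0, i.e. about 2 x the 14.99 s forward delay
```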



# interface config from the Guest.xml file:
    <interface type='bridge'>
      <mac address='52:54:00:84:94:04'/>
      <source bridge='intbr0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>

Comment 1 Jeff Thomas 2012-02-09 23:23:51 UTC
See also Fedora bug 586324

Comment 3 Laine Stump 2012-02-10 18:28:57 UTC
As I understand it, setting the STP delay to 0 in your case isn't acceptable, because the bridge is connected to multiple physical devices and could therefore create a loop.

It really is unfortunate that the STP delay is only configurable for the bridge as a whole, and can't be set independently for each interface attached to a bridge (and beyond that, as you point out in bug 586324, the delay setting from the root bridge on the network will eventually override any local config anyway) - if individual interfaces could be attached to the bridge with stp=no or delay=0, our problem would be solved.

A bit of brainstorming before we resign ourselves to building in a delay to guest startup:

Since a virt guest will obviously not create a loop (unless the guest itself also contains a bridge device that is in turn connected to multiple physical networks), it would be no problem if the delay was 0 when a guest's interface is attached - the time when a non-0 delay is needed is when the original physical interfaces are attached to the bridge.

Since a delay of up to 30 seconds would be fairly common, and a very long time to wait for a new guest to start, maybe we should instead look for a solution that attaches some other bridgelike device to the bridge just once at libvirtd startup (suffering the delay then), and connects the guests to that device rather than directly to the bridge. Currently Linux host bridges cannot be directly connected to each other, but possibly that could be remedied?

Or, a more efficient solution would be for the bridge device to allow turning STP off on a port-by-port basis (as mentioned above); the vnet interfaces could then be attached sans-STP, and immediately begin talking.

Yet another possibility - with the <network> types added in 0.9.4 which allow describing a libvirt network based on a host bridge, we now have the possibility of setting up a pool of pre-allocated tap devices for use by guests. These tap devices could be created and attached to the bridge when the libvirt network is started, then as each guest (configured with <interface type='network'>) started it would request a tap device from the network's pool of pre-created taps, rather than creating a new one. Since that tap would have already been connected to the bridge for some time, there would be no delay before it was usable.
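
To make the pre-allocated tap pool idea concrete, here is a rough Python model of the bookkeeping only. Everything in it is hypothetical (the TapPool class, its method names, and the tap names); it does not create real tap devices or reflect any existing libvirt API. It simply refuses to hand out a tap until twice the forward delay has passed since the tap was attached to the bridge.

```python
import time
from collections import deque


class TapPool:
    """Hypothetical pool of tap devices attached to a bridge ahead of time."""

    def __init__(self, names, forward_delay, clock=time.monotonic):
        self._clock = clock
        # A port spends forward_delay in listening and again in learning.
        self._ready_after = 2 * forward_delay
        # Record when each tap was (notionally) attached to the bridge.
        self._attached_at = {name: clock() for name in names}
        self._free = deque(names)

    def acquire(self):
        """Return a tap that should already be forwarding, or None."""
        for _ in range(len(self._free)):
            name = self._free.popleft()
            if self._clock() - self._attached_at[name] >= self._ready_after:
                return name
            self._free.append(name)  # not forwarding yet; keep it pooled
        return None

    def release(self, name):
        """Return a tap to the pool; it stays attached, so it stays ready."""
        self._free.append(name)
```

With a forward delay of 15 seconds, a pool created at network start would return None from acquire() for the first 30 seconds and hand out taps immediately after that, which is the behaviour described above: the STP delay is paid once, when the network starts, rather than at every guest start.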

==

In the end, a policy of waiting to start the guest until $stp_delay seconds have elapsed since attaching the virtual interface is definitely a possibility if no other option proves usable. However, it will not only make guest startup painfully slow, it will also end up defeating STP in those rare cases where it would actually have been useful (i.e. a guest that has its own internal bridge, connected in some manner to multiple physical network segments). STP's attempts to learn the new network topology during this waiting period will be thwarted by the fact that no traffic is flowing across this new network; then the waiting period will end, the guest will boot, and it will immediately start forwarding traffic on its internal bridge.

Comment 4 Jeff Thomas 2012-02-11 23:00:50 UTC
Yeah, I agree the long delay is unfortunate, but it only applies in the case of a true bridge, so people using the system in that mode will probably understand the need.

If nothing else, if the DHCP timeout in the PXE BIOS were longer, or if it looped trying to find boot devices like most server BIOSes do, then it would eventually get its DHCP address and get on with booting.

Comment 5 Laine Stump 2012-05-31 15:38:04 UTC
Jeff,

openvswitch support was added to the latest release of Fedora (17), so it's going to end up landing in RHEL. openvswitch supports disabling/enabling STP on a per-port basis. Given the same level of support in libvirt for openvswitch as currently exists for standard linux host bridges, along with the ability to individually (from libvirt's XML config) enable/disable STP for attached guest interfaces, would that be an acceptable solution? (In other words, would you be willing to switch from using standard host bridges to openvswitch in order to solve this problem?)

It seems like the alternative would be to convince the kernel / bridge-utils people to add per-port configuration of STP to the standard host bridge, which would likely be a longer process.

(I've thought more about the idea of adding a 'start delay' as discussed above, and this has the problem that it doesn't solve the problem when migrating from one host to another - the already-running guest would suffer a network outage until the bridge on the destination host had gotten past the STP delay timeout; very definitely a non-starter).

Comment 6 Jeff Thomas 2012-05-31 15:55:21 UTC
I have no problem with switching to openvswitch or any other alternative as long as it's stable and performs correctly.

However, I'm not sure I agree with the migration issue being a non-starter. There is some downtime during the migration anyway, and the bridge delay in this case may even make more sense since you are migrating a mac address and potentially a topology change as a result of the migration. I am 100% ok with the 30-second (or whatever delay is set in the root bridge) outage. I haven't looked in detail at how the migration process works step by step, but if the guest is created on the new host, including its new bridge interface, then the RAM image is copied, the bridge delay may actually be hidden during the RAM copy time and there would be little if any effect from the bridge delay.

At the end of the day it seems like there is at least a simple workaround, if not a complete fix, by simply making gpxe loop on its attempt to pxe-boot if there is no other boot device, as most servers do. I think there should be a limit to the number of loops so that the host doesn't end up burning cpu time on a worthless guest, but that limit could be set to something like 10 or 100 to give the network a ton of time to converge without impacting the host.
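
The bounded retry loop suggested above can be modelled in a few lines of Python. This is only a simulation of the control flow, not gPXE code: boot_with_retries and simulated_dhcp are hypothetical names, and the 12-second per-attempt timeout is an arbitrary placeholder, not gPXE's real DHCP timeout.

```python
def boot_with_retries(try_dhcp, max_attempts=10):
    """Call try_dhcp() until it returns a lease or max_attempts is reached.

    Returns (attempts_used, lease); lease is None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        lease = try_dhcp()
        if lease is not None:
            return attempt, lease
    return max_attempts, None


def simulated_dhcp(forwarding_at, attempt_timeout):
    """Build a fake DHCP attempt that only succeeds once the port forwards.

    forwarding_at: seconds after guest start when the port enters forwarding.
    attempt_timeout: seconds each DHCP attempt waits (placeholder value).
    """
    elapsed = [0.0]

    def try_dhcp():
        elapsed[0] += attempt_timeout
        return "lease" if elapsed[0] > forwarding_at else None

    return try_dhcp


# Port forwards after 30 s (as in the log in the description); each attempt waits 12 s.
print(boot_with_retries(simulated_dhcp(30, 12)))  # (3, 'lease')
```

A single attempt fails in this model, while a small retry limit is enough to outlast the 30-second STP delay; the limit keeps a guest with no reachable boot server from looping forever.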

Comment 8 Cole Robinson 2016-03-23 21:27:27 UTC
Laine, any thoughts on the current state of this? Is the answer just 'use openvswitch'?

Comment 9 Laine Stump 2016-03-24 01:52:31 UTC
I don't think anything has changed in this area, and likely the best (least bad?) answer isn't even "use openvswitch"; is the guest side of the PXE boot code in seabios? If so, maybe we could get a patch into that to make the timeout configurable (or at least longer).

Comment 10 Daniel Berrangé 2020-11-03 16:34:25 UTC
Thank you for reporting this issue to the libvirt project. Unfortunately we have been unable to resolve this issue due to insufficient maintainer capacity and it will now be closed. This is not a reflection on the possible validity of the issue, merely the lack of resources to investigate and address it, for which we apologise. If you nonetheless feel the issue is still important, you may choose to report it again at the new project issue tracker (https://gitlab.com/libvirt/libvirt/-/issues). The project also welcomes contributions from anyone who believes they can provide a solution.

