Bug 1279285

Summary: Unable to start virbr0 "forward_delay: Numerical result out of range" in kernel 4.4.0-0.rc0.git4.2.fc24.x86_64
Product: [Fedora] Fedora Reporter: York Possemiers <ypossem>
Component: kernelAssignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: unspecified    
Version: rawhideCC: agedosier, atodorov, berrange, clalancette, gansalmon, itamar, jfilak, jforbes, jonathan, jtomko, kernel-maint, laine, libvirt-maint, madhu.chinakonda, mchehab, veillard, virt-maint, vyasevic
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-11-17 23:24:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Python trace upon attempting to start bridge none

Description York Possemiers 2015-11-09 03:08:00 UTC
Description of problem:
Unable to start bridges made with libvirt in this new kernel

Version-Release number of selected component (if applicable):
virt-manager 1.2.1
libvirtd (libvirt) 1.2.20
kernel 4.4.0-0.rc0.git4.2.fc24.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Boot
2. Open Virtual Networks in virt-manager
3. Attempt to start bridge virbr0 (the default as installed)

Actual results:
Error starting network 'default': Unable to set bridge virbr0 forward_delay: Numerical result out of range

Expected results:
bridge is started on boot

Additional Information:
Above libvirt versions work in kernel 4.3.0-0.rc7.git0.1.fc24.x86_64

Comment 1 York Possemiers 2015-11-09 03:09:23 UTC
Created attachment 1091510 [details]
Python trace upon attempting to start bridge

Comment 2 York Possemiers 2015-11-09 03:10:14 UTC
Does not seem to affect other system bridges made with real interfaces

Comment 3 Laine Stump 2015-11-09 15:36:36 UTC
If memory serves me correctly, it has always been the case that the forward_delay can't be set to 0 after a bridge has had STP turned on. This is why libvirt sets the forward delay before enabling STP. There are two possibilities:

1) maybe the bridge module previously created bridges with STP off, and now creates them with STP on, so that when libvirt tries to set the delay to 0, it fails. (This seems unlikely, since the kernel people usually try to avoid changing defaults)

2) maybe the bridge module previously bypassed the range check for forward_delay when STP was off, and now it checks whether STP is off or on.

I just tried manually setting forward_delay on a bridge with STP on and off (on F22) and found that while it is possible to set it to 0 with STP off, as soon as STP is turned on, it is changed to "200" (that value is in "jiffies", which afaiu is msec/10, so this is equivalent to 2 seconds).

To get your network working again, you should be able to just edit it and change "delay='0'" to "delay='2'".

We do need to think about how to prevent this from failing without manual intervention though - simply changing libvirt to setting delay to a minumum of 2 seconds won't do any good (since there are so many existing networks that have it set to 0), so I guess we need to modify the function that sets forward_delay to silently increase the setting to 2 seconds. I dislike lying to users about what is happening, but then it seems we've already been lying to them for a very long time anyway :-/

Comment 4 Ján Tomko 2015-11-09 15:49:12 UTC
It looks like 2) is the case:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=34c2d9f

commit 34c2d9fb0498c066afbe610b15e18995fd8be792
Author:     Ian Wilson <iwilson>
AuthorDate: 2015-09-24 11:20:11 -0700
Commit:     David S. Miller <davem>
CommitDate: 2015-09-27 19:09:38 -0700

    bridge: Allow forward delay to be cfgd when STP enabled
    
    Allow bridge forward delay to be configured when Spanning Tree is enabled.
    
    Signed-off-by: Ian Wilson <iwilson>
    Signed-off-by: Stephen Hemminger <stephen>
    Signed-off-by: David S. Miller <davem>

Comment 5 Laine Stump 2015-11-09 17:32:22 UTC
Looking at that patch, in my opinion the comment doesn't match the code. As far as I can see, it doesn't "allow the forward delay to be configured when STP is enabled" (which was already possible without this patch); the behavior change is instead to force a range check on the desired value when STP is *disabled*. So there is nothing about that comment that is correct.

And now that I've looked/thought some more, I don't think libvirt should have to include any "magic number" for limiting the delay. The bridge module previously allowed any value to be set for forward_delay as long as STP was disabled, and would then clamp it into its internally/privately defined allowable range when STP was turned on (from net/bridge/br_stp_if.c):

    static void br_stp_start(struct net_bridge *br)
     ...
         if (br->bridge_forward_delay < BR_MIN_FORWARD_DELAY)
                 __br_set_forward_delay(br, BR_MIN_FORWARD_DELAY);
         else if (br->bridge_forward_delay > BR_MAX_FORWARD_DELAY)
                 __br_set_forward_delay(br, BR_MAX_FORWARD_DELAY);

Since BR_(MIN|MAX)_FORWARD_DELAY are defined privately by the kernel (in net/bridge/br_private_stp.h), the only way for libvirt or any other userspace program to assure that the value it wants to set is within limits was previously:

1) set the value with STP disabled, then enable STP and let the bridge module clamp the value to something acceptable

or

2) read the kernel source and copy/paste the min/max from there into the userspace code.

With the kernel change Jan pointed out in Comment 4, the only possibility left is (2), and I don't think that's acceptable. In my opinion, the kernel should revert that patch, since it is modifying the semantics of the setting of forward_delay, and will likely cause regressions in other places (and it doesn't even accomplish its advertised intent anyway.)

Comment 6 Laine Stump 2015-11-10 16:01:02 UTC
Vlad has posted a revert to the offending patch to netdev.org:

  http://lists.openwall.net/netdev/2015/11/10/20

Hopefully it will be pushed into the upstream kernel and make it back down to rawhide quickly. In the meantime, rawhide users should use "virsh net-edit default" to change "delay='0'" to "delay='2'" (and do the same for other libvirt networks that aren't starting).

I'm moving this BZ to kernel, since that is where the problem is / fix will be.

Comment 7 York Possemiers 2015-11-17 23:24:33 UTC
Looks like the update to 4.4.0-0.rc1.git0.1 has fixed this

Comment 8 Cole Robinson 2015-11-18 14:22:00 UTC
*** Bug 1283241 has been marked as a duplicate of this bug. ***