Bug 1162822

Summary: ifup of bridge with STP=on fails (even when DELAY=0)
Product: [Fedora] Fedora
Component: initscripts
Version: rawhide
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Laine Stump <laine>
Assignee: Lukáš Nykrýn <lnykryn>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: jonathan, lnykryn, vpavlin, zbyszek
Fixed In Version: initscripts-9.51-3.fc20
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-11-12 13:23:03 UTC

Attachments:
patch against current upstream git of initscripts (flags: none)

Description Laine Stump 2014-11-11 18:58:35 UTC
With NetworkManager (NM) disabled and the network service enabled, the following standard bridge configuration fails ifup every time:

[root@localhost 1105012]# cat /etc/sysconfig/network-scripts/ifcfg-br0
DEVICE=br0
ONBOOT=no
HOTPLUG=yes
TYPE=Bridge
BOOTPROTO=dhcp
STP=on
DELAY=0
[root@localhost 1105012]# cat /etc/sysconfig/network-scripts/ifcfg-enp2s0 
DEVICE=enp2s0
HWADDR=00:11:22:33:44:55
ONBOOT=no
HOTPLUG=yes
BRIDGE=br0

# ifup enp2s0; ifup br0;

Determining IP information for br0... failed; no link present.  Check cable?


Since this was originally reported with respect to libvirt's "virsh
iface-start" command (which calls a function in the netcf library), I
at first thought there might be a problem with the order in which netcf
was ifup'ing the interfaces. In a discussion somewhere I'd seen
someone mention that they were ifup'ing the bridge first, then the
ethernets, which is the opposite of what netcf does.

But manual experimentation shows that netcf is doing it in the correct
order, and (as was suggested by someone triaging the original bug
report) adding a sufficiently large LINKDELAY to ifcfg-br0 does solve
the problem. However, we should not require every existing
installation with a bridge device and STP enabled to modify their
config. Instead, initscripts' ifup should properly account for this
needed delay when it notices that STP is enabled.
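For reference, the LINKDELAY workaround mentioned above might look like the following ifcfg-br0. The value 10 is an illustrative assumption, not a number from the original bug report; it is simply chosen to exceed the roughly 6.5 seconds that the measurements later in this report show for DELAY=0.

```shell
# /etc/sysconfig/network-scripts/ifcfg-br0 (workaround sketch)
DEVICE=br0
ONBOOT=no
HOTPLUG=yes
TYPE=Bridge
BOOTPROTO=dhcp
STP=on
DELAY=0
# Assumed value: anything comfortably above DELAY * 2 + 6.5 seconds
LINKDELAY=10
```

A larger DELAY would need a correspondingly larger LINKDELAY, which is exactly the per-installation tuning this bug argues should not be necessary.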

For the record, here is the sequence of events that leads to the problem:

1) "ifup $ether" calls /etc/sysconfig/network-scripts/ifup-eth; it
does this:

  1a) auto-create the $bridge *with an implicit 0 forward delay* but
      still "down".
  1b) "ip link set dev $ether up"
  1c) sleep for $LINKDELAY seconds (as set in the ifcfg-$ether, NOT
      the ifcfg-$bridge)
  1d) brctl addif -- $bridge $ether

(at this point if you look at "brctl showstp $bridge" you'll see that
the $ether port is in "disabled" state)

2) "ifup $bridge" - this again ends up in
/etc/sysconfig/network-scripts/ifup-eth, which:

  2a) doesn't create the bridge device, because it was already
      auto-created in step (1a).
  2b) sets a forward delay and other bridge options according to
      ifcfg-$bridge

  2c) *IF* the device has "BOOTPROTO=dhcp", it goes into a loop
      waiting for up to LINKDELAY seconds until
      /sys/class/net/$bridge/carrier contains "1" rather than "0".
      (NB: this will happen as soon as at least one device attached to
      the bridge is in "forwarding" state.)
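The sequence above can be sketched as a small dry-run script. This is not the actual ifup-eth code: run(), ifup_bridge_port() and wait_for_carrier() are hypothetical names introduced here for illustration, and the half-second polling interval is an assumption. run() only prints each command, so the sketch is safe to execute without root.

```shell
#!/bin/sh
# Dry-run sketch of the two ifup calls described above.
# run() is a hypothetical helper that prints a command instead of running it.
run() { echo "+ $*"; }

# Step 1: "ifup $ether" for a port whose ifcfg file sets BRIDGE=.
ifup_bridge_port() {
    ether=$1; bridge=$2; linkdelay=${3:-0}
    run brctl addbr "$bridge"               # step 1a: auto-create the bridge
    run ip link set dev "$ether" up         # bring the port itself up
    run sleep "$linkdelay"                  # LINKDELAY from ifcfg-$ether
    run brctl addif -- "$bridge" "$ether"   # attach the port to the bridge
}

# Step 2c: poll a carrier file until it reads "1" or the timeout (in whole
# seconds) expires. Read errors are treated as "no carrier".
# Note: fractional "sleep 0.5" works with GNU coreutils and busybox, but is
# not guaranteed by POSIX.
wait_for_carrier() {
    carrier_file=$1
    ticks=$(( $2 * 2 ))                     # two polls per second (assumed)
    while [ "$ticks" -gt 0 ]; do
        if [ "$(cat "$carrier_file" 2>/dev/null)" = "1" ]; then
            return 0
        fi
        sleep 0.5
        ticks=$(( ticks - 1 ))
    done
    [ "$(cat "$carrier_file" 2>/dev/null)" = "1" ]
}

ifup_bridge_port enp2s0 br0
```

On a real system the wait in step 2c would correspond to something like `wait_for_carrier /sys/class/net/br0/carrier 5`, and a non-zero return is what produces the "failed; no link present" message.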

Experimentation shows that when STP is enabled on the bridge, carrier
in step 2c takes *at least* $DELAY * 2 + 5 seconds to come up, and
sometimes as much as $DELAY * 2 + 6.5 seconds. But when no LINKDELAY is
set, check_link_down() only waits for 5 seconds, so it will *always*
fail. (This happens regardless of how much time passes between the
first and second ifup invocations. Also note that doing the ifups in
the opposite order would also always fail, since carrier would *never*
go up on the bridge device if it had nothing attached.)
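To make the arithmetic concrete, here is a minimal sketch of how a wait long enough for STP could be derived from the measurements above. It is not the attached patch: min_carrier_wait() and effective_linkdelay() are hypothetical names, rounding the observed 6.5 seconds up to 7 is an assumption, and DELAY is assumed to be a whole number of seconds. The default of 5 seconds is the check_link_down() wait described above.

```shell
#!/bin/sh
# Worst case observed: DELAY * 2 + 6.5 seconds, rounded up to whole
# seconds here (assumption).
min_carrier_wait() {
    echo $(( ${1:-0} * 2 + 7 ))
}

# Hypothetical helper: choose how long to wait for carrier, given STP,
# DELAY and an optional LINKDELAY (defaulting to the 5-second wait).
effective_linkdelay() {
    stp=$1; delay=${2:-0}; linkdelay=${3:-5}
    if [ "$stp" = "on" ]; then
        need=$(min_carrier_wait "$delay")
        if [ "$linkdelay" -lt "$need" ]; then
            linkdelay=$need
        fi
    fi
    echo "$linkdelay"
}

effective_linkdelay on 0       # prints 7
effective_linkdelay on 15      # prints 37
effective_linkdelay off 15     # prints 5
effective_linkdelay on 0 30    # prints 30
```

An explicitly configured LINKDELAY that is already long enough is left alone; only a value too short for STP is raised.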

Since I'm fairly certain that people have been configuring bridges
with a non-0 DELAY for many years and haven't previously encountered
this problem, I would class this as a regression in the behavior of
ifup that must be resolved.

Comment 1 Laine Stump 2014-11-11 19:03:15 UTC
Created attachment 956390 [details]
patch against current upstream git of initscripts

This patch causes ifup to wait at least as long as the delay measured above for carrier on a bridge device when STP is enabled. With it, every test I've tried with differing values of STP, DELAY, and LINKDELAY succeeds.

Note that although I filed this BZ against rawhide, the problem exists at least as far back as F20, as well as in RHEL7 and CentOS7 (I haven't checked RHEL6, but think that it *isn't* a problem there) so it should be backported to all of those releases.

Comment 2 Fedora Update System 2014-11-12 13:05:03 UTC
initscripts-9.56.1-4.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/initscripts-9.56.1-4.fc21

Comment 3 Fedora Update System 2014-11-12 13:15:49 UTC
initscripts-9.51-3.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/initscripts-9.51-3.fc20

Comment 5 Fedora Update System 2014-11-16 14:45:33 UTC
initscripts-9.56.1-4.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 6 Fedora Update System 2014-11-20 23:02:21 UTC
initscripts-9.51-3.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.