Bug 1828604 - Bonding not failing over in mode=1 under 2.6.32-754.28.1 (...27.1 works OK)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.10
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Denys Vlasenko
QA Contact: LiLiang
URL:
Whiteboard:
Duplicates: 1836173
Depends On:
Blocks: 1840972
 
Reported: 2020-04-27 21:40 UTC by Madison Kelly
Modified: 2020-08-31 05:59 UTC (History)

Fixed In Version: kernel-2.6.32-754.30.1.el6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-09 20:53:34 UTC
Target Upstream Version:
ptalbert: needinfo-


Attachments
bond state (576 bytes, text/plain), 2020-04-30 17:33 UTC, Madison Kelly
dmesg output (426.40 KB, text/plain), 2020-04-30 17:34 UTC, Madison Kelly


Links
System | ID | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 5103601 | None | None | None | 2020-05-25 22:57:58 UTC
Red Hat Product Errata | RHSA-2020:2430 | None | None | None | 2020-06-09 20:53:50 UTC

Description Madison Kelly 2020-04-27 21:40:08 UTC
Description of problem:

Summary: 

The bonding driver doesn't fail over when a link drops on mode=1 (active-backup) bonds under kernel-2.6.32-754.28.1.el6.x86_64; failover works under kernel-2.6.32-754.27.1.el6.x86_64.

Full:

With a two-interface active-backup bond, issuing 'ifdown <link1>' works: the backup link takes over. However, if you unplug a cable, /proc/net/bonding/<bond> shows the active interface as 'down', yet it remains the in-use interface, so traffic over the bond fails.

Configuration:

====
[root@an-a02n02 ~]# cat /etc/sysconfig/network-scripts/ifcfg-sn_link1 
# Generated by: [InstallManifest.pm] on: [2020-03-24, 19:33:15].
# Storage Network - Link 1
DEVICE="sn_link1"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
SLAVE="yes"
MASTER="sn_bond1"

[root@an-a02n02 ~]# cat /etc/sysconfig/network-scripts/ifcfg-sn_link2
# Generated by: [InstallManifest.pm] on: [2020-03-24, 19:33:15].
# Storage Network - Link 2
DEVICE="sn_link2"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
SLAVE="yes"
MASTER="sn_bond1"

[root@an-a02n02 ~]# cat /etc/sysconfig/network-scripts/ifcfg-sn_bond1 
# Generated by: [InstallManifest.pm] on: [2020-03-24, 19:33:15].
# Storage Network - Bond 1
DEVICE="sn_bond1"
BOOTPROTO="static"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=sn_link1 primary_reselect=always"
IPADDR="10.10.20.2"
NETMASK="255.255.0.0"
DEFROUTE="no"
====
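
For readers less familiar with the option string, here is a minimal Python sketch that splits the BONDING_OPTS value above into its individual options. The helper name parse_bonding_opts is hypothetical (it is not part of any Red Hat tooling); the only thing it illustrates is that updelay and downdelay are millisecond values interpreted relative to the miimon polling interval.

```python
def parse_bonding_opts(opts: str) -> dict:
    """Split a BONDING_OPTS string into a {name: value} dict (values stay strings)."""
    parsed = {}
    for token in opts.split():
        name, _, value = token.partition("=")
        parsed[name] = value
    return parsed


# The option string from ifcfg-sn_bond1 above
opts = parse_bonding_opts(
    "mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 "
    "primary=sn_link1 primary_reselect=always"
)

# updelay is in milliseconds, like miimon, so this is the number of
# polling intervals a recovered link waits before it is used again.
updelay_polls = int(opts["updelay"]) // int(opts["miimon"])
print(opts["mode"], opts["primary"], updelay_polls)
```

With downdelay=0, a lost link should be disabled on the first poll that sees it down, which is why the failover is expected to be immediate here.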

-=] Under 2.6.32-754.27.1.el6.x86_64 [=-

/proc/net/bonding/sn_bond1 pre-failure:

====
[root@an-a02n02 ~]# cat /proc/net/bonding/sn_bond1 
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: sn_link1 (primary_reselect always)
Currently Active Slave: sn_link1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 120000
Down Delay (ms): 0

Slave Interface: sn_link1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b4:96:91:4f:10:15
Slave queue ID: 0

Slave Interface: sn_link2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b4:96:91:4f:10:14
Slave queue ID: 0
====

/var/log/messages failing the sn_link1:

====
Apr 27 17:22:01 an-a02n02 kernel: ixgbe 0000:05:00.1: sn_link1: NIC Link is Down
Apr 27 17:22:01 an-a02n02 kernel: sn_bond1: link status definitely down for interface sn_link1, disabling it
Apr 27 17:22:01 an-a02n02 kernel: sn_bond1: making interface sn_link2 the new active one
====

/proc/net/bonding/sn_bond1 post-failure:

====
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: sn_link1 (primary_reselect always)
Currently Active Slave: sn_link2
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 120000
Down Delay (ms): 0

Slave Interface: sn_link1
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 1
Permanent HW addr: b4:96:91:4f:10:15
Slave queue ID: 0

Slave Interface: sn_link2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b4:96:91:4f:10:14
Slave queue ID: 0
====

Worked fine.

-=] Under 2.6.32-754.28.1.el6.x86_64 [=-

/proc/net/bonding/sn_bond1 pre-failure:

====
[root@an-a02n02 ~]# cat /proc/net/bonding/sn_bond1 
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: sn_link1 (primary_reselect always)
Currently Active Slave: sn_link1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 120000
Down Delay (ms): 0

Slave Interface: sn_link1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b4:96:91:4f:10:15
Slave queue ID: 0

Slave Interface: sn_link2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b4:96:91:4f:10:14
Slave queue ID: 0
====

/var/log/messages failing the sn_link1 (just the one line...):

====
Apr 27 17:32:08 an-a02n02 kernel: ixgbe 0000:05:00.1: sn_link1: NIC Link is Down
====
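
The difference between the two log excerpts can be checked mechanically. Below is a rough Python sketch, assuming the message wording shown in this report (the function name unhandled_link_drops and both regular expressions are my own, and \w+ will not match interface names containing dots or dashes). It lists interfaces for which the NIC reported link down but bonding never logged "link status definitely down".

```python
import re

# Patterns based on the log lines quoted in this report (assumed wording)
NIC_DOWN = re.compile(r"(\w+):? NIC Link is Down")
BOND_DOWN = re.compile(r"link status definitely down for interface (\w+)")


def unhandled_link_drops(log_lines):
    """Return interfaces whose link drop was never acted on by bonding."""
    dropped, handled = [], set()
    for line in log_lines:
        match = NIC_DOWN.search(line)
        if match and match.group(1) not in dropped:
            dropped.append(match.group(1))
        match = BOND_DOWN.search(line)
        if match:
            handled.add(match.group(1))
    return [name for name in dropped if name not in handled]


good_log = [
    "Apr 27 17:22:01 an-a02n02 kernel: ixgbe 0000:05:00.1: sn_link1: NIC Link is Down",
    "Apr 27 17:22:01 an-a02n02 kernel: sn_bond1: link status definitely down for interface sn_link1, disabling it",
    "Apr 27 17:22:01 an-a02n02 kernel: sn_bond1: making interface sn_link2 the new active one",
]
bad_log = [
    "Apr 27 17:32:08 an-a02n02 kernel: ixgbe 0000:05:00.1: sn_link1: NIC Link is Down",
]

print(unhandled_link_drops(good_log))
print(unhandled_link_drops(bad_log))
```

The .27 excerpt yields an empty list, while the .28 excerpt reports sn_link1.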

/proc/net/bonding/sn_bond1 post-failure:

====
[root@an-a02n02 ~]# cat /proc/net/bonding/sn_bond1 
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: sn_link1 (primary_reselect always)
Currently Active Slave: sn_link1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 120000
Down Delay (ms): 0

Slave Interface: sn_link1
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: b4:96:91:4f:10:15
Slave queue ID: 0

Slave Interface: sn_link2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b4:96:91:4f:10:14
Slave queue ID: 0
====
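
The tell-tale sign in the output above is that the "Currently Active Slave" has "MII Status: down". A minimal Python sketch for spotting that state follows; it assumes the /proc/net/bonding/<bond> layout shown in this report, and the function names are hypothetical.

```python
def parse_bond_state(text: str) -> dict:
    """Parse /proc/net/bonding/<bond> text into the active slave and per-slave fields."""
    active = None
    slaves = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines and anything without "key: value"
        key, value = key.strip(), value.strip()
        if key == "Currently Active Slave":
            active = value
        elif key == "Slave Interface":
            current = value
            slaves[current] = {}
        elif current is not None:
            slaves[current][key] = value  # fields belonging to the current slave
    return {"active": active, "slaves": slaves}


def active_slave_is_down(state: dict) -> bool:
    """True when the slave carrying traffic is reported as down (the bug's symptom)."""
    return state["slaves"].get(state["active"], {}).get("MII Status") == "down"


# Abbreviated copy of the post-failure output under the .28 kernel
BROKEN = """\
Currently Active Slave: sn_link1
MII Status: up

Slave Interface: sn_link1
MII Status: down
Link Failure Count: 0

Slave Interface: sn_link2
MII Status: up
Link Failure Count: 0
"""

state = parse_bond_state(BROKEN)
print(state["active"], active_slave_is_down(state))
```

On a live system the text would come from reading /proc/net/bonding/sn_bond1 instead of the inline sample.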

Version-Release number of selected component (if applicable):

See above


How reproducible:

100%


Steps to Reproduce:
1. Create bond as described above
2. Physically fail an interface (do not use 'ifdown')

Actual results:

The bond does not fail over; the downed link remains the active interface.


Expected results:

The bond fails over to the backup link on physical link failure.


Additional info:

The changelog for the .28 kernel package has a lot of bonding-related entries:

====
[root@an-a02n02 ~]# rpm -q --changelog kernel-2.6.32-754.28.1.el6.x86_64 | grep bond | wc -l
794
====

Comment 3 Madison Kelly 2020-04-27 21:46:15 UTC
This bug can be made public, there is no sensitive info in it.

Comment 4 Madison Kelly 2020-04-27 22:15:34 UTC
Adding Denys, as it looks like they were recently working on the bonding code for this kernel:

====
* Fri Jan 31 2020 Denys Vlasenko <dvlasenk@redhat.com> [2.6.32-754.28.1.el6]
- [netdrv] ixgbevf: Use cached link state instead of re-reading the value for ethtool (Ken Cox) [1795404]
- [isdn] mISDN: enforce CAP_NET_RAW for raw sockets (Andrea Claudi) [1779473] {CVE-2019-17055}
- [net] cfg80211: wext: avoid copying malformed SSIDs (Jarod Wilson) [1778625] {CVE-2019-17133}
- [netdrv] bonding: speed/duplex update at NETDEV_UP event (Patrick Talbert) [1772779]
- [netdrv] bonding: make speed, duplex setting consistent with link state (Patrick Talbert) [1772779]
- [netdrv] bonding: simplify / unify event handling code for 3ad mode (Patrick Talbert) [1772779]
- [netdrv] bonding: unify all places where actor-oper key needs to be updated (Patrick Talbert) [1772779]
- [netdrv] bonding: simple code refactor (Patrick Talbert) [1772779]
====
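
As far as I understand, 'rpm -q --changelog' prints the package's whole changelog history, so a plain 'grep bond | wc -l' counts bonding entries from every release, not only those new in .28. A small Python sketch that groups bonding entries by release instead is below; it assumes the "* date author [release]" header format seen above, and the function name is hypothetical.

```python
import re

# Header lines look like: "* Fri Jan 31 2020 Name <email> [2.6.32-754.28.1.el6]"
HEADER = re.compile(r"^\* .*\[(?P<release>[^\]]+)\]\s*$")


def bonding_entries_by_release(changelog: str) -> dict:
    """Map each release to the changelog entries that mention bonding."""
    result = {}
    release = None
    for line in changelog.splitlines():
        match = HEADER.match(line)
        if match:
            release = match.group("release")
        elif release and line.startswith("- ") and "bonding" in line:
            result.setdefault(release, []).append(line[2:])
    return result


SAMPLE = """\
* Fri Jan 31 2020 Denys Vlasenko <dvlasenk@redhat.com> [2.6.32-754.28.1.el6]
- [netdrv] ixgbevf: Use cached link state instead of re-reading the value for ethtool (Ken Cox) [1795404]
- [netdrv] bonding: speed/duplex update at NETDEV_UP event (Patrick Talbert) [1772779]
- [netdrv] bonding: make speed, duplex setting consistent with link state (Patrick Talbert) [1772779]
- [netdrv] bonding: simplify / unify event handling code for 3ad mode (Patrick Talbert) [1772779]
- [netdrv] bonding: unify all places where actor-oper key needs to be updated (Patrick Talbert) [1772779]
- [netdrv] bonding: simple code refactor (Patrick Talbert) [1772779]
"""

for release, entries in bonding_entries_by_release(SAMPLE).items():
    print(release, len(entries))
```

For the excerpt quoted in this comment, that narrows the list to the five bonding changes that landed in 2.6.32-754.28.1.el6.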

Comment 5 Madison Kelly 2020-04-27 22:18:46 UTC
Added Patrick as well, given he's referenced in the commit logs.

Comment 7 Patrick Talbert 2020-04-29 08:29:15 UTC
(In reply to digimer from comment #5)
> Added Patrick as well, given he's referenced in the commit logs.

Hello digimer,

Please follow these instructions to gather debug output from the bonding driver when this issue is reproduced:

1. At boot time in the grub menu modify the kernel command line by adding:

printk.time log_buf_len=8M bonding.dyndbg="+p"


2. Boot the system and then reproduce the issue.

3. Afterwards, capture the following output and share it with us:

# cp /proc/net/bonding/sn_bond1 bond.state
# dmesg > dmesg.log



Also, is there a case for this issue open with Red Hat Technical Support? Please share the case number with us.

Thank you,

Patrick

Comment 8 Madison Kelly 2020-04-30 17:33:19 UTC
Hello,

  I updated the OS this morning and saw the .29 kernel is out. I also confirmed that the issue remains on .29.

Attached are the two requested files.

Comment 9 Madison Kelly 2020-04-30 17:33:53 UTC
Created attachment 1683389 [details]
bond state

Comment 10 Madison Kelly 2020-04-30 17:34:37 UTC
Created attachment 1683390 [details]
dmesg output

Comment 12 Madison Kelly 2020-05-04 20:35:31 UTC
Is any more information needed to move this bug forward?

Comment 13 Madison Kelly 2020-05-05 18:18:43 UTC
Oh I forgot to mention, no, there's no open case. I'm in the ISV program and we found this while testing against the newer kernel(s).

Comment 17 Patrick Talbert 2020-05-06 15:35:24 UTC
Hello digimer,

Thank you for that data.

I have been able to reproduce the issue.

The RHEL6 backport of c4adfc822bf5 ("bonding: make speed, duplex setting consistent with link state") resulted in the bond_update_speed_duplex() function now setting slave->link = BOND_LINK_DOWN whenever the link speed/duplex is Not Good. Prior to this the function did not touch slave->link.

For RHEL 6, this results in a short circuit of the link-state-change switch/case check in bond_miimon_inspect(). It means the function will never set up any change for bond_miimon_commit() to commit.


This EL6-specific issue is therefore fixed by ensuring bond_update_speed_duplex() does not set slave->link.
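
To make the short circuit easier to follow, here is a deliberately simplified Python model of the interaction described above. It is not the kernel code: the function names only mirror bond_update_speed_duplex() and bond_miimon_inspect(), it assumes downdelay=0 as in the reporter's configuration, and it assumes that a NIC without carrier reports unusable speed/duplex.

```python
from dataclasses import dataclass

# Link states, named after the kernel's BOND_LINK_* constants
UP, FAIL, DOWN = "UP", "FAIL", "DOWN"


@dataclass
class Slave:
    name: str
    link: str = UP         # bonding's own view of the slave (slave->link)
    carrier: bool = True   # what the NIC currently reports


def update_speed_duplex(slave: Slave, regressed: bool) -> None:
    """Runs on a netdev event. Only the regressed backport touches slave.link."""
    speed_ok = slave.carrier  # assumption: no carrier means speed/duplex unknown
    if not speed_ok and regressed:
        slave.link = DOWN


def miimon_inspect(slave: Slave) -> bool:
    """Return True if a link-down change is queued for the commit step."""
    if slave.link == UP:
        if slave.carrier:
            return False
        slave.link = FAIL
    if slave.link == FAIL and not slave.carrier:
        return True  # downdelay=0, so the change is proposed immediately
    # link is already DOWN and there is still no carrier: nothing to propose
    return False


def unplug_and_poll(regressed: bool) -> str:
    """Unplug the active slave, run one poll, return the resulting active slave."""
    active, backup = Slave("sn_link1"), Slave("sn_link2")
    active.carrier = False
    update_speed_duplex(active, regressed)
    if miimon_inspect(active):
        active.link = DOWN  # stand-in for the commit step
        return backup.name
    return active.name


print(unplug_and_poll(regressed=False))
print(unplug_and_poll(regressed=True))
```

In this model the non-regressed path ends with sn_link2 active, while the regressed path leaves sn_link1 active with its link already marked DOWN. That matches the symptom in the report: the slave shows "MII Status: down" but the failure count stays at 0 and the active slave never changes.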


Thank you,

Patrick

Comment 18 Madison Kelly 2020-05-06 17:04:38 UTC
Wonderful news, thank you! Will this be included in the .30 kernel?

Comment 19 LiLiang 2020-05-07 02:04:13 UTC
Hi digimer,

Can this be reproduced by shutting down the peer switch port? I can't unplug the cable because our systems are in a remote lab.

Thanks,
Liang.

Comment 20 Madison Kelly 2020-05-07 02:21:02 UTC
I've not tested, but I very much suspect so. When I pulled the cable (same as downing the port to the bond) it failed to switch over. Likewise, when I cut power to the switch entirely, it seemed to happen.

I'm just building a (kvm/qemu) VM at this moment as I wanted to test using virsh to drop the link to reproduce. I should know if that works as a reproducer in about 30 minutes.

Comment 21 Madison Kelly 2020-05-07 02:36:23 UTC
The issue is NOT reproducible on VMs (emulating e1000 NICs) (on the .29 kernel). 

virsh domif-setlink <server> vnetX {down,up}

Causes the bond to properly fail over. However, I still suspect that dropping a port on a physical switch should be a reliable reproducer. If it is not on your end, I will see if I can reproduce by downing a switch interface in my lab.

Comment 22 LiLiang 2020-05-07 02:45:35 UTC
(In reply to digimer from comment #21)
> The issue is NOT reproducible on VMs (emulating e1000 NICs) (on the .29
> kernel). 
> 
> virsh domif-setlink <server> vnetX {down,up}
> 
> Causes the bond to properly fail over. However, I still suspect that
> dropping a port on a physical switch should be a reliable reproducer. If it
> is not on your end, I will see if I can reproduce by downing a switch
> interface in my lab.

Thanks digimer, I can test this in my lab.
But if I can't reproduce this by administratively downing the switch port, could you help verify this when the new kernel is ready?

Comment 23 LiLiang 2020-05-07 03:18:50 UTC
digimer,

It looks like this issue can't be reproduced by administratively downing the peer switch port. I just ran a test and failover works after shutting down the peer switch port.

-liang.

Comment 35 Madison Kelly 2020-05-07 13:44:02 UTC
I'm happy to test, just let me know where I can get the test kernel.

Comment 36 Madison Kelly 2020-05-11 16:45:41 UTC
Hello, any update?

Comment 37 Patrick Talbert 2020-05-15 15:09:00 UTC
(In reply to digimer from comment #35)
> I'm happy to test, just let me know where I can get the test kernel.

Hi digimer,

You can find a test kernel here:

http://people.redhat.com/ptalbert/


These packages are provided with the following disclaimer:

--------------------------

This RPM has been provided by Red Hat for testing purposes only and is
NOT supported for any other use. This RPM may contain changes that are
necessary for debugging but that are not appropriate for other uses,
or that are not compatible with third-party hardware or software. This
RPM should NOT be deployed for purposes other than testing and
debugging.

--------------------------


Thank you,

Patrick

Comment 38 Madison Kelly 2020-05-15 16:57:20 UTC
Patrick,

  Wonderful, thank you! I'm just finishing up my day, but I will try to load this either this evening or over the weekend. 

  Could I trouble you to add kernel-devel? It's on the machines, but easy enough to remove if it's a hassle for you to upload.

digimer

Comment 39 Patrick Talbert 2020-05-15 17:15:24 UTC
(In reply to digimer from comment #38)
> Patrick,
> 
>   Wonderful, thank you! I'm just finishing up my day, but I will try to load
> this either this evening or over the weekend. 
> 
>   Could I trouble you to add kernel-devel? It's on the machines, but easy
> enough to remove if it's a hassle for you to upload.

Sure, it's been added.

> 
> digimer

Comment 40 Madison Kelly 2020-05-15 17:16:00 UTC
Thanks! I'll test as soon as I can and report back.

Comment 41 Madison Kelly 2020-05-17 23:37:18 UTC
It works!!

====
[root@an-a02n01 ~]# uname -r
2.6.32-754.29.2.el6.test.bz1828604v1.x86_64
====
May 17 19:36:16 an-a02n01 kernel: [  399.389828] igb 0000:08:00.1: ifn_link1: igb: ifn_link1 NIC Link is Down
May 17 19:36:16 an-a02n01 kernel: [  399.451671] ifn_bond1: link status definitely down for interface ifn_link1, disabling it
May 17 19:36:16 an-a02n01 kernel: [  399.451785] ifn_bond1: making interface ifn_link2 the new active one
May 17 19:36:16 an-a02n01 kernel: [  399.451885] device ifn_link1 left promiscuous mode
May 17 19:36:16 an-a02n01 kernel: [  399.452154] device ifn_link2 entered promiscuous mode
====

Woohoo!!

Comment 42 Patrick Talbert 2020-05-22 12:47:23 UTC
*** Bug 1836173 has been marked as a duplicate of this bug. ***

Comment 55 LiLiang 2020-06-01 02:14:11 UTC
reproduced

[root@hp-dl380g9-06 ~]# modprobe -v bonding mode=1 miimon=100 
insmod /lib/modules/2.6.32-754.28.1.el6.x86_64/kernel/drivers/net/bonding/bonding.ko mode=1 miimon=100
[root@hp-dl380g9-06 ~]# ip link set bond0 up
[root@hp-dl380g9-06 ~]# ifenslave bond0 eth0 eth1
[root@hp-dl380g9-06 ~]# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:10:18:e8:2a:20
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:10:18:e8:2a:22
Slave queue ID: 0

[root@hp-dl380g9-06 ~]# get_iface_sw_port eth0 sw p k
[root@hp-dl380g9-06 ~]# swcfg port_down $sw $p
spawn ssh root@10.73.88.4
Password:
--- JUNOS 14.1X53-D35.3 built 2016-03-01 02:31:29 UTC
root@sw-j4550-01:RE:0% cli
{master:0}
root@sw-j4550-01> set cli screen-width 0 
Screen width set to 0

{master:0}
root@sw-j4550-01> set cli screen-length 0 
Screen length set to 0

{master:0}
root@sw-j4550-01> configure private 
warning: uncommitted changes will be discarded on exit
Entering configuration mode

{master:0}[edit]
root@sw-j4550-01# set interfaces xe-0/0/24 disable 

{master:0}[edit]
root@sw-j4550-01# show | diff 
[edit interfaces xe-0/0/24]
+   disable;

{master:0}[edit]
root@sw-j4550-01# commit 
configuration check succeeds
commit complete

{master:0}[edit]
root@sw-j4550-01# commit 
commit complete

{master:0}[edit]
root@sw-j4550-01# commit 
commit complete

{master:0}[edit]
root@sw-j4550-01# exit 
Exiting configuration mode

{master:0}
root@sw-j4550-01> show interfaces xe-0/0/24 terse 
Interface               Admin Link Proto    Local                 Remote
xe-0/0/24               down  down
xe-0/0/24.0             up    down eth-switch

{master:0}
root@sw-j4550-01> exit 

root@sw-j4550-01:RE:0% exit
logout
Connection to 10.73.88.4 closed.

[root@hp-dl380g9-06 ~]# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 00:10:18:e8:2a:20
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:10:18:e8:2a:22
Slave queue ID: 0
[root@hp-dl380g9-06 ~]# uname -r
2.6.32-754.28.1.el6.x86_64

Comment 56 LiLiang 2020-06-01 02:39:54 UTC
verified, regression test is ongoing


[root@hp-dl380g9-06 ~]# modprobe -v bonding mode=1 miimon=100
insmod /lib/modules/2.6.32-754.30.1.el6.x86_64/kernel/drivers/net/bonding/bonding.ko mode=1 miimon=100
[root@hp-dl380g9-06 ~]# ip link set bond0 up
[root@hp-dl380g9-06 ~]# ifenslave bond0 eth0 eth1
[root@hp-dl380g9-06 ~]# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:10:18:e8:2a:20
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:10:18:e8:2a:22
Slave queue ID: 0

[root@hp-dl380g9-06 ~]# swcfg port_down $sw xe-0/0/24
spawn ssh root@10.73.88.4
Password:
--- JUNOS 14.1X53-D35.3 built 2016-03-01 02:31:29 UTC
root@sw-j4550-01:RE:0% cli
{master:0}
root@sw-j4550-01> set cli screen-width 0 
Screen width set to 0

{master:0}
root@sw-j4550-01> set cli screen-length 0 
Screen length set to 0

{master:0}
root@sw-j4550-01> configure private 
warning: uncommitted changes will be discarded on exit
Entering configuration mode

{master:0}[edit]
root@sw-j4550-01# set interfaces xe-0/0/24 disable 

{master:0}[edit]
root@sw-j4550-01# show | diff 
[edit interfaces xe-0/0/24]
+   disable;

{master:0}[edit]
root@sw-j4550-01# commit 
configuration check succeeds
commit complete

{master:0}[edit]
root@sw-j4550-01# commit 
commit complete

{master:0}[edit]
root@sw-j4550-01# commit 
commit complete

{master:0}[edit]
root@sw-j4550-01# exit 
Exiting configuration mode

{master:0}
root@sw-j4550-01> show interfaces xe-0/0/24 terse 
Interface               Admin Link Proto    Local                 Remote
xe-0/0/24               down  down
xe-0/0/24.0             up    down eth-switch

{master:0}
root@sw-j4550-01> exit 

root@sw-j4550-01:RE:0% exit
logout
Connection to 10.73.88.4 closed.

[root@hp-dl380g9-06 ~]# cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 1
Permanent HW addr: 00:10:18:e8:2a:20
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:10:18:e8:2a:22
Slave queue ID: 0

[root@hp-dl380g9-06 ~]# uname -r
2.6.32-754.30.1.el6.x86_64

Comment 57 LiLiang 2020-06-03 03:12:11 UTC
The regression test passed; the cases that failed are known issues.

https://beaker.engineering.redhat.com/recipes/8372865#tasks

Comment 62 errata-xmlrpc 2020-06-09 20:53:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:2430

