Bug 1828604
Summary: | Bonding not failing over in mode=1 under 2.6.32-754.28.1 (...27.1 works OK)
---|---
Product: | Red Hat Enterprise Linux 6
Reporter: | Madison Kelly <mkelly>
Component: | kernel
kernel sub component: | Bonding
Assignee: | Denys Vlasenko <dvlasenk>
QA Contact: | LiLiang <liali>
Status: | CLOSED ERRATA
Severity: | urgent
Priority: | unspecified
CC: | ajawarka, bdoran, dvlasenk, jarod, jbainbri, jeharris, jrd-rhbz, kemyers, kzhang, mezhang, mfalz, mholloway, network-qe, nmurray, pdwyer, prpatel, ptalbert, sserna, sukulkar, toneata, trevor.hemsley, vjadhav
Version: | 6.10
Keywords: | Regression
Target Milestone: | rc
Flags: | ptalbert: needinfo-
Target Release: | ---
Hardware: | x86_64
OS: | Unspecified
Fixed In Version: | kernel-2.6.32-754.30.1.el6
Doc Type: | If docs needed, set a value
Story Points: | ---
Last Closed: | 2020-06-09 20:53:34 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
Category: | ---
oVirt Team: | ---
Cloudforms Team: | ---
Bug Blocks: | 1840972
Description
Madison Kelly
2020-04-27 21:40:08 UTC
This bug can be made public, there is no sensitive info in it.

Adding Denys as it looks like they were recently working on the bonding code for this kernel:

====
* Fri Jan 31 2020 Denys Vlasenko <dvlasenk> [2.6.32-754.28.1.el6]
- [netdrv] ixgbevf: Use cached link state instead of re-reading the value for ethtool (Ken Cox) [1795404]
- [isdn] mISDN: enforce CAP_NET_RAW for raw sockets (Andrea Claudi) [1779473] {CVE-2019-17055}
- [net] cfg80211: wext: avoid copying malformed SSIDs (Jarod Wilson) [1778625] {CVE-2019-17133}
- [netdrv] bonding: speed/duplex update at NETDEV_UP event (Patrick Talbert) [1772779]
- [netdrv] bonding: make speed, duplex setting consistent with link state (Patrick Talbert) [1772779]
- [netdrv] bonding: simplify / unify event handling code for 3ad mode (Patrick Talbert) [1772779]
- [netdrv] bonding: unify all places where actor-oper key needs to be updated (Patrick Talbert) [1772779]
- [netdrv] bonding: simple code refactor (Patrick Talbert) [1772779]
====

Added Patrick as well, given he's referenced in the commit logs.

(In reply to digimer from comment #5)
> Added Patrick as well, given he's referenced in the commit logs.

Hello digimer,

Please follow these instructions to gather debug output from the bonding driver when this issue is reproduced:

1. At boot time in the grub menu modify the kernel command line by adding:

   printk.time log_buf_len=8M bonding.dyndbg="+p"

2. Boot the system and then reproduce the issue.

3. Afterwards, capture the following output and share it with us:

   # cp /proc/net/bonding/sn_bond1 bond.state
   # dmesg > dmesg.log

Also, is there a case for this issue open with Red Hat Technical Support? Please share the case number with us.

Thank you,
Patrick

Hello,

I updated the OS this morning and saw the .29 kernel is out. I also confirmed that the issue remains on .29. Attached are the two requested files.

Created attachment 1683389
bond state

Created attachment 1683390
dmesg output
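A saved bond.state capture like the one requested above can be checked mechanically for the symptom reported in this bug: the bond still lists a slave as active although that slave's MII status is down. The following is a minimal Python sketch, not a Red Hat tool. The function names are hypothetical, and the sample text is only an illustration in the /proc/net/bonding layout, using values from the failing reproduction later in this report rather than the contents of the attachment.

```python
# Hypothetical helper: flag a bond whose active slave is reported as down.
# SAMPLE is illustrative text in the /proc/net/bonding/<bond> layout.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100

Slave Interface: eth0
MII Status: down
Link Failure Count: 0
Permanent HW addr: 00:10:18:e8:2a:20

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:10:18:e8:2a:22
"""


def parse_bond_state(text):
    """Return (active_slave, {slave_name: {field: value}})."""
    active = None
    slaves = {}
    current = None  # slave section being read; None while in the bond header
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # blank line or a line without "key: value"
        key, value = key.strip(), value.strip()
        if key == "Currently Active Slave":
            active = value
        elif key == "Slave Interface":
            current = value
            slaves[current] = {}
        elif current is not None and key in ("MII Status", "Link Failure Count"):
            slaves[current][key] = value
    return active, slaves


def active_slave_is_down(text):
    """True when the bond still points at a slave whose MII status is down."""
    active, slaves = parse_bond_state(text)
    return active in slaves and slaves[active].get("MII Status") == "down"


if __name__ == "__main__":
    print(parse_bond_state(SAMPLE))
    print("stuck on dead slave:", active_slave_is_down(SAMPLE))
```

To check a real capture, read the file and pass its text to `active_slave_is_down()`, for example `active_slave_is_down(open("bond.state").read())`.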
Is any more information needed to move this bug forward?

Oh, I forgot to mention: no, there's no open case. I'm in the ISV program and we found this while testing against the newer kernel(s).

Hello digimer,

Thank you for that data. I have been able to reproduce the issue.

The RHEL6 backport of c4adfc822bf5 ("bonding: make speed, duplex setting consistent with link state") resulted in the bond_update_speed_duplex() function now setting slave->link = BOND_LINK_DOWN whenever the link speed/duplex is Not Good. Prior to this the function did not touch slave->link.

For RHEL 6, this results in a short circuit of the link-state-change switch/case check in bond_miimon_inspect(). It means the function will never set up any change for bond_miimon_commit() to commit.

This EL6 specific issue is therefore fixed by ensuring bond_update_speed_duplex() does not set slave->link.

Thank you,
Patrick

Wonderful news, thank you! Will this be included in the .30 kernel?

Hi digimer,

Can this be reproduced by shutting down the peer switch port? I can't unplug the cable because our systems are in a remote lab.

Thanks,
Liang.

I've not tested, but I very much suspect so. When I pulled the cable (same as downing the port to the bond) it failed to switch over. Likewise, when I cut power to the switch entirely, it seemed to happen. I'm just building a (kvm/qemu) VM at this moment as I wanted to test using virsh to drop the link to reproduce. I should know if that works as a reproducer in about 30 minutes.

The issue is NOT reproducible on VMs (emulating e1000 NICs) (on the .29 kernel).

virsh domif-setlink <server> vnetX {down,up}

Causes the bond to properly fail over. However, I still suspect that dropping a port on a physical switch should be a reliable reproducer. If it is not on your end, I will see if I can reproduce by downing a switch interface in my lab.

(In reply to digimer from comment #21)
> The issue is NOT reproducible on VMs (emulating e1000 NICs) (on the .29
> kernel).
>
> virsh domif-setlink <server> vnetX {down,up}
>
> Causes the bond to properly fail over. However, I still suspect that
> dropping a port on a physical switch should be a reliable reproducer. If it
> is not on your end, I will see if I can reproduce by downing a switch
> interface in my lab.

Thanks digimer, I can test this in my lab. But if I can't reproduce this by administratively downing the switch port, could you help verify the fix when the new kernel is ready?

digimer, it looks like this issue can't be reproduced by administratively downing the peer switch port. I just ran a test and failover works after shutting down the peer switch port.

-liang.

I'm happy to test, just let me know where I can get the test kernel.

Hello, any update?

(In reply to digimer from comment #35)
> I'm happy to test, just let me know where I can get the test kernel.

Hi digimer,

You can find a test kernel here:

http://people.redhat.com/ptalbert/

These packages are provided with the following disclaimer:

--------------------------
This RPM has been provided by Red Hat for testing purposes only and is NOT supported for any other use. This RPM may contain changes that are necessary for debugging but that are not appropriate for other uses, or that are not compatible with third-party hardware or software. This RPM should NOT be deployed for purposes other than testing and debugging.
--------------------------

Thank you,
Patrick

Patrick,

Wonderful, thank you! I'm just finishing up my day, but I will try to load this either this evening or over the weekend.

Could I trouble you to add kernel-devel? It's on the machines, but easy enough to remove if it's a hassle for you to upload.

digimer

(In reply to digimer from comment #38)
> Patrick,
>
> Wonderful, thank you! I'm just finishing up my day, but I will try to load
> this either this evening or over the weekend.
>
> Could I trouble you to add kernel-devel? It's on the machines, but easy
> enough to remove if it's a hassle for you to upload.

Sure, it's been added.
>
> digimer

Thanks! I'll test as soon as I can and report back.

It works!!

====
[root@an-a02n01 ~]# uname -r
2.6.32-754.29.2.el6.test.bz1828604v1.x86_64
====
May 17 19:36:16 an-a02n01 kernel: [ 399.389828] igb 0000:08:00.1: ifn_link1: igb: ifn_link1 NIC Link is Down
May 17 19:36:16 an-a02n01 kernel: [ 399.451671] ifn_bond1: link status definitely down for interface ifn_link1, disabling it
May 17 19:36:16 an-a02n01 kernel: [ 399.451785] ifn_bond1: making interface ifn_link2 the new active one
May 17 19:36:16 an-a02n01 kernel: [ 399.451885] device ifn_link1 left promiscuous mode
May 17 19:36:16 an-a02n01 kernel: [ 399.452154] device ifn_link2 entered promiscuous mode
====

Woohoo!!

*** Bug 1836173 has been marked as a duplicate of this bug. ***

reproduced

[root@hp-dl380g9-06 ~]# modprobe -v bonding mode=1 miimon=100
insmod /lib/modules/2.6.32-754.28.1.el6.x86_64/kernel/drivers/net/bonding/bonding.ko mode=1 miimon=100
[root@hp-dl380g9-06 ~]# ip link set bond0 up
[root@hp-dl380g9-06 ~]# ifenslave bond0 eth0 eth1
[root@hp-dl380g9-06 ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:10:18:e8:2a:20
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:10:18:e8:2a:22
Slave queue ID: 0
[root@hp-dl380g9-06 ~]# get_iface_sw_port eth0 sw p k
[root@hp-dl380g9-06 ~]# swcfg port_down $sw $p
spawn ssh root.88.4
Password:
--- JUNOS 14.1X53-D35.3 built 2016-03-01 02:31:29 UTC
root@sw-j4550-01:RE:0% cli
{master:0}
root@sw-j4550-01> set cli screen-width 0
Screen width set to 0
{master:0}
root@sw-j4550-01> set cli screen-length 0
Screen length set to 0
{master:0}
root@sw-j4550-01> configure private
warning: uncommitted changes will be discarded on exit
Entering configuration mode
{master:0}[edit]
root@sw-j4550-01# set interfaces xe-0/0/24 disable
{master:0}[edit]
root@sw-j4550-01# show | diff
[edit interfaces xe-0/0/24]
+ disable;
{master:0}[edit]
root@sw-j4550-01# commit
configuration check succeeds
commit complete
{master:0}[edit]
root@sw-j4550-01# commit
commit complete
{master:0}[edit]
root@sw-j4550-01# commit
commit complete
{master:0}[edit]
root@sw-j4550-01# exit
Exiting configuration mode
{master:0}
root@sw-j4550-01> show interfaces xe-0/0/24 terse
Interface      Admin Link Proto       Local  Remote
xe-0/0/24      down  down
xe-0/0/24.0    up    down eth-switch
{master:0}
root@sw-j4550-01> exit
root@sw-j4550-01:RE:0% exit
logout
Connection to 10.73.88.4 closed.
[root@hp-dl380g9-06 ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 00:10:18:e8:2a:20
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:10:18:e8:2a:22
Slave queue ID: 0
[root@hp-dl380g9-06 ~]# uname -r
2.6.32-754.28.1.el6.x86_64

verified, regression test is ongoing

[root@hp-dl380g9-06 ~]# modprobe -v bonding mode=1 miimon=100
insmod /lib/modules/2.6.32-754.30.1.el6.x86_64/kernel/drivers/net/bonding/bonding.ko mode=1 miimon=100
[root@hp-dl380g9-06 ~]# ip link set bond0 up
[root@hp-dl380g9-06 ~]# ifenslave bond0 eth0 eth1
[root@hp-dl380g9-06 ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:10:18:e8:2a:20
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:10:18:e8:2a:22
Slave queue ID: 0
[root@hp-dl380g9-06 ~]# swcfg port_down $sw xe-0/0/24
spawn ssh root.88.4
Password:
--- JUNOS 14.1X53-D35.3 built 2016-03-01 02:31:29 UTC
root@sw-j4550-01:RE:0% cli
{master:0}
root@sw-j4550-01> set cli screen-width 0
Screen width set to 0
{master:0}
root@sw-j4550-01> set cli screen-length 0
Screen length set to 0
{master:0}
root@sw-j4550-01> configure private
warning: uncommitted changes will be discarded on exit
Entering configuration mode
{master:0}[edit]
root@sw-j4550-01# set interfaces xe-0/0/24 disable
{master:0}[edit]
root@sw-j4550-01# show | diff
[edit interfaces xe-0/0/24]
+ disable;
{master:0}[edit]
root@sw-j4550-01# commit
configuration check succeeds
commit complete
{master:0}[edit]
root@sw-j4550-01# commit
commit complete
{master:0}[edit]
root@sw-j4550-01# commit
commit complete
{master:0}[edit]
root@sw-j4550-01# exit
Exiting configuration mode
{master:0}
root@sw-j4550-01> show interfaces xe-0/0/24 terse
Interface      Admin Link Proto       Local  Remote
xe-0/0/24      down  down
xe-0/0/24.0    up    down eth-switch
{master:0}
root@sw-j4550-01> exit
root@sw-j4550-01:RE:0% exit
logout
Connection to 10.73.88.4 closed.
[root@hp-dl380g9-06 ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 1
Permanent HW addr: 00:10:18:e8:2a:20
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:10:18:e8:2a:22
Slave queue ID: 0
[root@hp-dl380g9-06 ~]# uname -r
2.6.32-754.30.1.el6.x86_64

regression test passed, some failed cases are known issues.
https://beaker.engineering.redhat.com/recipes/8372865#tasks

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:2430
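As a recap of the root cause Patrick describes in this report, the following toy Python model may help show why failover never fired. It is a simplified sketch, not the kernel code: the function names only mirror bond_update_speed_duplex(), bond_miimon_inspect() and bond_miimon_commit(), the `touches_link` flag and the `Slave` class are hypothetical, up/down delays are ignored (they are 0 in the reproduction), and it assumes the speed/duplex update runs before the next miimon poll.

```python
from dataclasses import dataclass

LINK_UP = "up"
LINK_DOWN = "down"


@dataclass
class Slave:
    name: str
    carrier: bool = True   # what the NIC reports right now
    link: str = LINK_UP    # link state recorded by the bonding driver
    failures: int = 0      # models "Link Failure Count"


def update_speed_duplex(slave, touches_link):
    """Model of the speed/duplex refresh triggered by a netdev event.

    touches_link=True models the regressed backport, which also records the
    slave as down; touches_link=False models the behaviour before the
    regression and after the fix.
    """
    if not slave.carrier and touches_link:
        slave.link = LINK_DOWN


def miimon_inspect(slaves):
    """Propose a change only when the recorded state disagrees with carrier."""
    changes = []
    for slave in slaves:
        if slave.link == LINK_UP and not slave.carrier:
            changes.append((slave, LINK_DOWN))
        elif slave.link == LINK_DOWN and slave.carrier:
            changes.append((slave, LINK_UP))
    return changes


def miimon_commit(slaves, active, changes):
    """Apply proposed changes; pick a new active slave if the active one died."""
    for slave, new_state in changes:
        slave.link = new_state
        if new_state == LINK_DOWN:
            slave.failures += 1
            if slave is active:
                active = next((s for s in slaves if s.link == LINK_UP), None)
    return active


def simulate(touches_link):
    """Pull the cable on eth0 and return (active slave name, eth0 failures)."""
    eth0, eth1 = Slave("eth0"), Slave("eth1")
    slaves = [eth0, eth1]
    active = eth0
    eth0.carrier = False                      # link goes down
    update_speed_duplex(eth0, touches_link)   # assumed to run before the poll
    active = miimon_commit(slaves, active, miimon_inspect(slaves))
    return (active.name if active else None), eth0.failures


if __name__ == "__main__":
    print("regressed:", simulate(touches_link=True))
    print("fixed:    ", simulate(touches_link=False))
```

In the regressed case the model leaves eth0 active with a failure count of 0, which matches the /proc/net/bonding/bond0 output captured on 2.6.32-754.28.1. In the fixed case eth1 becomes active and eth0 records one failure, matching the 2.6.32-754.30.1 output.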