Description of problem:
The be2net driver (version 2.102.115r according to modinfo) shipped with 2.6.18-194.17.1 fails to work with bonding, causing the active interface to "flap" constantly back and forth when used with the HP NC553i (Emulex Converged Adapter) as shipped with HP ProLiant BL460c G7 blade servers.

Version-Release number of selected component (if applicable):
kernel 2.6.18-194.17.1 x86_64

How reproducible:
Always

Steps to Reproduce:
1. Set up a bonding interface in an active-backup configuration using the be2net network card.
2. Set the be2net card to be the primary/active card.
3. Watch syslog.

Actual results:
Syslog shows:
Jan 21 11:11:13 schhyt14 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 21 11:11:14 schhyt14 kernel: bonding: bond0: now running without any active interface !
Jan 21 11:11:14 schhyt14 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 21 11:11:14 schhyt14 kernel: bonding: bond0: making interface eth1 the new active one.
Jan 21 11:11:14 schhyt14 kernel: bonding: bond0: first active interface up!
Jan 21 11:11:14 schhyt14 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 21 11:11:14 schhyt14 kernel: bonding: bond0: now running without any active interface !
Jan 21 11:11:14 schhyt14 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 21 11:11:14 schhyt14 kernel: bonding: bond0: making interface eth1 the new active one.
Jan 21 11:11:14 schhyt14 kernel: bonding: bond0: first active interface up!
These messages appear at the frequency specified by the arp_interval bonding parameter.

Expected results:
The network card should not flap unless a failure happens.
Additional info:
The problem is solved by installing the be2net drivers supplied by HP, as found at
http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareIndex.jsp?lang=en&cc=us&prodNameId=4324630&prodTypeId=329290&prodSeriesId=4324629&swLang=8&taskId=135&swEnvOID=4004
This driver package uses version 2.102.514.0 of the be2net driver.

I would like the be2net driver included in RHEL's kernel package to be updated. We are willing to use a test kernel if needed.
Triage assignment. If you feel this bug doesn't belong to you, or that it cannot be handled in a timely fashion, please contact me for re-assignment
This issue is fixed in RHEL 5.6, released Jan 13, as the driver version in the 5.6 release is 2.102.512r.
Do you have the link to the errata with the fix ?
I'm going to repeat my testing with kernel 2.6.18-238.1.1 and report back with results.
Just finished installing 2.6.18-238.1.1 and rebooted. I was greeted by this panic message after the system finished configuring the network:

CR2: 00002abb1968e0a0 CR3: 0000000000201000 CR4: 00000000000006e0
Process bond0 (pid: 5233, threadinfo ffff81060638e000, task ffff810c0a9b60c0)
Stack: ffff810c08d3e8d8 ffff810c08d3e8e0 ffff810c0991e7c0 0000000000000282 ffff810c08d3e500 ffffffff8004d7d0 ffff81060638fe80 ffff810c0991e7c0 ffffffff8004a018 ffff810c034ffd68 0000000000000282 ffff810c034ffd58
Call Trace:
 [<ffffffff8004d7d0>] run_workqueue+0x99/0xf6
 [<ffffffff8004a018>] worker_thread+0x0/0x122
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a108>] worker_thread+0xf0/0x122
 [<ffffffff8008e40a>] default_wake_function+0x0/0xe
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032996>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032898>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
Code: 0f 0b 68 aa f4 71 88 c2 87 00 48 8d 5d 34 48 89 df e8 57 e5
RIP [<ffffffff88716607>] :bonding:bond_activebackup_arp_mon+0x331/0x55d
 RSP <ffff81060638fe10>
<0>Kernel panic - not syncing: Fatal exception
More info:
It looks like this driver is still affected by flapping. As the syslog fragment below shows, there are 10 flapping events per second, matching the arp_interval setting of 100 ms. After about 2 minutes of running like this, give or take, the system crashes with the panic described in my previous comment. Furthermore, for some strange reason the correct speed setting on eth3 fails to be detected and it falls back to 100Mb/s (it's 1GbE).

This is the running bonding config as used in ifcfg-bond0:
BONDING_OPTS="arp_interval=100 arp_ip_target=192.168.1.1 mode=1 primary=eth1"

This bond is made of 2 different cards (for reliability):
eth1 = be2net
eth3 = tg3

uname output:
==================
# uname -a
Linux schhyt14 2.6.18-238.1.1.el5 #1 SMP Tue Jan 4 13:32:19 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

lsmod:
======
# lsmod
Module Size Used by
nfs 298413 1
fscache 52385 1 nfs
nfs_acl 36673 1 nfs
lockd 101553 1 nfs
sunrpc 200073 8 nfs,nfs_acl,lockd
tg3 161736 0
be2net 85325 0
8021q 57937 1 be2net
bonding 142441 0
ipv6 435105 97 bonding
xfrm_nalgo 43333 1 ipv6
crypto_api 42945 1 xfrm_nalgo
emcpvlumd 68064 0
emcpxcrypt 166248 0
emcpdm 75272 0
emcpgpx 55376 3 emcpvlumd,emcpxcrypt,emcpdm
emcpmpx 200872 0
emcp 2173088 5 emcpvlumd,emcpxcrypt,emcpdm,emcpgpx,emcpmpx
dm_multipath 56921 0
scsi_dh 42177 1 dm_multipath
video 53197 0
backlight 39873 1 video
sbs 49921 0
power_meter 47053 0
hwmon 36553 1 power_meter
i2c_ec 38593 1 sbs
i2c_core 57537 1 i2c_ec
dell_wmi 37601 0
wmi 41985 1 dell_wmi
button 40545 0
battery 43849 0
asus_acpi 50917 0
acpi_memhotplug 40517 0
ac 38729 0
parport_pc 62313 0
lp 47121 0
parport 73165 2 parport_pc,lp
joydev 43969 0
libiscsi2 77765 0
hpilo 44497 0
scsi_transport_iscsi2 73945 1 libiscsi2
tpm_tis 48077 0
serio_raw 40517 0
tpm 50273 1 tpm_tis
scsi_transport_iscsi 35017 1 scsi_transport_iscsi2
tpm_bios 40897 1 tpm
i7core_edac 46793 0
edac_mc 60449 1 i7core_edac
pcspkr 36289 0
dm_raid45 99657 0
dm_message 36289 1 dm_raid45
dm_region_hash 46145 1 dm_raid45
dm_mem_cache 38977 1 dm_raid45
dm_snapshot 52233 0
dm_zero 35265 0
dm_mirror 54737 0
dm_log 44993 3 dm_raid45,dm_region_hash,dm_mirror
dm_mod 101393 21 dm_multipath,dm_raid45,dm_snapshot,dm_zero,dm_mirror,dm_log
megaraid_sas 88713 0
megaraid_mbox 65873 0
megaraid_mm 44793 1 megaraid_mbox
megaraid 73897 0
mptspi 54609 0
scsi_transport_spi 59841 1 mptspi
mptsas 84689 0
mptscsih 69697 2 mptspi,mptsas
scsi_transport_sas 68801 1 mptsas
mptbase 122757 3 mptspi,mptsas,mptscsih
shpchp 70637 0
cciss 109001 3
sd_mod 56513 0
scsi_mod 199001 14 emcp,scsi_dh,libiscsi2,scsi_transport_iscsi2,megaraid_sas,megaraid_mbox,megaraid,mptspi,scsi_transport_spi,mptsas,mptscsih,scsi_transport_sas,cciss,sd_mod
ext3 168913 7
jbd 94769 1 ext3
uhci_hcd 57433 0
ohci_hcd 56181 0
ehci_hcd 65997 0

modinfo
==============
# modinfo be2net
filename:       /lib/modules/2.6.18-238.1.1.el5/kernel/drivers/net/benet/be2net.ko
license:        GPL
author:         ServerEngines Corporation
description:    ServerEngines BladeEngine 10Gbps NIC Driver 2.102.518r
version:        2.102.518r
srcversion:     76890C397EB8D93CCC6B539
alias:          pci:v000019A2d00000710sv*sd*bc*sc*i*
alias:          pci:v000019A2d00000700sv*sd*bc*sc*i*
alias:          pci:v000019A2d00000221sv*sd*bc*sc*i*
alias:          pci:v000019A2d00000211sv*sd*bc*sc*i*
depends:        8021q
vermagic:       2.6.18-238.1.1.el5 SMP mod_unload gcc-4.1
parm:           rx_frag_size:Size of a fragment that holds rcvd data. (uint)
parm:           num_vfs:Number of PCI VFs to initialize (uint)
parm:           lro:Obsolete, only for backward compatibility. Don't use. (uint)
module_sig:     883f3504d236b8a286a51799a18fc80112691f0a093cafb72b2480ef1b97385edd1af6806c48bb220a0cf45e23fd4ded5becc5cb9b27491a2c6d52ea8c

This is how the bonding configuration looks before panicking:
==================================================================
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.4.0-1 (October 7, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth1 (primary_reselect always)
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 100
ARP IP target/s (n.n.n.n form): 192.168.1.1

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 46
Permanent HW addr: d4:85:64:4e:cd:a4

Slave Interface: eth3
MII Status: up
Speed: 100 Mbps
Duplex: full
Link Failure Count: 3
Permanent HW addr: 78:e7:d1:5b:60:a9

Syslog Fragment:
========================
Jan 24 14:56:37 schhyt14 kernel: bonding: bond0: Warning: failed to get speed and duplex from eth3, assumed to be 100Mb/sec and Full.
Jan 24 14:58:54 schhyt14 kernel: bonding: bond0: making interface eth1 the new active one.
Jan 24 14:58:54 schhyt14 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 24 14:58:54 schhyt14 kernel: bonding: bond0: making interface eth3 the new active one.
Jan 24 14:58:54 schhyt14 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 24 14:58:54 schhyt14 kernel: bonding: bond0: making interface eth1 the new active one.
Jan 24 14:58:54 schhyt14 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 24 14:58:54 schhyt14 kernel: bonding: bond0: making interface eth3 the new active one.
Jan 24 14:58:54 schhyt14 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 24 14:58:54 schhyt14 kernel: bonding: bond0: making interface eth1 the new active one.
Jan 24 14:58:55 schhyt14 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 24 14:58:55 schhyt14 kernel: bonding: bond0: making interface eth3 the new active one.
Jan 24 14:58:55 schhyt14 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 24 14:58:55 schhyt14 kernel: bonding: bond0: making interface eth1 the new active one.
Jan 24 14:58:55 schhyt14 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 24 14:58:55 schhyt14 kernel: bonding: bond0: making interface eth3 the new active one.
Jan 24 14:58:55 schhyt14 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 24 14:58:55 schhyt14 kernel: bonding: bond0: making interface eth1 the new active one.
Jan 24 14:58:55 schhyt14 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 24 14:58:55 schhyt14 kernel: bonding: bond0: making interface eth3 the new active one.
Jan 24 14:58:55 schhyt14 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 24 14:58:55 schhyt14 kernel: bonding: bond0: making interface eth1 the new active one.
Jan 24 14:58:56 schhyt14 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 24 14:58:56 schhyt14 kernel: bonding: bond0: making interface eth3 the new active one.
Jan 24 14:58:56 schhyt14 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 24 14:58:56 schhyt14 kernel: bonding: bond0: making interface eth1 the new active one.
Jan 24 14:58:56 schhyt14 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Jan 24 14:58:56 schhyt14 kernel: bonding: bond0: making interface eth3 the new active one.
Jan 24 14:58:56 schhyt14 kernel: bonding: bond0: link status definitely up for interface eth1.
Jan 24 14:58:56 schhyt14 kernel: bonding: bond0: making interface eth1 the new active one.
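For anyone who wants to quantify the flapping rather than eyeball it, here is a minimal Python sketch that counts "link status definitely down" messages per syslog timestamp. The regular expression is written only against the log lines quoted in this report, and the names DOWN_RE and count_link_down_events are made up for this example; other syslog formats would need a different pattern.

```python
import re
from collections import Counter

# Matches the bonding "link status definitely down" message as it appears
# in the syslog fragments of this report (assumed format, not a general one).
DOWN_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: bonding: "
    r"(?P<bond>\w+): link status definitely down for interface (?P<slave>\w+)"
)


def count_link_down_events(lines):
    """Return a Counter keyed by (timestamp, bond, slave)."""
    counts = Counter()
    for line in lines:
        match = DOWN_RE.match(line.strip())
        if match:
            key = (match["stamp"], match["bond"], match["slave"])
            counts[key] += 1
    return counts
```

Feeding it the lines of /var/log/messages gives one counter entry per second, bond and slave. Since syslog timestamps here only have one-second resolution, the count per entry is the number of link-down events logged in that second.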
Forgot to include the lspci output, just in case:

# lspci |grep Eth
02:00.0 Ethernet controller: ServerEngines Corp. Emulex OneConnect 10Gb NIC (be3) (rev 01)
02:00.1 Ethernet controller: ServerEngines Corp. Emulex OneConnect 10Gb NIC (be3) (rev 01)
0c:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3)
0c:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3)
0e:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3)
0e:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3)
Just finished rebuilding the system from scratch with RHEL 5.6 using 2.6.18-238.el5 (previously it was a kludge of RHEL 5.5 with the 5.6 kernel package). The same problem persists: as soon as I enable arp_monitoring on the bond0 interface, the flapping begins.
One more thing: I can definitely say that this bug has something to do with be2net. I just created a new bond interface using only 2 tg3 NICs with arp_interval set, and there is absolutely no flapping. Unfortunately the HP driver no longer compiles on RHEL 5.6, otherwise I'd test it.
Could you please try to test the kernel available here: http://people.redhat.com/ivecera/rhel-5-ivtest/
Ivan: Your kernel seems to be working fine. Can you tell me what's changed, or maybe describe the issue and fix(es) from a more technical point of view?

The only problem I see (and that I've also seen on 2.6.18-238) is that somehow bonding thinks that one of the slaves is down when it's not, on the bonding interface which is NOT using arp_monitoring. Perhaps it should be the subject of another bug report. See:

# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.4.0-1 (October 7, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth2 (primary_reselect always)
Currently Active Slave: eth4
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 100
ARP IP target/s (n.n.n.n form): 10.201.12.1

Slave Interface: eth2
MII Status: down
Speed: 100 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 78:e7:d1:5b:60:a8

Slave Interface: eth4
MII Status: up
Speed: 100 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 78:e7:d1:5b:60:aa

# ethtool eth2
Settings for eth2:
        Supported ports: [ FIBRE ]
        Supported link modes: 1000baseT/Half 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes: 1000baseT/Half 1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: g
        Current message level: 0x000000ff (255)
        Link detected: yes

# dmesg |ifconfig eth2
eth2      Link encap:Ethernet HWaddr 78:E7:D1:5B:60:A8
          UP BROADCAST RUNNING SLAVE MULTICAST MTU:9000 Metric:1
          RX packets:1 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:64 (64.0 b) TX bytes:448 (448.0 b)
          Interrupt:61 Memory:fb9f0000-fba00000

# dmesg |grep eth2
eth2: Tigon3 [partno(N/A) rev 9003 PHY(5714)] (PCIX:133MHz:64-bit) 1000Base-SX Ethernet 78:e7:d1:5b:60:a8
eth2: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[0] TSOcap[1]
eth2: dma_rwctrl[76148000] dma_mask[40-bit]
bonding: bond1: Adding slave eth2.
bonding: bond1: Warning: failed to get speed and duplex from eth2, assumed to be 100Mb/sec and Full.
bonding: bond1: making interface eth2 the new active one.
bonding: bond1: enslaving eth2 as an active interface with an up link.
bonding: bond1: Setting eth2 as primary slave.
bonding: bond1: link status definitely down for interface eth2, disabling it
tg3: eth2: Link is up at 1000 Mbps, full duplex.
tg3: eth2: Flow control is off for TX and off for RX.
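To spot this kind of mismatch (bonding reports a slave as down while ethtool reports link detected) without reading the whole file, the per-slave sections of /proc/net/bonding/bondN can be pulled apart with a few lines of Python. This is only a sketch: it assumes the "Key: value" layout shown in this report for bonding driver v3.4.0-1, and the function names are invented for the example.

```python
def parse_bonding_status(text):
    """Map each slave name to a dict of its 'Key: value' fields."""
    slaves = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue
        # Split on the first colon only, so MAC addresses stay intact.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = slaves.setdefault(value, {})
        elif current is not None:
            current[key] = value
    return slaves


def slaves_marked_down(text):
    """Return the slaves whose 'MII Status' field reads 'down'."""
    return [
        name
        for name, fields in parse_bonding_status(text).items()
        if fields.get("MII Status") == "down"
    ]
```

Fields that appear before the first "Slave Interface:" line (the bond-level settings) are ignored on purpose, so the bond's own "MII Status" is not confused with a slave's.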
Hi Gerardo,
the reason was simple: during the backporting of the be2net driver into RHEL5.5, somebody forgot to update 'netdev->trans_start' in the be_xmit function. In the upstream kernel this is done automatically, but in RHEL5 the network driver is responsible for this update.
Ivan
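To illustrate why a stale trans_start matters to the ARP monitor, here is a toy Python model. It is not the kernel code: the Slave class, the run_arp_monitor function and the "within two intervals" thresholds are assumptions chosen for illustration. The only thing it takes from the explanation above is that the monitor looks at the last transmit time, which the RHEL5 driver has to stamp itself.

```python
from dataclasses import dataclass


@dataclass
class Slave:
    """Hypothetical stand-in for a bonding slave; not a kernel structure."""
    updates_trans_start: bool  # does the driver's xmit path stamp trans_start?
    trans_start: int = 0       # time of last transmit, in ms
    last_rx: int = 0           # time of last receive, in ms
    link_up: bool = True


def run_arp_monitor(slave, interval_ms, rounds):
    """Run a toy ARP monitor and return the number of link state changes."""
    changes = 0
    now = 0
    for _ in range(rounds):
        now += interval_ms
        # Assume an ARP probe goes out and a reply comes back every round.
        if slave.updates_trans_start:
            slave.trans_start = now
        slave.last_rx = now

        # Assumed thresholds: traffic counts as recent within two intervals.
        tx_recent = now - slave.trans_start < 2 * interval_ms
        rx_recent = now - slave.last_rx < 2 * interval_ms

        if slave.link_up and not (tx_recent and rx_recent):
            slave.link_up = False  # "link status definitely down"
            changes += 1
        elif not slave.link_up and rx_recent:
            slave.link_up = True   # "link status definitely up"
            changes += 1
    return changes
```

In this model, run_arp_monitor(Slave(updates_trans_start=True), 100, 20) reports no state changes, while the same call with updates_trans_start=False alternates between down and up on successive monitor rounds, which resembles the pattern in the syslog fragments quoted earlier in this report.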
Thanks for the insight Ivan, when can we expect the fix to hit the mainstream kernel? Our management has given this new hardware certification top prio and this bug is a showstopper for us.

Also, I made a mistake describing the problem in my last post: the other bond IS set to use ARP monitoring (I incorrectly stated it wasn't). Furthermore, if I set the bonding interface to use mode=6 (Adaptive Load Balancing), it detects link on all interfaces.
(In reply to comment #13)
> Thanks for the insight Ivan,
> when can we expect the fix to hit the mainstream kernel?
> Our management has given this new hardware certification top prio and this bug
> is a showstopper for us.
>

This will be fixed in 5.7 and probably also in 5.6 release.
in kernel-2.6.18-242.el5

You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
*** Bug 674051 has been marked as a duplicate of this bug. ***
kernel-2.6.18-238.5.1.el5.x86_64.rpm has been released. Question: does it include the fix for this BZ?
HP ProLiant Blade 460c-G7, RHELS 5.6, test kernel 2.6.18-245.el5: we just suffered an HP-VC crash, but this time the Linux system did not crash and regained network connectivity (after HP-VC came back online) without the need to restart the network service.

Mar 25 10:17:59 scomp1101 kernel: eth1: Link down
Mar 25 10:17:59 scomp1101 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Mar 25 10:18:09 scomp1101 kernel: eth0: Link down
Mar 25 10:18:09 scomp1101 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Mar 25 10:18:10 scomp1101 kernel: bonding: bond0: now running without any active interface !
Mar 25 10:21:13 scomp1101 kernel: eth1: Link up
Mar 25 10:21:14 scomp1101 kernel: bonding: bond0: link status definitely up for interface eth1.
Mar 25 10:21:14 scomp1101 kernel: bonding: bond0: making interface eth1 the new active one.
Mar 25 10:21:14 scomp1101 kernel: bonding: bond0: first active interface up!
Mar 25 10:21:14 scomp1101 kernel: eth1: Link down
Mar 25 10:21:14 scomp1101 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Mar 25 10:21:14 scomp1101 kernel: bonding: bond0: now running without any active interface !
Mar 25 10:21:49 scomp1101 kernel: eth1: Link up
Mar 25 10:21:49 scomp1101 kernel: bonding: bond0: link status definitely up for interface eth1.
Mar 25 10:21:49 scomp1101 kernel: bonding: bond0: making interface eth1 the new active one.
Mar 25 10:21:49 scomp1101 kernel: bonding: bond0: first active interface up!
Mar 25 10:21:58 scomp1101 kernel: eth0: Link up
Mar 25 10:21:58 scomp1101 kernel: bonding: bond0: link status definitely up for interface eth0.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Prior to this update, the be2net driver failed to work with bonding, causing "flapping" errors (the interface switching between the up and down states) in the active interface. This was because the netdev->trans_start timestamp was not updated in the be_xmit function. With this update, the timestamp is properly updated and "flapping" errors no longer occur.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html