Red Hat Bugzilla – Bug 190262
channel bonding causes system lockup / freeze
Last modified: 2007-11-30 17:07:24 EST
Description of problem:
Having multiple lockups under RHEL-4-U3-i386 after enabling channel bonding.
Only seems to happen after extended high-speed transfers (> 20K packets/sec) and
only after a long period of time. Currently using balance-rr mode.
Version-Release number of selected component (if applicable):
How reproducible:
Three lockups this week on two different systems.
Steps to Reproduce:
1. rsync large filesystems from multiple servers simultaneously
2. wait several hours
Actual results:
system locks up
Expected results:
system continues to operate normally
APPRO S2228X 2U server, Tyan S2721-533 motherboard, Intel FW82546EB gig-E
controller. Systems have been operating reliably for 2 years prior to use of
bonding driver (balance-rr).
Will attach interesting syslog information after submission
Created attachment 128396 [details]
Kernel messages captured by syslog prior to latest lockup
Obviously the hang is bad, but you have a 10.8.17.227 and that's messed up too.
Has 10.8.17.227 been messed up for a long time or is that something new this week?
10.8.17.227 works just fine on my private non-routable network...
What do _you_ think should be wrong with this?
Hmmm, I believe this should already be fixed in the beta. Please try:
Any test results?
The log looks more like an e1000 problem, although it is possible that the
bonding transmit scheduler is interfering w/ the expected operation of the
Have you tried any other bonding modes? Mode 2 or even mode 4 are likely to
be "drop-in" replacements for mode 0. Please note that after changing the
"mode=..." option in modprobe.conf you will need to either reboot or
explicitly remove the bonding module before reloading it.
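The reload step mentioned above could look roughly like the following. This is only a dry-run sketch under a few assumptions: the bond is called bond0, it is managed by the usual ifdown/ifup scripts, and ifup reloads the bonding module through the alias in modprobe.conf. The function name bond_reload_cmds is made up for illustration, and it prints the commands instead of running them. Edit the "mode=..." option in /etc/modprobe.conf first (mode 0 is balance-rr, mode 2 is balance-xor, mode 4 is 802.3ad).

```shell
#!/bin/sh
# Print (do not run) the commands that make a changed bonding "mode=" take effect.
# Assumption: the bond device name is passed as $1 and defaults to bond0.
bond_reload_cmds() {
    bond="${1:-bond0}"
    echo "ifdown $bond"                  # take the bond down
    echo "modprobe -r bonding"           # explicitly remove the bonding module
    echo "ifup $bond"                    # assumed to reload the module with the new options
    echo "cat /proc/net/bonding/$bond"   # check the reported "Bonding Mode:" line
}

bond_reload_cmds bond0
```

Review the printed commands and, if they match your setup, run them as root from the console rather than over the bonded link.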
Just to be clear, are you able to successfully complete the same operation
using only individual e1000 interfaces rather than a bond?
The log output looks a lot like bug 182215 (a Fedora bug), FWIW...
Unfortunately, I no longer have e1000 systems available for testing. I have had
two server failures and all my test equipment is now in full production. For
those of you working for Red Hat, there is an open service request with more
information. It is Service Request 874560.
I do, however, have 4 HP DL360s with dual Broadcom BCM5703X adapters.
I will try to reproduce the problem on those systems... 8 copies of netcat
shoving local:/dev/zero into remote:/dev/null ought to do the trick...
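For anyone wanting to generate a similar load, the netcat approach described above might be sketched as follows. The host name "remote", the base port 5001, and the function name nc_flood_cmds are placeholders, not values from this report. The sketch only prints the sender-side commands; a listener such as "nc -l -p 5001 > /dev/null" is assumed to be running on the remote side for each port (listener flags differ between netcat variants).

```shell
#!/bin/sh
# Print sender-side commands for N parallel netcat streams of /dev/zero.
# Assumptions: $1 = remote host (placeholder), $2 = number of streams,
# $3 = first port; each stream uses its own port.
nc_flood_cmds() {
    remote="${1:-remote}"
    streams="${2:-8}"
    base_port="${3:-5001}"
    i=0
    while [ "$i" -lt "$streams" ]; do
        echo "nc $remote $((base_port + i)) < /dev/zero &"
        i=$((i + 1))
    done
}

nc_flood_cmds remote 8 5001
```

Printing the commands first makes it easy to check hosts and ports before pushing traffic at a production network.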
Randy, could you try "ethtool -K eth0 tso off" (replace eth0 as appropriate)?
Please do so and post the results here...thanks!
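On a bonded setup the same ethtool setting would presumably be needed on every slave. A minimal sketch, assuming the bond is bond0 and that its slaves are listed as "Slave Interface:" lines in /proc/net/bonding/bond0; the function names list_slaves and tso_off_cmds are made up for illustration, and the commands are printed rather than executed.

```shell
#!/bin/sh
# List the slaves of a bond by parsing a bonding status file.
# The path is a parameter so the parser can also be tried on a saved copy.
list_slaves() {
    awk -F': ' '/^Slave Interface:/ { print $2 }' "${1:-/proc/net/bonding/bond0}"
}

# Print one "ethtool -K <nic> tso off" command per slave.
tso_off_cmds() {
    for nic in $(list_slaves "$1"); do
        echo "ethtool -K $nic tso off"
    done
}
```

Running "tso_off_cmds | sh" as root would apply the setting, and "ethtool -k eth0" shows the current offload state of an interface.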
Copy/Posting from BZ #194460 (for FC5)
One of our internal test file servers had the same problem yesterday, and it
took the network down along with it (very serious).
Only eth0 i.e. onboard 82573V was in use at the time of the problem.
Currently this interface has been downed, and the server is running off
the other onboard NIC.
# lspci
00:00.0 Host bridge: Intel Corporation E7230 Memory Controller Hub
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller
00:1f.2 SATA controller: Intel Corporation 82801GR/GH (ICH7 Family) Serial ATA Storage Controller AHCI (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
03:00.0 Ethernet controller: Intel Corporation 82573V Gigabit Ethernet Controller (Copper) (rev 03)
04:04.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
04:05.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05)
# lspci -n
00:00.0 Class 0600: 8086:2778
00:1c.0 Class 0604: 8086:27d0 (rev 01)
00:1c.4 Class 0604: 8086:27e0 (rev 01)
00:1c.5 Class 0604: 8086:27e2 (rev 01)
00:1d.0 Class 0c03: 8086:27c8 (rev 01)
00:1d.1 Class 0c03: 8086:27c9 (rev 01)
00:1d.2 Class 0c03: 8086:27ca (rev 01)
00:1d.3 Class 0c03: 8086:27cb (rev 01)
00:1d.7 Class 0c03: 8086:27cc (rev 01)
00:1e.0 Class 0604: 8086:244e (rev e1)
00:1f.0 Class 0601: 8086:27b8 (rev 01)
00:1f.1 Class 0101: 8086:27df (rev 01)
00:1f.2 Class 0106: 8086:27c1 (rev 01)
00:1f.3 Class 0c05: 8086:27da (rev 01)
03:00.0 Class 0200: 8086:108b (rev 03)
04:04.0 Class 0300: 1002:515e (rev 02)
04:05.0 Class 0200: 8086:1076 (rev 05)
ifconfig before taking down the problem NIC:
# cat ifconfig.out
eth0 Link encap:Ethernet HWaddr 00:13:20:D6:AD:E3
inet addr:10.65.6.1 Bcast:10.65.6.255 Mask:255.255.255.0
inet6 addr: fe80::213:20ff:fed6:ade3/64 Scope:Link
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:84132882 errors:297966072 dropped:297966072
TX packets:10677632885 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:75213992657 (70.0 GiB) TX bytes:854693824469 (795.9 GiB)
Base address:0x2000 Memory:88100000-88120000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:6082 errors:0 dropped:0 overruns:0 frame:0
TX packets:6082 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:868538 (848.1 KiB) TX bytes:868538 (848.1 KiB)
# ethtool -e eth0
0x0000 00 13 20 d6 ad e3 30 0b 46 f7 01 10 ff ff ff ff
0x0010 ff ff ff ff 6b 02 a3 30 86 80 8b 10 86 80 de 80
0x0020 00 00 00 20 14 7e 00 00 00 00 d8 00 00 00 00 27
0x0030 c9 6c 50 31 22 07 0b 04 84 09 00 00 00 c0 06 07
0x0040 08 10 00 00 04 0f ff 7f 01 4d ff ff ff ff ff ff
0x0050 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0060 00 01 00 40 1c 12 07 40 ff ff ff ff ff ff ff ff
0x0070 ff ff ff ff ff ff ff ff ff ff ff ff ff ff 22 57
# ethtool -e eth1
0x0000 00 13 20 d6 ad e4 10 02 ff ff 00 10 ff ff ff ff
0x0010 ff ff ff ff 0b 64 a1 30 86 80 76 10 86 80 84 b2
0x0020 dd 20 22 22 00 00 90 2f 80 23 12 00 20 1e 12 00
0x0030 20 1e 12 00 20 1e 12 00 20 1e 09 00 00 02 00 00
0x0040 0c 00 a6 93 0b 28 00 00 00 04 ff ff ff ff ff ff
0x0050 ff ff ff ff ff ff ff ff ff ff ff ff ff ff 02 06
0x0060 00 01 00 40 1c 12 07 40 ff ff ff ff ff ff ff ff
0x0070 ff ff ff ff ff ff ff ff ff ff ff ff ff ff 83 18
# uname -rmpio
2.6.9-34.ELsmp x86_64 x86_64 x86_64 GNU/Linux
Bug 194460 is closed as UPSTREAM. Can you try the test kernels here?
Those have a very late driver from upstream. Please give them a try and post
the results here...thanks!
Closed due to lack of response. Please reopen when the requested information