Bug 190262

Summary: channel bonding causes system lockup / freeze
Product: Red Hat Enterprise Linux 4
Reporter: Randy Zagar <zagar>
Component: kernel
Assignee: John W. Linville <linville>
Status: CLOSED CANTFIX
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 4.0
CC: jbaron, mmahudha
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2006-10-12 15:07:47 UTC
Attachments:
Description: Kernel messages captured by syslog prior to latest lockup
Flags: none

Description Randy Zagar 2006-04-29 16:05:57 UTC
Description of problem:

The system has locked up several times under RHEL-4-U3-i386 since channel
bonding was enabled. It only seems to happen during extended high-speed
transfers (> 20K packets/sec), and only after a long period of time. The bond
currently uses balance-rr mode.

Version-Release number of selected component (if applicable):
2.6.9-34.ELsmp

How reproducible:
Three lockups this week on two different systems.  

Steps to Reproduce:
1. rsync large filesystems from multiple servers simultaneously
2. wait several hours
  
Actual results:
system locks up

Expected results:
system continues to operate normally

Additional info:
APPRO S2228X 2U server, Tyan S2721-533 motherboard, Intel FW82546EB gig-E
controller.  Systems have been operating reliably for 2 years prior to use of
bonding driver (balance-rr).

Will attach interesting syslog information after submission

Comment 1 Randy Zagar 2006-04-29 16:09:07 UTC
Created attachment 128396 [details]
Kernel messages captured by syslog prior to latest lockup

Comment 2 Dan Carpenter 2006-04-29 19:25:29 UTC
Obviously the hang is bad, but you have a 10.8.17.227 and that's messed up too.

Has 10.8.17.227 been messed up for a long time or is that something new this week?


Comment 3 Randy Zagar 2006-05-01 06:23:22 UTC
10.8.17.227 works just fine on my private non-routable network...
What do _you_ think should be wrong with this?

Comment 4 Jason Baron 2006-05-01 15:11:38 UTC
Hmm, I believe this should already be fixed in the beta. Please try:
http://people.redhat.com/~jbaron/rhel4/

Comment 5 Jason Baron 2006-05-31 18:05:03 UTC
Any test results?

Comment 6 John W. Linville 2006-06-02 14:13:41 UTC
The log looks more like an e1000 problem, although it is possible that the 
bonding transmit scheduler is interfering w/ the expected operation of the 
driver. 
 
Have you tried any other bonding modes?  Mode 2 or even mode 4 are likely to 
be "drop-in" replacements for mode 0.  Please note that after changing the 
"mode=..." option in modprobe.conf you will need to either reboot or 
explicitly remove the bonding module before reloading it. 
 
Just to be clear, are you able to successfully complete the same operation 
using only individual e1000 interfaces rather than a bond? 
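
A minimal sketch of the mode change described above. It only prints a plan and executes nothing. The numeric-to-name mapping follows the bonding driver's documented modes; the bond name bond0, the miimon=100 value, and the MODE variable are assumptions for illustration and do not come from this report. Restarting the network service is one coarse way to get the module unloaded and reloaded, and it interrupts all networking while it runs.

```shell
#!/bin/sh
# Sketch: print the modprobe.conf line and reload steps for a bonding mode.
# Nothing is executed; bond0, miimon=100 and MODE are illustrative assumptions.

# Map a numeric bonding mode to its symbolic name.
mode_name() {
    case "$1" in
        0) echo balance-rr ;;
        1) echo active-backup ;;
        2) echo balance-xor ;;
        3) echo broadcast ;;
        4) echo 802.3ad ;;
        5) echo balance-tlb ;;
        6) echo balance-alb ;;
        *) echo "unknown bonding mode: $1" >&2; return 1 ;;
    esac
}

# Print what to put in /etc/modprobe.conf and how to reload the module.
show_plan() {
    name=$(mode_name "$1") || return 1
    echo "# /etc/modprobe.conf"
    echo "alias bond0 bonding"
    echo "options bond0 mode=$name miimon=100"
    echo "# then, as root:"
    echo "service network stop"
    echo "modprobe -r bonding"
    echo "service network start"
}

show_plan "${MODE:-2}"
```

Running it with MODE=4 prints the same plan for 802.3ad, which additionally needs matching link aggregation support on the switch.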

Comment 7 John W. Linville 2006-06-02 14:40:34 UTC
The log output looks a lot like bug 182215 (a Fedora bug), FWIW... 

Comment 8 Randy Zagar 2006-06-02 16:45:49 UTC
Unfortunately, I no longer have e1000 systems available for testing.  I have had
two server failures and all my test equipment is now in full production.  For
those of you working for Red Hat, there is an open service request with more
information.  It is Service Request 874560.

I do, however, have 4 HP DL360s with dual Broadcom BCM5703X adapters.

I will try to reproduce the problem on those systems... 8 copies of netcat
shoving local:/dev/zero into remote:/dev/null ought to do the trick...
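
A rough sketch of that load generator. It only prints the sender commands, one per stream. The host name, base port, and stream count are hypothetical placeholders. Each port needs a listener on the remote side that discards its input, and the listener syntax depends on the netcat variant (traditional netcat uses `nc -l -p PORT`, the OpenBSD variant uses `nc -l PORT`).

```shell
#!/bin/sh
# Sketch: print N parallel netcat senders streaming /dev/zero to a remote sink.
# HOST, BASE_PORT and STREAMS are hypothetical values for illustration.
HOST="${HOST:-remote.example}"
BASE_PORT="${BASE_PORT:-5001}"
STREAMS="${STREAMS:-8}"

# Print one backgrounded sender per stream on consecutive ports, then "wait".
# Arguments: host, base port, number of streams.
sender_cmds() {
    i=0
    while [ "$i" -lt "$3" ]; do
        echo "nc $1 $(( $2 + i )) < /dev/zero &"
        i=$(( i + 1 ))
    done
    echo "wait"
}

sender_cmds "$HOST" "$BASE_PORT" "$STREAMS"
```

Piping the output into `sh` on the sending host would start the streams in parallel once the remote listeners are up.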

Comment 9 John W. Linville 2006-06-05 16:24:06 UTC
Randy, could you try "ethtool -K eth0 tso off" (replace eth0 as appropriate)?   
Please do so and post the results here...thanks! 
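
For a bonded setup, the same setting would have to be applied to each physical slave. A minimal sketch, assuming the bond is named bond0 and the kernel exposes its status in /proc/net/bonding/bond0. The BOND, PROC_FILE, and RUN variables are hypothetical knobs for this example. By default it only prints the commands; with RUN=1 it executes them, which requires root.

```shell
#!/bin/sh
# Sketch: turn TSO off on every slave of a bond (dry run unless RUN=1).
# BOND, PROC_FILE and RUN are hypothetical knobs, not taken from this report.
BOND="${BOND:-bond0}"
PROC_FILE="${PROC_FILE:-/proc/net/bonding/$BOND}"

# Print the slave interface names listed in a bonding status file.
list_slaves() {
    sed -n 's/^Slave Interface: *//p' "$1"
}

# Print (or, with RUN=1, execute) the ethtool command for each slave.
tso_off_all() {
    for dev in $(list_slaves "$1"); do
        if [ "${RUN:-0}" = "1" ]; then
            ethtool -K "$dev" tso off
        else
            echo "ethtool -K $dev tso off"
        fi
    done
}

if [ -r "$PROC_FILE" ]; then
    tso_off_all "$PROC_FILE"
else
    echo "no bonding status file at $PROC_FILE" >&2
fi
```

`ethtool -k DEV` (lowercase k) shows the current offload settings, which is a quick way to confirm the change took effect.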

Comment 10 Mustafa Mahudhawala 2006-06-14 06:46:26 UTC
Copied from BZ #194460 (for FC5):

One of our internal test file servers had the same problem yesterday, and it
took the network down along with it (very serious).

Only eth0, i.e. the onboard 82573V, was in use at the time of the problem.
This interface has since been taken down, and the server is currently running
off the other onboard NIC.

# lspci
00:00.0 Host bridge: Intel Corporation E7230 Memory Controller Hub
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.2 SATA controller: Intel Corporation 82801GR/GH (ICH7 Family) Serial ATA Storage Controller AHCI (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
03:00.0 Ethernet controller: Intel Corporation 82573V Gigabit Ethernet Controller (Copper) (rev 03)
04:04.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
04:05.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05)

# lspci -n
00:00.0 Class 0600: 8086:2778
00:1c.0 Class 0604: 8086:27d0 (rev 01)
00:1c.4 Class 0604: 8086:27e0 (rev 01)
00:1c.5 Class 0604: 8086:27e2 (rev 01)
00:1d.0 Class 0c03: 8086:27c8 (rev 01)
00:1d.1 Class 0c03: 8086:27c9 (rev 01)
00:1d.2 Class 0c03: 8086:27ca (rev 01)
00:1d.3 Class 0c03: 8086:27cb (rev 01)
00:1d.7 Class 0c03: 8086:27cc (rev 01)
00:1e.0 Class 0604: 8086:244e (rev e1)
00:1f.0 Class 0601: 8086:27b8 (rev 01)
00:1f.1 Class 0101: 8086:27df (rev 01)
00:1f.2 Class 0106: 8086:27c1 (rev 01)
00:1f.3 Class 0c05: 8086:27da (rev 01)
03:00.0 Class 0200: 8086:108b (rev 03)
04:04.0 Class 0300: 1002:515e (rev 02)
04:05.0 Class 0200: 8086:1076 (rev 05)

ifconfig before taking down the problem NIC:

# cat ifconfig.out 
eth0      Link encap:Ethernet  HWaddr 00:13:20:D6:AD:E3
          inet addr:10.65.6.1  Bcast:10.65.6.255  Mask:255.255.255.0
          inet6 addr: fe80::213:20ff:fed6:ade3/64 Scope:Link
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:84132882 errors:297966072 dropped:297966072 overruns:297966072 frame:0
          TX packets:10677632885 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:75213992657 (70.0 GiB)  TX bytes:854693824469 (795.9 GiB)
          Base address:0x2000 Memory:88100000-88120000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:6082 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6082 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:868538 (848.1 KiB)  TX bytes:868538 (848.1 KiB)

# ethtool -e eth0
Offset          Values
------          ------
0x0000          00 13 20 d6 ad e3 30 0b 46 f7 01 10 ff ff ff ff
0x0010          ff ff ff ff 6b 02 a3 30 86 80 8b 10 86 80 de 80
0x0020          00 00 00 20 14 7e 00 00 00 00 d8 00 00 00 00 27
0x0030          c9 6c 50 31 22 07 0b 04 84 09 00 00 00 c0 06 07
0x0040          08 10 00 00 04 0f ff 7f 01 4d ff ff ff ff ff ff
0x0050          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0060          00 01 00 40 1c 12 07 40 ff ff ff ff ff ff ff ff
0x0070          ff ff ff ff ff ff ff ff ff ff ff ff ff ff 22 57
# ethtool -e eth1
Offset          Values
------          ------
0x0000          00 13 20 d6 ad e4 10 02 ff ff 00 10 ff ff ff ff
0x0010          ff ff ff ff 0b 64 a1 30 86 80 76 10 86 80 84 b2
0x0020          dd 20 22 22 00 00 90 2f 80 23 12 00 20 1e 12 00
0x0030          20 1e 12 00 20 1e 12 00 20 1e 09 00 00 02 00 00
0x0040          0c 00 a6 93 0b 28 00 00 00 04 ff ff ff ff ff ff
0x0050          ff ff ff ff ff ff ff ff ff ff ff ff ff ff 02 06
0x0060          00 01 00 40 1c 12 07 40 ff ff ff ff ff ff ff ff
0x0070          ff ff ff ff ff ff ff ff ff ff ff ff ff ff 83 18

# uname -rmpio
2.6.9-34.ELsmp x86_64 x86_64 x86_64 GNU/Linux

Comment 11 John W. Linville 2006-08-14 20:07:20 UTC
Bug 194460 is closed as UPSTREAM.  Can you try the test kernels here?

   http://people.redhat.com/linville/kernels/rhel4/

Those have a very late driver from upstream.  Please give them a try and post 
the results here...thanks!

Comment 12 John W. Linville 2006-10-12 15:07:47 UTC
Closed due to lack of response.  Please reopen when the requested information 
becomes available...thanks!