Bug 125738

Summary: network load and 4G/4G patch kills the kernel
Product: [Fedora] Fedora
Reporter: Robin Humble <humble+fedora>
Component: kernel
Assignee: Dave Jones <davej>
Status: CLOSED CURRENTRELEASE
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 2
CC: byte, pfrields, wtogami
Hardware: i686   
OS: Linux   
Fixed In Version: 2.6.8-1.521
Doc Type: Bug Fix
Last Closed: 2004-10-07 05:06:33 UTC

Description Robin Humble 2004-06-10 18:35:57 UTC
Description of problem:
Sending messages larger than 4M with netpipe-2.4 hangs/kills the kernel.

netpipe is a common network latency and bandwidth measuring tool.
When transmitting with netpipe-2.4, once the message size ramps up past
4M the transmitting machine hangs (always by 8M, sometimes as early as
4M or 6M). This happens with the 358, 414 and 422 builds of the i686
kernel, on 3 separate machines with 3 different sorts of network cards,
talking to different clients over different switches. The receiving
machine is fine.
The hung (transmitting) machine is pingable and alt-sysrq works, but
nothing else does: no ctrl-alt-del, no console switching, no mouse
movement, no ctrl-alt-f1, no text entry, no remote login.

Rebuilding any of these kernel rpms (we have rebuilt 414 and 422)
without the 4G/4G patch (patch520) fixes the problem. We removed all
the patches one by one, and the 4G/4G patch is the one that is broken.

This doesn't happen on build 358 with an x86_64 machine in 64bit mode,
but does on the same machine in 32bit mode. It also doesn't happen
with fc1 or with stock kernel.org 2.6.6 kernels. This, plus the range
of machines tested, means it's not a hardware problem.

The kernels are uni-processor with only the standard fc2 boot options
(init level 3 usually) and no tainting.

I think the kernel is in some sort of infinite loop rather than
actually dead (ping and sysrq still work), but I don't know the
right way to describe it... hung? deadlocked?
You can alt-sysrq-t to get a trace of the functions, but I have no way
to copy this information down to send it to you.


Version-Release number of selected component (if applicable):
builds 358 (fc2 release), 414, 422

How reproducible:
Always

Steps to Reproduce:
1. install fc2
2. get netpipe from
http://www.scl.ameslab.gov/netpipe/code/NetPIPE_2.4.tar.gz
3. unpack, make
4. run "NPtcp -r -l 4194304" on some other machine
5. run "NPtcp -t -h otherMachine -l 4194304 -P" on the fc2 machine
6. wait for a while until 6 or 8M messages start
7. kernel hangs
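
For anyone scripting steps 4 and 5, here is a minimal sketch that only assembles the two NPtcp command lines. The helper name `nptcp_commands` and its parameters are made up for illustration and are not part of NetPIPE; the only flags used are the ones quoted in the steps (-r, -t, -h, -l, -P).

```python
def nptcp_commands(receiver_host, start_bytes=4194304, binary="NPtcp"):
    """Return (receiver_argv, transmitter_argv) for steps 4 and 5.

    Hypothetical helper, not part of NetPIPE. It only assembles the
    flags quoted in the report: -r, -t, -h, -l and -P.
    """
    # Step 4: the receiving side, run on some other machine.
    receiver = [binary, "-r", "-l", str(start_bytes)]
    # Step 5: the transmitting side, run on the fc2 machine under test.
    transmitter = [binary, "-t", "-h", receiver_host,
                   "-l", str(start_bytes), "-P"]
    return receiver, transmitter


if __name__ == "__main__":
    rx, tx = nptcp_commands("otherMachine")
    print("on the other machine:", " ".join(rx))
    print("on the fc2 machine:  ", " ".join(tx))
```

Start the receiver first. The transmitter is the side that hangs, so run it on a machine you can afford to reset.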


Actual Results:  the first 3 to 8 lines of the output below, then a hang


Expected Results:
% ./NPtcp -t -h lynx -l 4194304 -P
Latency: 0.000065
Now starting main loop
  0:   4194301 bytes    7 times -->   89.50 Mbps in 0.357560 sec
  1:   4194304 bytes    7 times -->   89.49 Mbps in 0.357596 sec
  2:   4194307 bytes    7 times -->   89.47 Mbps in 0.357658 sec
  3:   6291453 bytes    7 times -->   89.51 Mbps in 0.536236 sec
  4:   6291456 bytes    7 times -->   89.45 Mbps in 0.536634 sec
  5:   6291459 bytes    7 times -->   89.54 Mbps in 0.536081 sec
  6:   8388605 bytes    7 times -->   89.48 Mbps in 0.715248 sec
  7:   8388608 bytes    7 times -->   89.44 Mbps in 0.715546 sec
  8:   8388611 bytes    7 times -->   89.54 Mbps in 0.714799 sec
  9:  10485757 bytes    7 times -->   89.47 Mbps in 0.894201 sec
 10:  10485760 bytes    7 times -->   89.28 Mbps in 0.896100 sec
 11:  10485763 bytes    7 times -->   89.47 Mbps in 0.894173 sec
 12:  14680061 bytes    7 times -->   89.42 Mbps in 1.252467 sec
 13:  14680064 bytes    7 times -->   89.47 Mbps in 1.251798 sec
 14:  14680067 bytes    7 times -->   89.44 Mbps in 1.252292 sec
%
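
Since the hang shows up as truncated output, one way to record where a run stopped is to parse whatever result lines were saved (for example, copied from the terminal) and report the last message size that completed. This is only a sketch: the regular expression is an assumption based on the line layout shown above, and the function names are made up for illustration.

```python
import re

# Assumed layout, taken from the result lines in the output above, e.g.
#   "  3:   6291453 bytes    7 times -->   89.51 Mbps in 0.536236 sec"
LINE_RE = re.compile(
    r"^\s*(\d+):\s+(\d+) bytes\s+(\d+) times -->\s+([\d.]+) Mbps in ([\d.]+) sec"
)


def parse_results(text):
    """Return a list of (index, size_bytes, mbps), one per result line."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            rows.append((int(m.group(1)), int(m.group(2)), float(m.group(4))))
    return rows


def last_completed(text):
    """Return the last message size (bytes) that finished, or None."""
    rows = parse_results(text)
    return rows[-1][1] if rows else None


if __name__ == "__main__":
    # First three result lines of the expected output, as if the run
    # had hung right after them.
    sample = """\
  0:   4194301 bytes    7 times -->   89.50 Mbps in 0.357560 sec
  1:   4194304 bytes    7 times -->   89.49 Mbps in 0.357596 sec
  2:   4194307 bytes    7 times -->   89.47 Mbps in 0.357658 sec
"""
    size = last_completed(sample)
    print(size, "bytes, about", round(size / 2**20, 2), "MiB")
```

The sizes in the output cluster in threes around 4, 6, 8, 10 and 14 MiB (each value minus 3 bytes, exact, plus 3 bytes), so the last completed size says directly whether a run died at the 4M, 6M or 8M group.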


Additional info:

The -l 4194304 option to netpipe is optional and just saves time waiting
for the small messages to complete.
The latest netpipe (3.6.1) http://www.scl.ameslab.gov/netpipe/ seems to
have a workaround for this Linux bug and doesn't kill the kernel. You
run that with just "NPtcp" (receive) and "NPtcp -h otherMachine" (transmit).

Comment 1 Warren Togami 2004-09-09 10:21:32 UTC
Have you tried newer kernels?


Comment 2 Robin Humble 2004-09-10 02:50:00 UTC
I just tried it on 2.6.8-1.521 and on 2.6.7-1.494.2.2 and the machines
didn't die. So someone has magically fixed the bug in the 3 months
that this bug report was sitting here. cool :-)