Description of problem:
Sending >4M messages with netpipe-2.4 hangs/kills the kernel.

netpipe is a common network latency and bandwidth measuring tool. When transmitting with netpipe-2.4, once the message size ramps up past 4M (always by 8M, sometimes by 4M or 6M), it hangs the 358, 414 and 422 builds of the i686 kernel. This happens on 3 separate machines with 3 different sorts of network cards, talking to different clients over different switches.

The receiving machine is fine. The hung (transmitting) machine is pingable and alt-sysrq works, but nothing else does: no ctrl-alt-del, no console switching, no mouse movement, no ctrl-alt-f1, no text entry, no remote login.

Rebuilding any of these kernel rpms (we have rebuilt 414 and 422) without the 4g/4g patch (patch520) fixes the problem. We removed all the patches one by one, and the 4g/4g patch is the one that is broken.

This doesn't happen on build 358 with an x86_64 machine in 64bit mode, but does on the same machine in 32bit mode. It also doesn't happen with fc1 or with stock kernel.org 2.6.6 kernels. This, plus the range of machines tested, means it's not a hardware problem.

The kernels are uni-processor with only the standard fc2 boot options (init level 3 usually) and no tainting.

I think the kernel is in some sort of infinite loop rather than actually dead (you can still ping and sysrq), but I don't know the right way to describe it... hung? deadlock? You can alt-sysrq-t to get a trace of the functions, but I have no way to copy down this information to send it to you.

Version-Release number of selected component (if applicable):
builds 358 (fc2 release), 414, 422

How reproducible:
Always

Steps to Reproduce:
1. install fc2
2. get netpipe from http://www.scl.ameslab.gov/netpipe/code/NetPIPE_2.4.tar.gz
3. unpack, make
4. run "NPtcp -r -l 4194304" on some other machine
5. run "NPtcp -t -h otherMachine -l 4194304 -P" on the fc2 machine
6. wait for a while until 6 or 8M messages start
7. kernel hangs

Actual Results:
first 3 to 8 lines of the below, then hang

Expected Results:
% ./NPtcp -t -h lynx -l 4194304 -P
Latency: 0.000065
Now starting main loop
0: 4194301 bytes 7 times --> 89.50 Mbps in 0.357560 sec
1: 4194304 bytes 7 times --> 89.49 Mbps in 0.357596 sec
2: 4194307 bytes 7 times --> 89.47 Mbps in 0.357658 sec
3: 6291453 bytes 7 times --> 89.51 Mbps in 0.536236 sec
4: 6291456 bytes 7 times --> 89.45 Mbps in 0.536634 sec
5: 6291459 bytes 7 times --> 89.54 Mbps in 0.536081 sec
6: 8388605 bytes 7 times --> 89.48 Mbps in 0.715248 sec
7: 8388608 bytes 7 times --> 89.44 Mbps in 0.715546 sec
8: 8388611 bytes 7 times --> 89.54 Mbps in 0.714799 sec
9: 10485757 bytes 7 times --> 89.47 Mbps in 0.894201 sec
10: 10485760 bytes 7 times --> 89.28 Mbps in 0.896100 sec
11: 10485763 bytes 7 times --> 89.47 Mbps in 0.894173 sec
12: 14680061 bytes 7 times --> 89.42 Mbps in 1.252467 sec
13: 14680064 bytes 7 times --> 89.47 Mbps in 1.251798 sec
14: 14680067 bytes 7 times --> 89.44 Mbps in 1.252292 sec
%

Additional info:
The -l 4194304 option to netpipe is optional and just saves time waiting for the small messages to complete.

The latest netpipe (3.6.1), http://www.scl.ameslab.gov/netpipe/, seems to have a workaround for this Linux bug and doesn't kill the kernel. You run that with just "NPtcp" (recv) and "NPtcp -h otherMachine" (transmit).
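Since the hang shows up as truncated NPtcp output, it can help to record the last message size that completed on each run. Below is a rough Python sketch for that, assuming the output was captured to a string (e.g. from a log kept on another machine). The function names are made up for illustration and are not part of netpipe. The 2**20 divisor in recomputed_mbps is only inferred from the figures in this report, not taken from the netpipe source.

```python
import re

# Matches result lines of the shape shown in this report, e.g.
#   "3: 6291453 bytes 7 times --> 89.51 Mbps in 0.536236 sec"
LINE_RE = re.compile(
    r"(\d+):\s+(\d+) bytes\s+(\d+) times -->\s+([\d.]+) Mbps in\s+([\d.]+) sec"
)


def parse_netpipe(text):
    """Return one dict per completed result line found in text."""
    rows = []
    for m in LINE_RE.finditer(text):
        idx, size, reps, mbps, secs = m.groups()
        rows.append({
            "index": int(idx),
            "bytes": int(size),
            "reps": int(reps),
            "mbps": float(mbps),
            "sec": float(secs),
        })
    return rows


def last_completed_size(text):
    """Message size (bytes) of the last line printed, or None if none."""
    rows = parse_netpipe(text)
    return rows[-1]["bytes"] if rows else None


def recomputed_mbps(row):
    """Recompute throughput; assumes Mbps here means 2**20 bits per second."""
    return row["bytes"] * 8 / row["sec"] / 2**20


# Example: output cut off after three lines, as in a run that then hung.
SAMPLE = """Latency: 0.000065
Now starting main loop
0: 4194301 bytes 7 times --> 89.50 Mbps in 0.357560 sec
1: 4194304 bytes 7 times --> 89.49 Mbps in 0.357596 sec
2: 4194307 bytes 7 times --> 89.47 Mbps in 0.357658 sec
"""

if __name__ == "__main__":
    print("last completed size:", last_completed_size(SAMPLE))
```

Running this over the output of several hung runs would show whether the last completed size is always in the 4M to 8M range described above.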
Have you tried newer kernels?
I just tried it on 2.6.8-1.521 and on 2.6.7-1.494.2.2 and the machines didn't die. So someone has magically fixed the bug in the 3 months that this bug report was sitting here. Cool :-)