Description of problem:
Kernels built with gcc-2.96-113 are vulnerable to a deadlock when the current
process has traversed 5 successive symlinks on an NFS file system and the
kernel takes an IRQ. With less than 1KiB remaining on the process's kernel
stack, the service routine prints a stack trace rather than servicing the
interrupt. If the system console is on a serial port, this results in more IRQs
from the UART, which results in a deadlock.
Somehow, gcc-2.96-113 makes this condition far more likely. Kernels built with
gcc-2.96-112 or gcc3 do not exhibit the problem. The default rsize/wsize of 4KiB
also makes the problem less likely to occur. With rsize=wsize=16KiB, the problem
shows up reliably.
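For illustration only, here is a minimal Python sketch (not the attached perl script) of how a chain of five successive symlinks like the one described above might be set up. The function name make_symlink_chain and the file names are hypothetical. To approximate the reported condition the directory would have to live on an NFS mount, which this sketch does not arrange.

```python
import os
import tempfile


def make_symlink_chain(base_dir, depth=5):
    """Create a regular file and `depth` symlinks, each pointing at the previous one.

    Hypothetical helper for illustration; the names used here are assumptions
    and do not come from the attached script.
    """
    target = os.path.join(base_dir, "target.txt")
    with open(target, "w") as f:
        f.write("payload\n")

    links = []
    prev = target
    for i in range(depth):
        link = os.path.join(base_dir, f"link{i}")
        os.symlink(prev, link)  # link0 -> target.txt, link1 -> link0, ...
        links.append(link)
        prev = link
    return target, links


if __name__ == "__main__":
    # A temporary local directory is used here; point base_dir at an NFS
    # mount to exercise the NFS lookup path instead.
    with tempfile.TemporaryDirectory() as d:
        target, links = make_symlink_chain(d)
        # Opening the last link makes the kernel follow all five symlinks.
        with open(links[-1]) as f:
            print(f.read().strip())
```

Opening the outermost link forces the kernel to resolve every link in the chain during a single lookup, which is the path-walk situation the description refers to.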
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. On an NFS file system with rsize=wsize=16KiB, run the attached perl script
with --jobs=30 --net.
2. Start three to eight large transfers off the box using scp.
3. Watch the serial console.
Actual Results: Kernel stack trace messages appear on the serial console and
the system deadlocks.
Expected Results: Load should go up to above 30. The large copies should run to
completion. The system should stay up.
Additional info: Sample stack trace
Created attachment 95517 [details]
Script to deadlock kernels built with gcc-2.96-113
With cwd in an NFS mount with rsize=wsize=16KiB, run this script with --jobs=30.
Start 3-8 large file transfers off the box. (I use scp with a 500MiB file.)
Watch the serial console. The system will deadlock before the transfers complete.
Created attachment 95518 [details]
Sample console messages when kernel deadlocks.
I placed the severity at "security" because this is essentially a
denial-of-service attack on the affected kernel. A local user with normal
privileges can deadlock the system.
I noticed that a new kernel errata,
https://rhn.redhat.com/errata/RHBA-2003-394.html, mentions this bug...
but it doesn't say why. Does this mean that the kernel *wasn't*
compiled with gcc-2.96-113? If so, what version does Red Hat recommend?
The new kernel was compiled with gcc-2.96-126, which is an unreleased
version. I haven't tested this kernel yet, but I will soon. Since they
call out this bug, I'm assuming the unreleased gcc addresses the issue.
But if you want to build your own kernel, the workaround of
downgrading to gcc-2.96-112 is the only solution I'm aware of. Perhaps
Red Hat will fix bug #87659 by releasing the updated gcc before 7.3
end-of-life next week. The fix for this bug is well over half a loaf,
however, since most installations won't be running custom kernels.
I haven't found a gcc-2.96-124 myself, but I _did_ find a SRPM of
gcc-2.96-124 as an update to the 2.1 enterprise version of RHL.
One of the changes from -113 to -124 is apparently a fix for some
"excessive stack usage caused by the -fno-strict-aliasing patch".
Doesn't that sound relevant? I'm no expert, but I think it would
be interesting to take the relevant new patches from gcc-2.96-124
and stick them into gcc-2.96-113, rebuild gcc, and use that to rebuild
The patch you are apparently referring to,
gcc-strict-alias-optimization2.patch, seems to address the problem
when applied to the gcc-2.96-113 SRPM. At least, a variant of my crash
script doesn't crash a kernel built with the resulting gcc. The patch
applied cleanly and the gcc build went smoothly. However I'm not in a
position to judge if this patch, applied in isolation, is a good
general fix for production systems.
One for Progeny, I guess.
*** Bug 91566 has been marked as a duplicate of this bug. ***