From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.14; Mac_PowerPC)
Description of problem:
I have installed the latest gcc and glibc on all our systems. On a number
of heavily network-loaded systems I am seeing stack traces (do_IRQ), and
the machines crash randomly or, worse, just hang.
After first suspecting the network drivers and/or entropy
handling, I have narrowed it down to what appears to be a bug in the compiler,
glibc, or something related that shows up when you build a custom kernel.
I reverted to a binary kernel (2.4.18-27.7.xsmp) and the problem went away.
Of course this is not optimal, since we have always been able to
successfully compile custom kernels in the past, and it also means we
cannot add any patches or other changes, as a custom kernel now fails.
I am about 90% sure it is related to the compiler, but who knows.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. compile custom kernel and install
2. apply heavy network load
Actual Results: stack traces in syslog and machine hangs
Expected Results: machine shouldn't have been crashing
I have not had a chance to test this with gcc-2.96-112, but since that
was the previous compiler version, which I used to compile the 2.4.18-18.7.x
kernels that didn't exhibit this bug, I am guessing that something in
gcc-2.96-113 (and its associated software) is broken for kernel compiles.
Note that this also seems to affect Red Hat 7.2 (at that compiler version).
We see this bug too. For us, it shows up using NFS alone, or in combination with
mvfs.o. The kernel stack traces occur when the system takes an IRQ while the
current process has traversed five successive symlinks. The kernel refuses to
service the IRQ because there is less than 1KiB of kernel stack left.
Recompiling with gcc-2.96-112 fixes the problem.
The latest errata kernel for 7.x is vulnerable to this issue because it was
built with gcc-2.96-113
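For anyone unfamiliar with the check behind these traces, here is a minimal userspace sketch of the arithmetic. It assumes the i386 2.4 layout, where each task's kernel stack is an aligned 8 KiB block with the task structure at its low end, and it takes the 1 KiB threshold from the description above. The names stack_headroom, irq_would_be_refused, and the parameter reserved are made up for illustration; they are not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration: 8 KiB per-task kernel stack (i386, 2.4)
 * and the 1 KiB low-water mark mentioned in this report. */
#define THREAD_SIZE 8192u
#define STACK_WARN  1024u

/* Bytes left between the stack pointer and the reserved area at the
 * bottom of the stack block. The stack grows downward, so the offset of
 * sp within its aligned block is the space still unused below it.
 * 'reserved' stands in for the size of the structure stored at the
 * bottom of the block (hypothetical parameter). */
static long stack_headroom(uintptr_t sp, unsigned long reserved)
{
    unsigned long offset;

    /* The mask below only works if THREAD_SIZE is a power of two. */
    assert((THREAD_SIZE & (THREAD_SIZE - 1)) == 0);

    offset = (unsigned long)(sp & (THREAD_SIZE - 1));
    return (long)offset - (long)reserved;
}

/* True when less than STACK_WARN bytes of stack remain. */
static bool irq_would_be_refused(uintptr_t sp, unsigned long reserved)
{
    return stack_headroom(sp, reserved) < (long)STACK_WARN;
}
```

With a stack pointer 2000 bytes above the base of its block and 1500 bytes reserved, stack_headroom returns 500, which is under the 1 KiB mark. Each nested symlink lookup pushes the stack pointer further down, which is how a deep chain can eat the remaining headroom before an interrupt arrives.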
Created attachment 95269
Perl script to deadlock kernels built with gcc-2.96-113
Run this script with '--jobs=40 --net' with cwd in an NFS mount that has
rsize=wsize=16KiB or greater. Start 3-8 large scp transfers out from the box.
Watch the console. If you have a serial console, the system will deadlock. If
not, it may stay up, but you will still see the stack traces.
Jason, or someone, could you please change the summary of this bug to read
something like "gcc-2.96-113 produces broken kernels (DO_IRQ kernel stack trace
deadlocks)"? Had the summary read like that, I might have found this bug
three weeks ago, when I started investigating. 8)
This bug lists gcc-2.96-112 as safe, but one of my 7.2 systems running
the 2.4.18-27 kernel rebuilt with gcc-2.96-112 just died with the
do_IRQ: stack overflow errors.
As a sanity check, what does 'strings /boot/vmlinux-2.4.18-27.7x |
grep gcc' show?
Also, do_IRQ refusing to service an interrupt because there's not
enough space on the stack can happen for other reasons. Can you make
the fault happen with the attached Perl script?
Red Hat today released kernel-2.4.20-27.7, which fixes bug #108092,
one of the dependent bugs of this bug. 'strings
/boot/vmlinux-2.4.20-27.7 | grep gcc' shows that the new kernel was
built with gcc 2.96-126, which isn't released AFAIK. Here's hoping Red
Hat releases this gcc version before end-of-life of 7.3. (Or,
alternatively, *after* that date. 8)
On a hint supplied by Erling Jacobsen over on bug #108092, I applied
gcc-strict-alias-optimization2.patch from the RHEL 2.1 gcc-2.96-124
source RPM to the 7.3 gcc-2.96-113 sources. The patch applied cleanly,
and a kernel built with the resulting gcc failed to crash when
subjected to a variant of my kernel stack crash script.
It appears this patch addresses the problem this bug refers to.
Knowing this is not as helpful as it could be, however, because the
effect of applying this patch in isolation isn't known.
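For context on what a strict-aliasing optimization can change, here is a generic sketch. It is not taken from the kernel or from the patch. Under the C aliasing rules a compiler may assume that a uint32_t pointer and a uint16_t pointer never refer to the same object, so it is free to reorder or cache accesses made through them. Copying the bytes with memcpy, or building with GCC's -fno-strict-aliasing, avoids depending on that assumption. The function names load_u32 and store_u32 are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Risky pattern, shown only as a comment because its behavior is undefined:
 *
 *     uint32_t word = 0;
 *     uint16_t *half = (uint16_t *)&word;   // aliases word via another type
 *     half[0] = 0x1234;                     // compiler may not see this store
 *     return word;                          // ... and may return a stale 0
 */

/* Well-defined alternative: read a 32-bit value from a byte buffer
 * by copying its bytes, which also works for unaligned addresses. */
static uint32_t load_u32(const unsigned char *buf)
{
    uint32_t value;
    memcpy(&value, buf, sizeof value);
    return value;
}

/* Matching store: write a 32-bit value into a byte buffer. */
static void store_u32(unsigned char *buf, uint32_t value)
{
    memcpy(buf, &value, sizeof value);
}
```

A value written with store_u32 reads back unchanged through load_u32 at any optimization level, which is the property the cast-based version cannot promise.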
Yes, the networking code is known to have strict-aliasing violations.
The kernel makefiles have been updated since then to always supply