Red Hat Bugzilla – Bug 200885
kernel BUG triggered by Java app on x86_64 (32GB; 16-core Tyan VX50)
Last modified: 2012-06-20 09:26:59 EDT
Description of problem:
A "workflow" of a Java user crashes our server reproducibly. The server, which
runs at update 3, is otherwise stable over weeks.
Version-Release number of selected component (if applicable):
kernel-largesmp-2.6.9-34.0.2.EL. I also tried kernel-largesmp-2.6.9-42.EL
from J.Baron's website, with the same result.
Steps to Reproduce:
1. user loads Java app
crash as shown in attached log
dmesg output is at http://strucbio.biologie.uni-konstanz.de/~kay/dmesg
the crashlog is also at http://strucbio.biologie.uni-konstanz.de/~kay/crash2 ,
and a kernel-largesmp-2.6.9-42.EL crashlog is at
http://strucbio.biologie.uni-konstanz.de/~kay/crash (the latter is tainted by
VMware modules, but the crashes are unrelated)
Created attachment 133387 [details]
log obtained by netconsole
Created attachment 133388 [details]
Created attachment 133389 [details]
output of "top" when the machine crashes; note the high "sy" fraction
Created attachment 133390 [details]
output of "vmstat 1" when machine crashes
looks like cpu 5 is spinning on the sighand->siglock. The question is who is
holding it for so long... We could try a debug patch to figure this out, if you
can reliably reproduce it. Is it easier for us to give you a debug kernel, or is
it possible that we can get the reproducer?
A debug kernel would be fine.
hmmm...can you set up netdump? it looks like you already have netconsole.
this would really help us narrow down the cause, possibly without relying on a debug kernel.
my intention had indeed been to capture vmcore on the netdump server. However,
so far only a file
-rw------- 1 netdump netdump 0 Aug 1 12:44 vmcore-incomplete
is written, whereas the "log" is indeed captured ok in the same directory. What
could prevent the vmcore from being written? The only lines in /etc/sysconfig/netdump on
the client that do not start with "# " are
where aaa.bbb.ccc.ddd are actually the IP of the netdump-server, which is a
fully updated RHEL4 clone running 2.6.9-22.0.1.EL. I realize that the SYSLOGADDR
does nothing useful, as syslogd on the netdump-server is not running with -r.
However, the netlog portion of netdump does work. Could it be because the server
has only 14G free on the partition which has /var/crash, whereas the client has
32GB of memory?
So, before I can provide a vmcore I need help with that. Thanks.
> Could it be because the server has only 14G free on the partition
> which has /var/crash, whereas the client has 32GB of memory?
That is correct: before attempting to create the vmcore file,
the netdump-server does a space check on that partition.
If that check fails (or any of several other error conditions
occur), the vmcore creation is suspended, an error message is
logged, and the dumpfile is left named "vmcore-incomplete".
If you look in the /var/log/messages file on the netdump-server,
you will find the reason for failure. Although there are several
error possibilities, you'll probably see a "No space for dump image"
message generated by the netdump-server daemon.
BTW, if this can be reproduced reliably with a smaller amount
of memory, you could try booting the kernel with the "mem=Xg"
command line argument, where X is the number of gigabytes you'd
like to restrict the kernel to use. Perhaps "mem=8g" might be
sufficient to reproduce the bug? That would reduce the
possibility of any other vmcore-creation failure and make it
a hell of a lot easier to ship the vmcore around.
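The space check described above can be approximated before a crash is provoked. This is only a sketch: it assumes the dump needs about as much free space as the client has RAM, which may not be the exact threshold the netdump-server daemon uses, and the function names are made up for illustration.

```python
import shutil

GIB = 1024 ** 3


def has_room_for_dump(free_bytes, client_ram_bytes):
    """Assumed rule: the dump partition must hold at least the client's RAM."""
    return free_bytes >= client_ram_bytes


def check_crash_dir(path, client_ram_bytes):
    """Apply the assumed rule to the partition that holds `path`."""
    return has_room_for_dump(shutil.disk_usage(path).free, client_ram_bytes)


# The situation in this report: 14G free on the server, 32GB client.
print(has_room_for_dump(14 * GIB, 32 * GIB))
# The same partition after booting the client with mem=8g.
print(has_room_for_dump(14 * GIB, 8 * GIB))
```

On the netdump-server itself, something like check_crash_dir("/var/crash", 32 * GIB) would test the real partition.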
Ok, I found the netdump-server lines in the server's /var/log/messages which
indicated that the space was insufficient.
I moved /var/crash to an empty partition so that should no longer be a problem.
We will try to reproduce the problem tomorrow morning, with mem=8G.
I have an 8GB vmcore (vmcore.bz2 is only 1.1GB), and the "log", from the
2.6.9-42.ELlargesmp kernel from Jason's website. Where can I ftp the files?
Any chance you have a publicly available web site
that it can be downloaded from (at least temporarily)?
Normally vmcore transfers go through our support engineering
ftp site, but that might require that this bugzilla be
escalated via your support contract. They will open an
issue tracker, and handle all the details:
Red Hat Global Support Services at 1-888-RED-HAT1
I also now got a 4GB vmcore (.bz2 is only 441805515 bytes) from netdump, and its
"log". This is from the 2.6.9-34.0.2 kernel (CentOS-compiled).
One more datapoint: we were unable to crash the 18.104.22.168 kernel , so it does
appear to be a 2.6.9 bug.
I put the files into http://strucbio.biologie.uni-konstanz.de/~kay/crashdumps.1
and crashdumps.2 , respectively .
I'll remove the directories
http://strucbio.biologie.uni-konstanz.de/~kay/crashdumps.1 (2.6.9-42 kernel, 8GB memory) and
http://strucbio.biologie.uni-konstanz.de/~kay/crashdumps.2 (2.6.9-34.0.2 kernel,
4 GB memory) tomorrow.
Looking forward to a fix (other than 22.214.171.124)!
after looking at this a bit, i was wondering if there is just a lot of lock
contention, and whether disabling the nmi_watchdog by default would make the panic
go away. thus, can you please try booting with "nmi_watchdog=0" on the kernel
command line and see if you can still reproduce the crash? thanks.
I disabled the nmi_watchdog and indeed, since then, the Java user has not been
able to trigger the problem! It is maybe a bit early to be entirely sure, but
according to our experience so far this is significant progress.
A couple of questions arise:
a) Is there a difference between the nmi_watchdog in the vanilla kernel (which
survived the "Java test") and the RHEL-2.6.9-42 one (which panicked)?
b) Should I use "hangcheck_timer" instead (this is what Oracle - which is not
used on this machine - appears to recommend)?
c) Should I enable the "Watchdog Timer" in the BIOS of the machine?
d) could the 5-second timeout of the nmi_watchdog be prolonged to make it useful
again on this machine?
e) are there any other recommendations or known caveats for a 16core/32G machine?
Anyway, thanks a lot for finding the bit that appears to have caused the trouble!
ok, thanks for the feedback. However, despite the fact that there is no panic,
there still is a real problem here. The fact that the nmi watchdog can trigger
b/c a processor has interrupts disabled for 5+ seconds is unacceptable. This may,
however, be a hardware issue in that all processors aren't being treated
fairly... that would require a bit more investigation. I'm curious if there are
still 'lost timer' messages in /var/log/messages, and also if there are any
noticeable problems that you have observed.
a) i don't think the difference in the nmi_watchdog is the cause of this problem,
but they are likely to be a bit different since the source base is different.
b) 'hangcheck_timer' works by setting kernel timers and making sure they fire
within a certain interval, it does try to detect lockups but works quite
differently from the nmi watchdog...
c) i guess that's a h/w watchdog, don't know much about it...
d) that's a thought, perhaps the timeout needs to scale with the number of
e) this is the first time i've seen this problem on rhel4 x86_64, however i have
heard of similar issues with NUMA memory systems where memory access is not
uniform and thus different processors don't share locks fairly.
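To illustrate the idea in b): a timer is armed for a fixed interval, and when it fires, the elapsed time is compared against the interval plus a tolerance. The following is only a user-space sketch of that principle, not the hangcheck-timer module's code; the function name, the integer timestamps and the margin are assumptions chosen for the example.

```python
def find_stalls(wake_times, interval, margin):
    """Return (previous, current, delay) for every wake-up that came late.

    wake_times: timestamps at which a periodic timer actually fired
    interval:   how often the timer was supposed to fire
    margin:     how much lateness is tolerated before it counts as a stall
    """
    stalls = []
    for previous, current in zip(wake_times, wake_times[1:]):
        delay = (current - previous) - interval
        if delay > margin:
            stalls.append((previous, current, delay))
    return stalls


# A timer meant to fire every second that went silent between t=2 and t=9.
print(find_stalls([0, 1, 2, 9, 10], interval=1, margin=5))
```

A check of this kind only notices a stall after the timer runs again, which is one way it differs from the nmi watchdog, whose NMI can interrupt a processor while it is still stuck with interrupts disabled.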
when you say "This may however, be a hardware issue in that all processors aren't
being treated fairly", could it have something to do with the "flat" APIC model
(I don't understand the technical details) which 2.6.17 has and which 2.6.9 doesn't?
I'm not sure that the mailing-list discussion is very enlightening. First, the
topology of CPU connections in 8- (see
ftp://ftp.tyan.com/datasheets/d_s4881_104.pdf ) and 4-way servers is different
as an Opteron 8xx has only 3 HT channels. It is clear that in an 8-way server CPUs
are up to 3 hops away (on average 1.786), whereas in a 4-way that would be 1 or
2 hops (on average 1.333). Benchmark results will thus depend a lot on which
CPUs are involved. Second, the theoretical maximum speed of DDR400 memory is
6.4GB/s; Hypertransport (HT1000) between the CPUs will be limiting this to 2GB/s
for each link, and I wouldn't expect that one can simply add up all the maximum
numbers for all CPUs, unless a lot of work is devoted to the setup of the
benchmark experiment. Third, the "benchmark" numbers do not depend on the kernel
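The average hop counts quoted above can be checked with a breadth-first search over the CPU interconnect. The sketch below assumes that a 4-way board wires its CPUs in a square (two CPU-to-CPU links per socket); that wiring is an assumption for illustration and is not taken from a datasheet. The 8-way figure would need the actual link layout of the board.

```python
from collections import deque


def mean_hops(links):
    """Average shortest-path length over all ordered pairs of distinct CPUs.

    links maps each CPU to the list of CPUs it is directly connected to.
    """
    total = 0
    pairs = 0
    for start in links:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in links[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Assumed 4-way layout: a square, so each CPU has two neighbours at 1 hop
# and one CPU at 2 hops.
SQUARE_4WAY = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(round(mean_hops(SQUARE_4WAY), 3))
```

For the square this gives (1 + 1 + 2) / 3, which matches the 1.333 quoted for the 4-way case.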
Possibly not very relevant for this bug: I've started to worry a bit about
whether running irqbalance is a good idea on this server, as only cores 0-3
should handle interrupts for disks and network. /proc/interrupts shows that
currently also some of the CPUs with high numbers are engaged in interrupt
handling, and some CPUs (1,5,7,9,11-15) not at all. Should I switch to
"ONESHOT=yes" in /etc/sysconfig/irqbalance, or use a different method?
Concerning my question having to do with the "flat APIC" - this was referring to
Concerning the kernel warning "many lost ticks": this appears usually after the
kernel boots, and never comes again so I think it is harmless.
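For the /proc/interrupts observation above, a small script can list the CPUs that have handled no device interrupts. This is a sketch under assumptions: it counts only the numbered IRQ rows (so per-CPU lines such as NMI and LOC are ignored), and the sample text is invented to show the format, not copied from this machine.

```python
def cpus_without_device_irqs(text):
    """Return the CPU column names whose numbered-IRQ counts are all zero."""
    lines = text.strip().splitlines()
    cpu_names = lines[0].split()
    totals = [0] * len(cpu_names)
    for line in lines[1:]:
        fields = line.split()
        # Only rows labelled with an IRQ number ("14:") are device interrupts.
        if not fields or not fields[0].rstrip(":").isdigit():
            continue
        counts = fields[1:1 + len(cpu_names)]
        if len(counts) < len(cpu_names) or not all(c.isdigit() for c in counts):
            continue
        for i, count in enumerate(counts):
            totals[i] += int(count)
    return [name for name, total in zip(cpu_names, totals) if total == 0]


# Invented sample in the /proc/interrupts layout.
SAMPLE = """
           CPU0       CPU1       CPU2       CPU3
  0:       1200          0          0          0    IO-APIC-edge  timer
 14:        300          0         50          0    IO-APIC-edge  ide0
NMI:         10         10         10         10
LOC:       5000       5000       5000       5000
ERR:          0
"""
print(cpus_without_device_irqs(SAMPLE))
```

On a live system the input would be the contents of /proc/interrupts, e.g. open("/proc/interrupts").read().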
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life.
Please See https://access.redhat.com/support/policy/updates/errata/
If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.