Bug 200885
Summary: kernel BUG triggered by Java app on x86_64 (32GB; 16-core Tyan VX50)

Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Status: CLOSED WONTFIX
Severity: urgent
Priority: medium
Hardware: x86_64
OS: Linux
Reporter: Kay Diederichs <kay.diederichs>
Assignee: Bhavna Sarathy <bnagendr>
QA Contact: Brian Brock <bbrock>
CC: anderson, cmc, mingo, peterm, ralf.stadelhofer, roland
Doc Type: Bug Fix
Last Closed: 2012-06-20 13:26:59 UTC
Description
Kay Diederichs
2006-08-01 11:28:22 UTC
Created attachment 133387 [details]
log obtained by netconsole
Created attachment 133388 [details]
dmesg output
Created attachment 133389 [details]
output of "top" when machine crashes; note the high "sy" fraction
Created attachment 133390 [details]
output of "vmstat 1" when machine crashes
Looks like cpu 5 is spinning on the sighand->siglock. The question is who has been holding it for so long... We could try a debug patch to figure this out, if you can reliably reproduce it. Is it easier for us to give you a debug kernel, or is it possible for us to get the reproducer?

A debug kernel would be fine.

Hmmm... can you set up netdump? It looks like you already have netconsole working. Thanks -- this would really help us narrow down the cause, possibly without relying on a debug kernel.

My intention had indeed been to capture a vmcore on the netdump server. However, so far only a file

-rw------- 1 netdump netdump 0 Aug  1 12:44 vmcore-incomplete

is written, whereas the "log" is indeed captured ok in the same directory. What could prevent the vmcore from being written? The only lines in /etc/sysconfig/netdump on the client that do not start with "# " are

NETDUMPADDR=aaa.bbb.ccc.ddd
SYSLOGADDR=aaa.bbb.ccc.ddd

where aaa.bbb.ccc.ddd is actually the IP of the netdump server, which is a fully updated RHEL4 clone running 2.6.9-22.0.1.EL. I realize that SYSLOGADDR does nothing useful, as syslogd on the netdump server is not running with -r. However, the netlog portion of netdump does work. Could it be because the server has only 14G free on the partition which holds /var/crash, whereas the client has 32GB of memory? So, before I can provide a vmcore I need help with that. Thanks.
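For reference, the client side of that setup can be written as a minimal /etc/sysconfig/netdump. This is only a sketch mirroring the two uncommented lines quoted above; aaa.bbb.ccc.ddd remains a placeholder for the real server IP:

```shell
# /etc/sysconfig/netdump (netdump client) -- minimal sketch.
# aaa.bbb.ccc.ddd is the placeholder used above for the netdump-server's IP.
NETDUMPADDR=aaa.bbb.ccc.ddd    # host that receives the vmcore over the network
SYSLOGADDR=aaa.bbb.ccc.ddd     # only useful if syslogd on that host runs with -r
```

After editing, the netdump service on the client would need to be restarted for the change to take effect.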
> Could it be because the server has only 14G free on the partition
> which has /var/crash, whereas the client has 32GB of memory?
That is correct -- before attempting to create the vmcore file,
the netdump-server does a space check on that partition.
If that check fails (or any of several other error conditions
occurs), the vmcore creation is suspended, an error message is
logged, and the dumpfile is left named "vmcore-incomplete".
If you look in the /var/log/messages file on the netdump-server,
you will find the reason for failure. Although there are several
error possibilities, you'll probably see a "No space for dump image"
message generated by the netdump-server daemon.
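The logic described above can be approximated from the shell. This is only a sketch of the check, not the daemon's actual code; the crash directory defaults to the current directory here (rather than /var/crash) so the snippet can run anywhere:

```shell
#!/bin/sh
# Sketch of the netdump-server pre-dump space check: compare free space
# in the crash directory against the client's total RAM (illustrative only).
crash_dir=${1:-.}                                   # real daemon uses /var/crash
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
free_kb=$(df -Pk "$crash_dir" | awk 'NR==2 {print $4}')
if [ "$free_kb" -lt "$mem_kb" ]; then
    echo "No space for dump image: need ${mem_kb} kB, have ${free_kb} kB"
else
    echo "space check passed: ${free_kb} kB free for a ${mem_kb} kB dump"
fi
```

Running `df -h /var/crash` on the server alongside `grep netdump /var/log/messages` is the quickest way to confirm which failure case was hit.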
BTW, if this can be reproduced reliably with a smaller amount
of memory, you could try booting the kernel with the "mem=Xg"
command line argument, where X is the number of gigabytes you'd
like to restrict the kernel to use. Perhaps "mem=8g" might
be sufficient to reproduce the bug? That would reduce the
possibility of any other vmcore-creation failure and make
it a hell of a lot easier to ship the vmcore around.
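On RHEL4 this restriction would be applied by editing the kernel line in /boot/grub/grub.conf. The entry below is a sketch only; the kernel/initrd version strings and root device are illustrative, not this machine's actual ones:

```shell
# /boot/grub/grub.conf -- boot entry with memory capped at 8 GB (sketch).
title Red Hat Enterprise Linux (2.6.9-42.ELlargesmp, mem=8g)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-42.ELlargesmp ro root=/dev/VolGroup00/LogVol00 mem=8g
        initrd /initrd-2.6.9-42.ELlargesmp.img
```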
Ok, I found the netdump-server lines in the server's /var/log/messages which indicated that the space was insufficient. I moved /var/crash to an empty partition, so that should no longer be a problem. We will try to reproduce the problem tomorrow morning, with mem=8g.

I have an 8GB vmcore (vmcore.bz2 is only 1.1GB), and the "log", from the 2.6.9-42.ELlargesmp kernel from Jason's website. Where can I ftp the files?

Any chance you have a publicly available web site that it can be downloaded from (at least temporarily)? Normally vmcore transfers go through our support engineering ftp site, but that might require that this bugzilla be escalated via your support contract. They will open an issue tracker and handle all the details: https://www.redhat.com/support/service/ or: Red Hat Global Support Services at 1-888-RED-HAT1

I also now got a 4GB vmcore (.bz2 is only 441805515 bytes) from netdump, and its "log". This is from the 2.6.9-34.0.2 kernel (CentOS-compiled). One more datapoint: we were unable to crash the 2.6.17.7 kernel, so it does appear to be a 2.6.9 bug. I put the files into http://strucbio.biologie.uni-konstanz.de/~kay/crashdumps.1 and crashdumps.2, respectively.

I'll remove the directories http://strucbio.biologie.uni-konstanz.de/~kay/crashdumps.1 (2.6.9-42 kernel, 8GB memory) and http://strucbio.biologie.uni-konstanz.de/~kay/crashdumps.2 (2.6.9-34.0.2 kernel, 4GB memory) tomorrow. Looking forward to a fix (other than 2.6.17.7)!

After looking at this a bit, I was wondering if there is just a lot of lock contention, and whether disabling the nmi_watchdog by default would make the panic go away. Thus, can you please try booting with "nmi_watchdog=0" at the kernel command line and see if you can still reproduce the crash? Thanks.

I disabled the nmi_watchdog and indeed, since then, the Java user has not been able to trigger the problem! It is maybe a bit early to be entirely sure, but according to our experience so far this is significant progress.
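If it helps to confirm that the parameter actually took effect after reboot, the running kernel can be inspected through /proc. These are standard paths, though the nmi_watchdog sysctl may be absent on kernels built without NMI watchdog support:

```shell
# Check whether nmi_watchdog=0 took effect (illustrative commands).
grep -o 'nmi_watchdog=[0-9]*' /proc/cmdline || echo "nmi_watchdog not on the command line"
# 0 means the NMI watchdog is disabled; the sysctl may not exist on all kernels.
[ -r /proc/sys/kernel/nmi_watchdog ] && cat /proc/sys/kernel/nmi_watchdog || true
```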
A couple of questions arise:
a) Is there a difference between the nmi_watchdog in the vanilla kernel (which survived the "Java test") and the RHEL 2.6.9-42 one (which panicked)?
b) Should I use "hangcheck_timer" instead (this is what Oracle -- which is not used on this machine -- appears to recommend)?
c) Should I enable the "Watchdog Timer" in the BIOS of the machine?
d) Could the 5-second timeout of the nmi_watchdog be prolonged to make it useful again on this machine?
e) Are there any other recommendations or known caveats for a 16-core/32GB machine?
Anyway, thanks a lot for finding the bit that appears to have caused the trouble!

Ok, thanks for the feedback. However, despite the fact that there is no panic, there is still a real problem here. The fact that the nmi watchdog can trigger because a processor has interrupts disabled for 5+ seconds is unacceptable. This may, however, be a hardware issue in that all processors aren't being treated fairly; that would require a bit more investigation. I'm curious whether there are still 'lost timer' messages in /var/log/messages, and also whether there are any other noticeable problems that you have observed.
a) I don't think the difference in the nmi_watchdog is the cause of this problem, but they are likely to be a bit different since the source bases are different.
b) 'hangcheck_timer' works by setting kernel timers and making sure they fire within a certain interval; it does try to detect lockups, but it works quite differently from the nmi watchdog.
c) I guess that's a h/w watchdog; I don't know much about it.
d) That's a thought; perhaps the timeout needs to scale with the number of processors.
e) This is the first time I've seen this problem on RHEL4 x86_64; however, I have heard of similar issues with NUMA memory systems, where memory access is not uniform and thus different processors don't share locks fairly.
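For comparison with (b) above: unlike the NMI watchdog, hangcheck_timer is loaded as a kernel module rather than controlled by a boot parameter. The interval values below are purely illustrative (they are the sort of settings Oracle's guidance cites), not a recommendation for this machine:

```shell
# Load the hangcheck-timer module (sketch; values are illustrative).
# hangcheck_tick: how often the timer should fire, in seconds.
# hangcheck_margin: extra delay tolerated before the system is declared hung.
modprobe hangcheck-timer hangcheck_tick=30 hangcheck_margin=180
```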
When you say "this may, however, be a hardware issue in that all processors aren't being treated fairly" -- could it have something to do with the "flat" APIC model (I don't understand the technical details) which 2.6.17 has and which 2.6.9 doesn't? Possibly relevant: http://www.x86-64.org/lists/discuss/msg08255.html

I'm not sure that the mailing-list discussion is very enlightening. First, the topology of CPU connections in 8-way (see ftp://ftp.tyan.com/datasheets/d_s4881_104.pdf ) and 4-way servers is different, as an Opteron 8xx has only 3 HT channels. It is clear that in an 8-way server CPUs are up to 3 hops away (on average 1.786), whereas in a 4-way server that would be 1 or 2 hops (on average 1.333). Benchmark results will thus depend a lot on which CPUs are involved. Second, the theoretical maximum speed of DDR400 memory is 6.4GB/s; HyperTransport (HT1000) between the CPUs will limit this to 2GB/s for each link, and I wouldn't expect that one can simply add up all the maximum numbers for all CPUs, unless a lot of work is devoted to the setup of the benchmark experiment. Third, the "benchmark" numbers do not depend on the kernel version.

Possibly not very relevant for this bug: I've started to worry a bit about whether running irqbalance is a good idea on this server, as only cores 0-3 should handle interrupts for disks and network. /proc/interrupts shows that currently some of the high-numbered CPUs are also engaged in interrupt handling, and some CPUs (1,5,7,9,11-15) not at all. Should I switch to "ONESHOT=yes" in /etc/sysconfig/irqbalance, or use a different method?

Concerning my question about the "flat APIC": this was referring to Bugzilla 192760. Concerning the kernel warning "many lost ticks": this usually appears just after the kernel boots and never comes again, so I think it is harmless.

Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release you requested us to review is now End of Life.
Please see https://access.redhat.com/support/policy/updates/errata/ . If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via the appropriate support channels and provide additional supporting details about the importance of this issue.