200885 – kernel BUG triggered by Java app on x86_64 (32GB; 16-core Tyan VX50)

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.0
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Bhavna Sarathy
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-08-01 11:28 UTC by Kay Diederichs
Modified:	2012-06-20 13:26 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2012-06-20 13:26:59 UTC
Target Upstream Version:
Embargoed:

Attachments
log obtained by netconsole (23.56 KB, application/x-msdownload) 2006-08-01 11:28 UTC, Kay Diederichs
dmesg output (8.06 KB, application/x-msdownload) 2006-08-01 11:30 UTC, Kay Diederichs
output of "top" when machines crashes, note the high "sy" fraction (1.10 KB, application/octet-stream) 2006-08-01 11:38 UTC, Kay Diederichs
output of "vmstat 1" when machine crashes (1.37 KB, application/octet-stream) 2006-08-01 11:39 UTC, Kay Diederichs

Description Kay Diederichs 2006-08-01 11:28:22 UTC

Description of problem:
A "workflow" of a Java user crashes our server reproducibly. The server, which 
runs at update 3, is otherwise stable over weeks.

Version-Release number of selected component (if applicable):
kernel-largesmp-2.6.9-34.0.2.EL. I also tried kernel-largesmp-2.6.9-42.EL
from J.Baron's website, with the same result.

How reproducible:
always 

Steps to Reproduce:
1. user loads Java app
2.
3.
  
Actual results:
crash as shown in attached log

Expected results:
no crash

Additional info:
dmesg output is at http://strucbio.biologie.uni-konstanz.de/~kay/dmesg
the crashlog is also at http://strucbio.biologie.uni-konstanz.de/~kay/crash2 ,
and a kernel-largesmp-2.6.9-42.EL crashlog is at
http://strucbio.biologie.uni-konstanz.de/~kay/crash (the latter is tainted by
VMware modules, but the crashes are unrelated)

Comment 1 Kay Diederichs 2006-08-01 11:28:23 UTC

Created attachment 133387
log obtained by netconsole

Comment 2 Kay Diederichs 2006-08-01 11:30:26 UTC

Created attachment 133388
dmesg output

Comment 3 Kay Diederichs 2006-08-01 11:38:07 UTC

Created attachment 133389
output of "top" when machines crashes, note the high "sy" fraction

Comment 4 Kay Diederichs 2006-08-01 11:39:44 UTC

Created attachment 133390
output of "vmstat 1" when machine crashes

Comment 5 Jason Baron 2006-08-01 20:05:22 UTC

looks like cpu 5 is spinning on the sighand->siglock. The question is who is
holding it so long....We could try a debug patch to figure this out...if you can
reliably reproduce it...is it easier for us to give you a debug kernel, or is it
possible that we can get the reproducer?

Comment 6 Kay Diederichs 2006-08-02 07:34:11 UTC

A debug kernel would be fine.

Comment 7 Jason Baron 2006-08-09 19:15:22 UTC

hmmm...can you set up netdump? it looks like you already have netconsole
working. thanks.

Comment 8 Jason Baron 2006-08-09 19:19:28 UTC

this would really help us narrow down the cause, possibly without relying on a
debug kernel.

Comment 9 Kay Diederichs 2006-08-10 08:07:41 UTC

my intention had indeed been to capture vmcore on the netdump server. However,
so far only a file
-rw-------  1 netdump netdump      0 Aug  1 12:44 vmcore-incomplete
is written, whereas the "log" indeed is captured ok in the same directory. What
could prevent the vmcore from being written? The only lines in
/etc/sysconfig/netdump on the client that do not start with "# " are
NETDUMPADDR=aaa.bbb.ccc.ddd
SYSLOGADDR=aaa.bbb.ccc.ddd
where aaa.bbb.ccc.ddd are actually the IP of the netdump-server, which is a
fully updated RHEL4 clone running 2.6.9-22.0.1.EL. I realize that the SYSLOGADDR
does nothing useful, as syslogd on the netdump-server is not running with -r.
However, the netlog portion of netdump does work. Could it be because the server
has only 14G free on the partition which has /var/crash, whereas the client has
32GB of memory?
So, before I can provide a vmcore I need help with that. Thanks.

Comment 10 Dave Anderson 2006-08-10 12:42:27 UTC

> Could it be because the server has only 14G free on the partition
> which has /var/crash, whereas the client has 32GB of memory?

That is correct -- before attempting to create the vmcore file
the netdump-server does a space check on that partition.
If that check fails (or any of several other error conditions
occur), the vmcore creation is suspended, an error message is
logged, and the dumpfile is left named "vmcore-incomplete".
If you look in the /var/log/messages file on the netdump-server,
you will find the reason for failure.  Although there are several
error possibilities, you'll probably see a "No space for dump image"
message generated by the netdump-server daemon.

BTW, if this can be reproduced reliably with a smaller amount
of memory, you could try booting the kernel with the "mem=Xg"
command line argument, where X is the number of gigabytes you'd
like to restrict the kernel to use.  Perhaps "mem=8g" might be
sufficient to reproduce the bug?  That would reduce the 
possibility of any other vmcore-creation failure and make it
a hell of a lot easier to ship the vmcore around.
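
For illustration, the kind of free-space comparison described above can be sketched as follows. This is not the netdump-server's actual code; it assumes (as a simplification) that an uncompressed vmcore needs roughly as much disk space as the client has RAM, and the function names are hypothetical.

```python
import shutil

GIB = 1024 ** 3

def has_room_for_vmcore(free_bytes: int, client_ram_bytes: int) -> bool:
    """Return True if the free space can hold a dump as large as the client's RAM.

    Assumption: the vmcore is about the size of the client's memory.
    """
    return free_bytes >= client_ram_bytes

def check_dump_dir(path: str, client_ram_bytes: int) -> bool:
    """Apply the same comparison to the partition that holds `path`."""
    return has_room_for_vmcore(shutil.disk_usage(path).free, client_ram_bytes)

if __name__ == "__main__":
    # Numbers from this report: 14G free on the server, 32GB client.
    print(has_room_for_vmcore(14 * GIB, 32 * GIB))  # False
    # The same server with the client booted with mem=8g.
    print(has_room_for_vmcore(14 * GIB, 8 * GIB))   # True
```

On the netdump-server itself, `check_dump_dir("/var/crash", 32 * GIB)` would run the same comparison against the real partition.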

Comment 11 Kay Diederichs 2006-08-10 13:27:44 UTC

Ok, I found the netdump-server lines in the server's /var/log/messages which
indicated that the space was insufficient. 
I moved /var/crash to an empty partition so that should no longer be a problem.
We will try to reproduce the problem tomorrow morning, with mem=8G .

Comment 12 Kay Diederichs 2006-08-11 11:24:31 UTC

I have an 8GB vmcore (vmcore.bz2 is only 1.1GB), and the "log", from the
2.6.9-42.ELlargesmp kernel from Jason's website. Where can I ftp the files?

Comment 13 Dave Anderson 2006-08-11 12:51:04 UTC

Any chance you have a publicly available web site
that it can be downloaded from (at least temporarily)?

Normally vmcore transitions go through our support engineering
ftp site, but that might require that this bugzilla be
escalated via your support contract.  They will open an
issue tracker, and handle all the details:

  https://www.redhat.com/support/service/

or:

  Red Hat Global Support Services at 1-888-RED-HAT1

Comment 14 Kay Diederichs 2006-08-11 12:57:41 UTC

I also now got a 4GB vmcore (.bz2 is only 441805515 bytes) from netdump, and its
 "log". This is from the 2.6.9-34.0.2 kernel (CentOS-compiled).

One more datapoint: we were unable to crash the 2.6.17.7 kernel , so it does
appear to be a 2.6.9 bug.

I put the files into http://strucbio.biologie.uni-konstanz.de/~kay/crashdumps.1
and crashdumps.2 , respectively .

Comment 15 Kay Diederichs 2006-08-14 09:07:13 UTC

I'll remove the directories
http://strucbio.biologie.uni-konstanz.de/~kay/crashdumps.1 (2.6.9-42 kernel, 8GB
memory) and
http://strucbio.biologie.uni-konstanz.de/~kay/crashdumps.2 (2.6.9-34.0.2 kernel,
4 GB memory) tomorrow.

Looking forward to a fix (other than 2.6.17.7)!

Comment 16 Jason Baron 2006-08-15 16:09:54 UTC

after looking at this a bit, i was wondering if there is just a lot of lock
contention, and that disabling the nmi_watchdog by default would make the panic
go away....thus, can you please try booting with: "nmi_watchdog=0" at the kernel
command line and see if you can still reproduce the crash. thanks.
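
To confirm after a reboot that the option actually reached the kernel, the running command line can be read back from /proc/cmdline. A minimal sketch follows; the helper names and the fallback sample string are hypothetical and are not taken from this machine.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params

def nmi_watchdog_disabled(cmdline: str) -> bool:
    """True only if nmi_watchdog=0 is present on the command line."""
    return parse_cmdline(cmdline).get("nmi_watchdog") == "0"

if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            current = f.read()
    except OSError:
        # Hypothetical sample, used only where /proc/cmdline is unavailable.
        current = "ro root=LABEL=/ nmi_watchdog=0"
    print(nmi_watchdog_disabled(current))
```

The same parser also picks up a "mem=8g" setting, since it treats every name=value token alike.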

Comment 17 Kay Diederichs 2006-08-16 13:33:20 UTC

I disabled the nmi_watchdog and indeed, since then, the Java user has not been
able to trigger the problem! It is maybe a bit early to be entirely sure but
according to our experience so far this is significant progress.
A couple of questions arise:
a) Is there a difference between the nmi_watchdog in the vanilla kernel (which
survived the "Java test") and the RHEL-2.6.9-42 one (which panic-ed)?
b) Should I use "hangcheck_timer" instead (this is what Oracle - which is not
used on this machine - appears to recommend)? 
c) Should I enable the "Watchdog Timer" in the BIOS of the machine?
d) could the 5-second timeout of the nmi_watchdog be prolonged to make it useful
again on this machine?
e) are there any other recommendations or known caveats for a 16core/32G machine?
Anyway, thanks a lot for finding the bit that appears to have caused the trouble!

Comment 18 Jason Baron 2006-08-16 16:54:03 UTC

ok, thanks for the feedback. However, despite the fact that there is no panic
there still is a real problem here. The fact that the nmi watchdog can trigger
b/c a processor has interrupts disabled for 5+ seconds is unacceptable. This may
however, be a hardware issue in that all processors aren't being treated
fairly....that would require a bit more investigation...I'm curious if there are
still 'lost timer' messages in /var/log/messages...and also if there are any
noticeable problems that you have observed.

a) i don't think the difference in the nmi_watchdog is the cause of this problem,
but they are likely to be a bit different since the source base is different.
b) 'hangcheck_timer' works by setting kernel timers and making sure they fire
within a certain interval, it does try to detect lockups but works quite
differently from the nmi watchdog...
c) i guess that's a h/w watchdog, don't know much about it...
d) that's a thought, perhaps the timeout needs to scale with the number of
processors
e) this is the first time i've seen this problem on rhel4 x86_64, however i have
heard of similar issues with NUMA memory systems where memory access is not
uniform and thus different processors don't share locks fairly.
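
As a rough userspace analogy for the timer-based check described in (b): arm a timer for a fixed interval and, when it fires, compare the elapsed time with the interval plus a tolerance. This is only a sketch of the idea, not the hangcheck_timer module's code; the function names, the tick and the margin are illustrative assumptions.

```python
import time

def fired_late(armed_at: float, fired_at: float, tick: float, margin: float) -> bool:
    """True if a timer armed for `tick` seconds fired more than `margin` seconds late."""
    return (fired_at - armed_at) > (tick + margin)

def watch_once(tick: float = 0.1, margin: float = 1.0) -> bool:
    """Sleep for one tick and report whether the wake-up was unacceptably late."""
    armed_at = time.monotonic()
    time.sleep(tick)
    return fired_late(armed_at, time.monotonic(), tick, margin)

if __name__ == "__main__":
    # Timestamps injected by hand: a 0.5 s timer that fired after 5.5 s.
    print(fired_late(0.0, 5.5, tick=0.5, margin=1.0))  # True
    print(fired_late(0.0, 0.6, tick=0.5, margin=1.0))  # False
```

A check like this only runs when the timer itself gets to fire, which is one way it differs from the nmi watchdog discussed above.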

Comment 19 Kay Diederichs 2006-08-16 18:00:18 UTC

when you say "This may however, be a hardware issue in that all processors aren't
being treated fairly" could it have something to do with the "flat" APIC model
(I don't understand the technical details) which 2.6.17 has and which 2.6.9 doesn't?

Comment 20 Jason Baron 2006-08-16 19:45:45 UTC

possibly relevant: 
http://www.x86-64.org/lists/discuss/msg08255.html

Comment 21 Kay Diederichs 2006-08-17 11:41:54 UTC

I'm not sure that the mailing-list discussion is very enlightening. First, the
topology of CPU connections in 8- (see 
ftp://ftp.tyan.com/datasheets/d_s4881_104.pdf ) and 4-way servers is different
as an Opteron 8xx has only 3 HT channels. It is clear that in an 8-way server CPUs
are up to 3 hops away (on average 1.786), whereas in a 4-way that would be 1 or
2 hops (on average 1.333). Benchmark results will thus depend a lot on which
CPUs are involved. Second, the theoretical maximum speed of DDR400 memory is
6.4GB/s; Hypertransport (HT1000) between the CPUs will be limiting this to 2GB/s
for each link, and I wouldn't expect that one can simply add up all the maximum
numbers for all CPUs, unless a lot of work is devoted to the setup of the
benchmark experiment. Third, the "benchmark" numbers do not depend on the kernel
version.
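
The hop averages quoted above can be checked with a short breadth-first search over the socket topology. The sketch below assumes the 4-way layout is a simple square (each socket linked to two neighbours), which reproduces the 1.333 figure; the 8-way layout is in the Tyan datasheet and is not reproduced here. The DDR400 figure is recomputed assuming dual-channel, 64-bit-wide memory.

```python
from collections import deque

def average_hops(links: dict) -> float:
    """Mean shortest-path length over all ordered pairs of distinct sockets."""
    total = pairs = 0
    for src in links:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search outward from src
            node = queue.popleft()
            for nxt in links[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(d for n, d in dist.items() if n != src)
        pairs += len(dist) - 1
    return total / pairs

# Assumed 4-way topology: four sockets connected in a square.
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(round(average_hops(square), 3))  # 1.333

# DDR400 peak, assuming 400e6 transfers/s, 8 bytes per transfer, 2 channels.
ddr400_peak_bytes = 400e6 * 8 * 2
print(ddr400_peak_bytes / 1e9)  # 6.4
```

Passing the 8-way adjacency from the datasheet to average_hops() would test the 1.786 figure in the same way.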

Possibly not very relevant for this bug: I've started to worry a bit about
whether running irqbalance is a good idea on this server, as only cores 0-3
should handle interrupts for disks and network. /proc/interrupts shows that
currently also some of the CPUs with high numbers are engaged in interrupt
handling, and some CPUs (1,5,7,9,11-15) not at all. Should I switch to
"ONESHOT=yes" in /etc/sysconfig/irqbalance, or use a different method?
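
A quick way to see which CPUs are servicing device interrupts is to sum the per-CPU columns of /proc/interrupts. The sketch below runs on made-up sample text (the counts and device names are hypothetical, not taken from this server) and skips the NMI, LOC, ERR and MIS rows, on the assumption that only device interrupt lines are of interest here.

```python
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
  0:     500000          0          0          0    IO-APIC-edge  timer
 14:       1200          0        300          0    IO-APIC-edge  ide0
169:          0          0      42000          0   IO-APIC-level  eth0
NMI:        100        100        100        100
"""

def per_cpu_totals(text: str, skip=("NMI", "LOC", "ERR", "MIS")) -> list:
    """Sum interrupt counts per CPU column, ignoring the rows named in `skip`."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row lists one name per CPU
    totals = [0] * ncpu
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if label.strip() in skip:
            continue
        for cpu, field in enumerate(rest.split()[:ncpu]):
            if field.isdigit():  # stop counting at the description text
                totals[cpu] += int(field)
    return totals

def idle_cpus(totals: list) -> list:
    """CPU indices that handled no counted interrupts at all."""
    return [cpu for cpu, count in enumerate(totals) if count == 0]

totals = per_cpu_totals(SAMPLE)
print(totals)             # [501200, 0, 42300, 0]
print(idle_cpus(totals))  # [1, 3]
```

On a live system the text would come from open("/proc/interrupts").read() instead of SAMPLE.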

Concerning my question having to do with the "flat APIC" - this was referring to
Bugzilla 192760.

Concerning the kernel warning "many lost ticks": this appears usually after the
kernel boots, and never comes again so I think it is harmless.

Comment 23 Jiri Pallich 2012-06-20 13:26:59 UTC

Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life. 
Please See https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.
