Bug 117210

Summary: Red Hat Enterprise Linux 3.0 - QU1 freezes when running a jni-thread test case with JRockit JVM.
Product: Red Hat Enterprise Linux 3
Reporter: Stefan Särne <stefan.sarne>
Component: kernel
Assignee: Larry Woodman <lwoodman>
Status: CLOSED WONTFIX
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 3.0
CC: johan.walles, k.georgiou, mingo, petrides, stefan.sarne, yasun
Hardware: i386   
OS: Linux   
Last Closed: 2007-10-19 19:29:36 UTC
Attachments:
  98165  This is a small reproducer that DO NOT reproduce the issue.
  98240  Reproducer for the problem to use with jrockit.
  98676  A log with sysrq+t and a few sysrq+m and sysrq+w
  99028  sysrq + t m m m m w w w p p p p

Description Stefan Särne 2004-03-01 17:21:29 UTC
A Java test case, run with the JRockit JVM under test, freezes the
complete system (OS) on some machines. The system does not recover and
responds only to very limited actions. For instance, you can switch
login sessions, and keys typed as the user name are echoed, but it
never moves on to reading the password.

This problem is fairly reproducible on some machines. On one machine
it happens every time. On another it happens only occasionally.
In all I have seen it happen on three machines. What they all have
in common is that they are dual P4 machines and the kernel is booted
in smp-mode.

Version:
Linux version 2.4.21-9.ELsmp (bhcompile.redhat.com)
(gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-26)) #1 SMP Thu Jan 8 
17:08:56 EST 2004

JRockit version:
The RHEL3 beta which you have in house (Ariane sp3 load 5).
I ran with a debug version; let me know how to get it to you and I can provide it.

This was also seen before upgrading to QU1, but it still remains after
the upgrade, and that is why I refer to QU1. As far as I know, QU1 is
also what we plan to support with our next release.

The test case starts 100 threads and then does some native calls
using JNI. The behavior seems to be that the first 30-40 threads
start without problems, then it slows down, and at around 60 threads it
freezes. A successful run starts the threads and joins them within a
few seconds. In the failing case the OS freezes as mentioned and,
according to the System Manager, it consumes 100% of the CPU.
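
For readers without the attachment, the shape of the workload described above (start 100 threads, have each make a native call, join them all within a few seconds) can be sketched roughly as follows. This is only a Python analogue under stated assumptions: it is not the attached JRockit reproducer, the name run_threads and its parameters are hypothetical, os.getpid stands in for the JNI call, and nothing suggests this sketch would trigger the freeze.

```python
import os
import threading
import time


def run_threads(count=100, native_call=os.getpid, join_timeout=5.0):
    """Start `count` threads that each make one call, then join them.

    Returns (number_of_threads_that_finished, elapsed_seconds).
    `native_call` is a stand-in for the JNI call in the real test case.
    """
    finished = []
    lock = threading.Lock()

    def worker(index):
        native_call()              # placeholder for the native (JNI) work
        with lock:
            finished.append(index)

    started = time.monotonic()
    # daemon=True so a stuck worker cannot keep the process alive
    threads = [
        threading.Thread(target=worker, args=(i,), daemon=True)
        for i in range(count)
    ]
    for t in threads:
        t.start()

    # Share one overall deadline across all joins ("a few seconds").
    deadline = started + join_timeout
    for t in threads:
        t.join(timeout=max(0.0, deadline - time.monotonic()))

    with lock:
        done = len(finished)
    return done, time.monotonic() - started


if __name__ == "__main__":
    done, elapsed = run_threads()
    print(f"{done}/100 threads finished in {elapsed:.2f}s")
```

In a healthy run every thread finishes well inside the timeout; the report above describes the real test stalling at around 60 threads instead.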

> What consumes 100% of the CPU? (Include the answer to this in the
> bugzilla report).

Sorry, but I cannot really tell. When this occurs, the system is no
longer responding. I have run with strace, but then the freeze no
longer occurs. I assume it is some kind of timing issue that goes away
given the overhead. I will try to see if I can get some more info
on this. If so, I will add the info to the bugzilla.

---

> Can I help you with more info, or send you a test case to reproduce?

A simple C reproducer has been written, but it does NOT reproduce this
behaviour. I will still attach it.

I will also attach a Java test case, which reproduces the behaviour
with JRockit. It is very hard to debug this issue from the upper
layers, since it is hard to do anything when the system freezes. Even
though the test case is quite large, I hope the behaviour will show up
quickly and deterministically enough that you can debug it by seeing
what happens below.

Comment 1 Stefan Särne 2004-03-01 17:26:18 UTC
Created attachment 98165 [details]
This is a small reproducer that DO NOT reproduce the issue.

Comment 2 Thomas Fitzsimmons 2004-03-01 17:52:55 UTC
Can you run top in another window and find out what process is
dominating the CPU?


Comment 3 Stefan Särne 2004-03-01 18:51:18 UTC
Top shows:
 cpu00  99.6%  ...
 cpu01  99.8%  ...

%CPU  %MEM Command
49,9   7,4  java
1,2    2,0  X
0,1    2,9  gnome-terminal
0,0    0,0  ...
...


Comment 4 Stefan Särne 2004-03-03 14:59:30 UTC
Created attachment 98240 [details]
Reproducer for the problem to use with jrockit.

Includes the library file, class files and source code.
The file compile.sh can be used to compile and contains an example of how to run.

Comment 6 Thomas Fitzsimmons 2004-03-18 17:18:08 UTC
Posting this here.  These discussions should take place in Bugzilla.

From: 	Arvind Jain <ajain>
To: 	Thomas Fitzsimmons <fitzsim>
Cc: 	Stefan Sarne <stefan>, Georges Saab <gsaab>
Subject: 	Status on open bug?
Date: 	Wed, 17 Mar 2004 14:53:01 -0800	
Hi Tom,
 
Any updates on this bug -
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=117210?  Will it
be fixed with Update 2 of RHEL 3.0?
 
Thanks,
Arvind


Comment 7 Thomas Fitzsimmons 2004-03-18 17:19:01 UTC
From: 	Thomas Fitzsimmons <fitzsim>
To: 	Arvind Jain <ajain>
Cc: 	Stefan Sarne <stefan>, Georges Saab <gsaab>,
external-bea-java
Subject: 	Re: Status on open bug?
Date: 	Wed, 17 Mar 2004 19:24:17 -0500	
On Wed, 2004-03-17 at 17:53, Arvind Jain wrote:
> Hi Tom,
>  
> Any updates on this bug -
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=117210?  Will it
> be fixed with Update 2 of RHEL 3.0?

I'll check with the engineer to whom this is assigned.  However, I
should state again that the chances of this being fixed are much higher
if you can reproduce the problem with a simple C test case.  Remember
that we absolutely cannot debug any aspect of the JVM -- it is a black
box to us -- so we'd really like to be convinced that this is a problem
in the OS and not the JVM itself.

Tom

Comment 8 Thomas Fitzsimmons 2004-03-18 17:20:56 UTC
From: 	Johan Walles <johan.walles>
To: 	Thomas Fitzsimmons <fitzsim>
Cc: 	Arvind Jain <ajain>, Stefan Sarne <stefan>,
Georges Saab <gsaab>, external-bea-java
Subject: 	Re: Status on open bug?
Date: 	Thu, 18 Mar 2004 08:34:40 +0100

I understand why you are reluctant to use the JVM as a test case, but
the reason *I* think it should be enough (in case it actually
reproduces the problem for you) is that it makes the whole machine
become unresponsive.

Since no user-space app, especially one run as non-root, should ever
be able to mess up a machine that way, this should (at least) be an OS
problem of some kind.

The JVM may of course have problems as well, and they may be related
to this, but as long as the OS below us goes bye-bye every time we
run, the OS is *definitely* doing something wrong (as well).

   Regards //Johan

Comment 9 Rik van Riel 2004-03-18 23:37:36 UTC
If it is indeed a kernel problem (quite likely at this point), then we
should be able to get some more information from the sysrq reporting
facilities.

Could you please reproduce the bug on a system with serial console and
send us the output of sysrq+t, sysrq+m and sysrq+s or sysrq+w ?

Possibly multiple captures of sysrq+m and sysrq+w

After that we should be able to tell more...

Comment 10 Stefan Särne 2004-03-19 12:47:16 UTC
Created attachment 98676 [details]
A log with sysrq+t and a few sysrq+m and sysrq+w

Attaching requested log.

/Stefan

Comment 12 Rik van Riel 2004-03-24 23:03:38 UTC
OK, here are the two currently running tasks.  The second one doesn't
look too interesting now, unless it's holding a lock needed by the
first one.  I'll comb through the other call traces to figure out who
is holding which lock, and exactly which lock do_fork() is waiting for
... if it's waiting.

java          R 85E63B08     0  3357   3288  3358             (NOTLB)
Call Trace:   [<c0126f5e>] do_fork [kernel] 0x4e (0xf246ff68)
[<c014b9ff>] sys_mprotect [kernel] 0x16f (0xf246ff8c)
[<c0109d09>] sys_clone [kernel] 0x49 (0xf246ffa0)

java          R current   5024  3374   3357        3375  3373 (NOTLB)
Call Trace:   [<c01c1236>] receive_chars [kernel] 0x1d6 (0xf11e7f1c)
[<c01c17fa>] rs_interrupt_single [kernel] 0x12a (0xf11e7f4c)
[<c010d879>] handle_IRQ_event [kernel] 0x69 (0xf11e7f78)
[<c010dab9>] do_IRQ [kernel] 0xb9 (0xf11e7f98)
[<c010da00>] do_IRQ [kernel] 0x0 (0xf11e7fbc)

Unfortunately there wasn't anything useful in the sysrq+w output.
Could you please try sysrq+p ?
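
The two task entries quoted above can also be pulled out of a longer sysrq+t capture mechanically. Below is a minimal Python sketch, assuming the header columns are name, state, PC (or "current"), free stack, PID and parent PID, as they appear in the excerpt. The regular expressions and the names parse_tasks and SAMPLE are illustrative only and not part of any kernel tool.

```python
import re

# Task header, e.g. "java  R 85E63B08  0  3357  3288  3358  (NOTLB)".
# Column meaning is an assumption read off the excerpt above.
TASK_RE = re.compile(
    r"^(?P<name>\S+)\s+(?P<state>[A-Z])\s+(?P<pc>current|[0-9A-Fa-f]{8})"
    r"\s+(?P<free>\d+)\s+(?P<pid>\d+)\s+(?P<ppid>\d+)"
)
# Call trace frame, e.g. "[<c0126f5e>] do_fork [kernel] 0x4e (0xf246ff68)".
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<func>\S+)\s+\[(?P<module>[^\]]+)\]"
)

SAMPLE = """\
java          R 85E63B08     0  3357   3288  3358             (NOTLB)
Call Trace:   [<c0126f5e>] do_fork [kernel] 0x4e (0xf246ff68)
[<c014b9ff>] sys_mprotect [kernel] 0x16f (0xf246ff8c)
[<c0109d09>] sys_clone [kernel] 0x49 (0xf246ffa0)

java          R current   5024  3374   3357        3375  3373 (NOTLB)
Call Trace:   [<c01c1236>] receive_chars [kernel] 0x1d6 (0xf11e7f1c)
[<c01c17fa>] rs_interrupt_single [kernel] 0x12a (0xf11e7f4c)
[<c010d879>] handle_IRQ_event [kernel] 0x69 (0xf11e7f78)
[<c010dab9>] do_IRQ [kernel] 0xb9 (0xf11e7f98)
[<c010da00>] do_IRQ [kernel] 0x0 (0xf11e7fbc)
"""


def parse_tasks(log_text):
    """Return one dict per task header, with the function names that follow."""
    tasks = []
    for line in log_text.splitlines():
        header = TASK_RE.match(line.strip())
        if header:
            tasks.append({
                "name": header.group("name"),
                "state": header.group("state"),
                "pid": int(header.group("pid")),
                "ppid": int(header.group("ppid")),
                "trace": [],
            })
            continue
        if tasks:
            # Attach any frames on this line to the most recent task.
            for frame in FRAME_RE.finditer(line):
                tasks[-1]["trace"].append(frame.group("func"))
    return tasks


if __name__ == "__main__":
    for task in parse_tasks(SAMPLE):
        if task["state"] == "R":
            print(task["name"], task["pid"], task["ppid"], task["trace"])
```

Filtering on state "R" gives the runnable tasks. Under the assumed column layout, the second task (PID 3374) lists the first (PID 3357) as its parent.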

Comment 13 Stefan Särne 2004-04-01 08:03:53 UTC
Created attachment 99028 [details]
sysrq + t m m m m w w w p p p p

Comment 15 Thomas Fitzsimmons 2006-06-30 20:19:01 UTC
Didn't mean to close this one, reopening.


Comment 16 Larry Woodman 2006-12-01 20:19:23 UTC
Can someone determine if this is still a problem in the latest RHEL3 kernel? 
This bug is very old and there have been lots of fixes and updates included in
the kernel since RHEL3-U1.

Larry Woodman


Comment 17 RHEL Program Management 2007-10-19 19:29:36 UTC
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
 
For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.