117210 – Red Hat Enterprise Linux 3.0 - QU1 freezes when running a jni-thread test case with JRockit JVM.

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Larry Woodman
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-03-01 17:21 UTC by Stefan Särne
Modified:	2008-08-02 23:40 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-10-19 19:29:36 UTC
Target Upstream Version:
Embargoed:

Attachments
This is a small reproducer that DOES NOT reproduce the issue. (1.16 KB, text/plain) 2004-03-01 17:26 UTC, Stefan Särne
Reproducer for the problem to use with jrockit. (12.85 KB, application/octet-stream) 2004-03-03 14:59 UTC, Stefan Särne
A log with sysrq+t and a few sysrq+m and sysrq+w (72.26 KB, text/plain) 2004-03-19 12:47 UTC, Stefan Särne
sysrq + t m m m m w w w p p p p (74.72 KB, text/plain) 2004-04-01 08:03 UTC, Stefan Särne

Description Stefan Särne 2004-03-01 17:21:29 UTC

There is a Java test case, run with the JRockit JVM under testing, that
freezes the complete system (OS) on some machines. The system does not
recover and responds only to very limited actions. For instance, you
can switch login sessions, and keys typed as the user name are echoed,
but it never moves on to reading the password.

This problem is fairly reproducible on some machines. On one machine
it happens every time. On another it happens only occasionally.
In all I have seen it happen on three machines. What they all have
in common is that they are dual P4 machines and the kernel is booted
in smp-mode.

Version:
Linux version 2.4.21-9.ELsmp (bhcompile.redhat.com)
(gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-26)) #1 SMP Thu Jan 8 
17:08:56 EST 2004

JRockit version:
The RHEL3 beta which you have in house (Ariane sp3 load 5).
I ran with a debug version; let me know how to get it to you and I can
provide it.

This was also seen before upgrading to QU1, but it still remained, and
that is why I refer to QU1. It is also what we plan to support, as far
as I know, with our next release.

The test case starts 100 threads and then does some native calls
using JNI. The behavior seems to be that the first 30-40 threads
start without problems, then it slows down, and at around 60 threads it
freezes. A successful run starts the threads and joins them within a
few seconds. In the failing case the OS freezes as mentioned, and
according to the System Manager it consumes 100% of the CPU.
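
The shape of the test described above (start 100 threads, have each
make a native call, then join them all) can be sketched as follows.
This is only an illustration, not the attached reproducer, and it is
not expected to trigger the freeze. The thread count of 100 comes from
the description; the use of Python's threading and ctypes modules, the
getpid() call standing in for the JNI native call, and the names worker
and run_test are all assumptions made for the sketch. It assumes a
Linux system where ctypes.CDLL(None) exposes the C library.

```python
import ctypes
import threading

NUM_THREADS = 100  # thread count taken from the bug description

# Handle to the C library already loaded into this process
# (assumption: Linux, where dlopen(NULL) exposes libc symbols).
libc = ctypes.CDLL(None)


def worker(index, results):
    # Stand-in for the JNI native call: invoke getpid() in native code.
    results[index] = libc.getpid()


def run_test(num_threads=NUM_THREADS):
    """Start num_threads threads, let each make one native call, join all."""
    results = [None] * num_threads
    threads = [
        threading.Thread(target=worker, args=(i, results))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    pids = run_test()
    print("joined", len(pids), "threads")
```

On a healthy system this should start and join all threads almost
immediately, which matches the "successful run" case described above.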

> What consumes 100% of the CPU? (Include the answer to this in the
> bugzilla report).

Sorry, but I cannot really tell. When this occurs, the system is no
longer responding. I have run with strace, but then the problem no
longer occurs. I assume it is some kind of timing issue that goes away
given the added overhead. I will try to see if I can give some more
info on this. I will add the info to the bugzilla in that case.

---

> Can I help you with more info, or send you a test case to reproduce?

A simple C reproducer has been written, but it DOES NOT reproduce this
behaviour. I will still attach it.

I will also attach a Java test case, which reproduces the behaviour
with JRockit. It is very hard to debug this issue from the upper
layers, since it is hard to do anything once the system freezes. Even
though the test case is quite large, I hope the behaviour will show up
quickly and deterministically, so that you will be able to debug it by
seeing what happens at the lower levels.

Comment 1 Stefan Särne 2004-03-01 17:26:18 UTC

Created attachment 98165 [details]
This is a small reproducer that DOES NOT reproduce the issue.

Comment 2 Thomas Fitzsimmons 2004-03-01 17:52:55 UTC

Can you run top in another window and find out what process is
dominating the CPU?

Comment 3 Stefan Särne 2004-03-01 18:51:18 UTC

Top shows:
 cpu00  99.6%  ...
 cpu01  99.8%  ...

%CPU  %MEM Command
49,9   7,4  java
1,2    2,0  X
0,1    2,9  gnome-terminal
0,0    0,0  ...
...

Comment 4 Stefan Särne 2004-03-03 14:59:30 UTC

Created attachment 98240 [details]
Reproducer for the problem to use with jrockit.

Includes the library file, class files and source code.
The file compile.sh can be used to compile and contains an example of how to run.

Comment 6 Thomas Fitzsimmons 2004-03-18 17:18:08 UTC

Posting this here.  These discussions should take place in Bugzilla.

From: 	Arvind Jain <ajain>
To: 	Thomas Fitzsimmons <fitzsim>
Cc: 	Stefan Sarne <stefan>, Georges Saab <gsaab>
Subject: 	Status on open bug?
Date: 	Wed, 17 Mar 2004 14:53:01 -0800

Hi Tom,

Any updates on this bug -
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=117210?  Will it
be fixed with Update 2 of RHEL 3.0?

Thanks,
Arvind

Comment 7 Thomas Fitzsimmons 2004-03-18 17:19:01 UTC

From: 	Thomas Fitzsimmons <fitzsim>
To: 	Arvind Jain <ajain>
Cc: 	Stefan Sarne <stefan>, Georges Saab <gsaab>,
external-bea-java
Subject: 	Re: Status on open bug?
Date: 	Wed, 17 Mar 2004 19:24:17 -0500

On Wed, 2004-03-17 at 17:53, Arvind Jain wrote:
> Hi Tom,
>  
> Any updates on this bug -
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=117210?  Will it
> be fixed with Update 2 of RHEL 3.0?

I'll check with the engineer to whom this is assigned.  However, I
should state again that the chances of this being fixed are much higher
if you can reproduce the problem with a simple C test case.  Remember
that we absolutely cannot debug any aspect of the JVM -- it is a black
box to us -- so we'd really like to be convinced that this is a problem
in the OS and not the JVM itself.

Tom

Comment 8 Thomas Fitzsimmons 2004-03-18 17:20:56 UTC

From: 	Johan Walles <johan.walles>
To: 	Thomas Fitzsimmons <fitzsim>
Cc: 	Arvind Jain <ajain>, Stefan Sarne <stefan>,
Georges Saab <gsaab>, external-bea-java
Subject: 	Re: Status on open bug?
Date: 	Thu, 18 Mar 2004 08:34:40 +0100

I understand why you are reluctant to use the JVM as a test case, but
the reason *I* think it should be enough (in case it actually
reproduces the problem for you) is that it makes the whole machine
become unresponsive.

Since no user-space app, especially one run as non-root, should ever
be able to mess up a machine that way, this should (at least) be an OS
problem of some kind.

The JVM may of course have problems as well, and they may be related
to this, but as long as the OS below us goes bye-bye every time we
run, the OS is *definitely* doing something wrong (as well).

   Regards //Johan

Comment 9 Rik van Riel 2004-03-18 23:37:36 UTC

If it is indeed a kernel problem (quite likely at this point), then we
should be able to get some more information from the sysrq reporting
facilities.

Could you please reproduce the bug on a system with serial console and
send us the output of sysrq+t, sysrq+m and sysrq+s or sysrq+w ?

Possibly multiple captures of sysrq+m and sysrq+w

After that we should be able to tell more...

Comment 10 Stefan Särne 2004-03-19 12:47:16 UTC

Created attachment 98676 [details]
A log with sysrq+t and a few sysrq+m and sysrq+w

Attaching requested log.

/Stefan

Comment 12 Rik van Riel 2004-03-24 23:03:38 UTC

OK, here are the two currently running tasks.  The second one doesn't
look too interesting now, except if it's holding a lock needed by the
first one.  I'll comb through the other call traces to figure out who
is holding which lock, and exactly what lock do_fork() is waiting for
... if it's waiting.

java          R 85E63B08     0  3357   3288  3358             (NOTLB)
Call Trace:   [<c0126f5e>] do_fork [kernel] 0x4e (0xf246ff68)
[<c014b9ff>] sys_mprotect [kernel] 0x16f (0xf246ff8c)
[<c0109d09>] sys_clone [kernel] 0x49 (0xf246ffa0)

java          R current   5024  3374   3357        3375  3373 (NOTLB)
Call Trace:   [<c01c1236>] receive_chars [kernel] 0x1d6 (0xf11e7f1c)
[<c01c17fa>] rs_interrupt_single [kernel] 0x12a (0xf11e7f4c)
[<c010d879>] handle_IRQ_event [kernel] 0x69 (0xf11e7f78)
[<c010dab9>] do_IRQ [kernel] 0xb9 (0xf11e7f98)
[<c010da00>] do_IRQ [kernel] 0x0 (0xf11e7fbc)

Unfortunately there wasn't anything useful in the sysrq+w output.
Could you please try sysrq+p ?
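
Picking the running tasks out of a long sysrq+t dump, as done by hand
above, can also be scripted. The following is a minimal sketch, not a
tool used in this bug. It assumes each task header line has the layout
seen in the two entries quoted above (command name, state letter, saved
PC or the word "current", free stack, pid, parent pid) and that command
names contain no spaces. The names TASK_LINE, running_tasks and SAMPLE
are made up for the illustration.

```python
import re

# Assumed layout of a task header line in the sysrq+t output above:
#   <comm> <state> <pc|current> <free stack> <pid> <parent pid> ...
TASK_LINE = re.compile(
    r"^(?P<comm>\S+)\s+(?P<state>[A-Z])\s+"
    r"(?P<pc>[0-9A-Fa-f]{8}|current)\s+"
    r"(?P<free>\d+)\s+(?P<pid>\d+)\s+(?P<ppid>\d+)"
)


def running_tasks(log_text):
    """Return (comm, pid, ppid) for every task listed in state R."""
    tasks = []
    for line in log_text.splitlines():
        m = TASK_LINE.match(line.strip())
        if m and m.group("state") == "R":
            tasks.append(
                (m.group("comm"), int(m.group("pid")), int(m.group("ppid")))
            )
    return tasks


# Header lines and first trace lines copied from the comment above.
SAMPLE = """\
java          R 85E63B08     0  3357   3288  3358             (NOTLB)
Call Trace:   [<c0126f5e>] do_fork [kernel] 0x4e (0xf246ff68)
java          R current   5024  3374   3357        3375  3373 (NOTLB)
Call Trace:   [<c01c1236>] receive_chars [kernel] 0x1d6 (0xf11e7f1c)
"""

if __name__ == "__main__":
    for comm, pid, ppid in running_tasks(SAMPLE):
        print(comm, pid, ppid)
```

Under that assumed layout, the second task (pid 3374) lists 3357 as its
parent, i.e. the task sitting in do_fork().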

Comment 13 Stefan Särne 2004-04-01 08:03:53 UTC

Created attachment 99028 [details]
sysrq + t m m m m w w w p p p p

Comment 15 Thomas Fitzsimmons 2006-06-30 20:19:01 UTC

Didn't mean to close this one, reopening.

Comment 16 Larry Woodman 2006-12-01 20:19:23 UTC

Can someone determine if this is still a problem in the latest RHEL3 kernel?
This bug is very old and there have been lots of fixes and updates included in
the kernel since RHEL3-U1.

Larry Woodman

Comment 17 RHEL Program Management 2007-10-19 19:29:36 UTC

This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
 
For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.
