A Java test case, run with the JRockit JVM under test, freezes the complete system (OS) on some machines. The system does not recover and responds only to very limited actions. For instance, you can switch login session, and keys typed as the user name are echoed, but it never moves on to reading the password.

This problem is fairly reproducible on some machines: on one machine it happens every time, on another only occasionally. In all, I have seen it happen on three machines. What they all have in common is that they are dual P4 machines and the kernel is booted in SMP mode.

Version: Linux version 2.4.21-9.ELsmp (bhcompile.redhat.com) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-26)) #1 SMP Thu Jan 8 17:08:56 EST 2004

JRockit version: The RHEL3 beta which you have in house (Ariane sp3 load 5). I ran with a debug version, let me know how and I can give it to you.

This was also seen before upgrading to QU1, but it still remained, and that is why I refer to QU1. It is also what we plan to support, as far as I know, with our next release.

The test case starts 100 threads and then does some native calls using JNI. The behavior seems to be that the first 30-40 threads start without problem, then it slows down, and at around 60 threads it freezes. A successful run starts the threads and joins them within a few seconds. In the failing case the OS freezes as mentioned, and according to the System Manager it consumes 100% of the CPU.

> What consumes 100% of the CPU? (Include the answer to this in the
> bugzilla report).

Sorry, but I cannot really tell. When this occurs, the system is no longer responding. I have run with strace, but then it no longer occurs. I assume it is some kind of timing issue that no longer occurs given this overhead. I will try to see if I can give some more info on this. I will add the info to the bugzilla in that case.

---

> Can I help you with more info, or send you a test case to reproduce?
A simple C reproducer has been written, but it DOES NOT reproduce this behaviour. I will still attach it. I will also attach a Java test case, which reproduces the behaviour with JRockit. It is very hard to debug this issue from the upper layers, since it is hard to do anything when the system freezes. I hope that, even though the test case is quite large, the behaviour will be quick and deterministic enough for you to debug it by seeing what happens in the layers below.
Created attachment 98165 [details] This is a small reproducer that DOES NOT reproduce the issue.
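For readers without access to the attachments, the overall shape of the test described in the report (start 100 threads, have each make some native calls, then join them within a few seconds) might be sketched roughly as follows. This is only an illustration and is not the attached reproducer: `native_call`, `run_threads` and all parameter values are hypothetical, and `os.getpid()` merely stands in for the JNI calls made by the real test. It is not expected to trigger the freeze.

```python
import os
import threading
import time


def native_call():
    """Hypothetical stand-in for the JNI call made by the real test."""
    return os.getpid()


def run_threads(count=100, calls_per_thread=1000, join_timeout=10.0):
    """Start `count` threads and return how many finished before the deadline."""

    def worker():
        for _ in range(calls_per_thread):
            native_call()

    # Daemon threads so that a stuck worker cannot keep the process alive.
    threads = [threading.Thread(target=worker, daemon=True) for _ in range(count)]
    for t in threads:
        t.start()

    # Share one overall deadline across all joins instead of a per-thread timeout.
    deadline = time.monotonic() + join_timeout
    finished = 0
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))
        if not t.is_alive():
            finished += 1
    return finished
```

Calling `run_threads()` and comparing the result with the requested thread count gives a simple pass/fail signal: a healthy run joins every thread well inside the timeout, whereas a run that stalls partway would report fewer.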
Can you run top in another window and find out what process is dominating the CPU?
Top shows:

cpu00 99.6% ...
cpu01 99.8% ...

%CPU %MEM Command
49,9 7,4  java
1,2  2,0  X
0,1  2,9  gnome-terminal
0,0  0,0  ...
...
Created attachment 98240 [details] Reproducer for the problem, to use with JRockit. Includes the library file, class files and source code. The file compile.sh can be used to compile and contains an example of how to run.
Posting this here. These discussions should take place in Bugzilla.

From: Arvind Jain <ajain>
To: Thomas Fitzsimmons <fitzsim>
Cc: Stefan Sarne <stefan>, Georges Saab <gsaab>
Subject: Status on open bug?
Date: Wed, 17 Mar 2004 14:53:01 -0800

Hi Tom,

Any updates on this bug - https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=117210? Will it be fixed with Update 2 of RHEL 3.0?

Thanks,
Arvind
From: Thomas Fitzsimmons <fitzsim>
To: Arvind Jain <ajain>
Cc: Stefan Sarne <stefan>, Georges Saab <gsaab>, external-bea-java
Subject: Re: Status on open bug?
Date: Wed, 17 Mar 2004 19:24:17 -0500

On Wed, 2004-03-17 at 17:53, Arvind Jain wrote:
> Hi Tom,
>
> Any updates on this bug -
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=117210? Will it
> be fixed with Update 2 of RHEL 3.0?

I'll check with the engineer to whom this is assigned. However, I should state again that the chances of this being fixed are much higher if you can reproduce the problem with a simple C test case. Remember that we absolutely cannot debug any aspect of the JVM -- it is a black box to us -- so we'd really like to be convinced that this is a problem in the OS and not the JVM itself.

Tom
From: Johan Walles <johan.walles>
To: Thomas Fitzsimmons <fitzsim>
Cc: Arvind Jain <ajain>, Stefan Sarne <stefan>, Georges Saab <gsaab>, external-bea-java
Subject: Re: Status on open bug?
Date: Thu, 18 Mar 2004 08:34:40 +0100

I understand why you are reluctant to use the JVM as a test case, but the reason *I* think it should be enough (in case it actually reproduces the problem for you) is that it makes the whole machine become unresponsive.

Since no user-space app, especially one run as non-root, should ever be able to mess up a machine that way, this should (at least) be an OS problem of some kind. The JVM may of course have problems as well, and they may be related to this, but as long as the OS below us goes bye-bye every time we run, the OS is *definitely* doing something wrong (as well).

Regards //Johan
If it is indeed a kernel problem (quite likely at this point), then we should be able to get some more information from the sysrq reporting facilities. Could you please reproduce the bug on a system with a serial console and send us the output of sysrq+t, sysrq+m, and sysrq+s or sysrq+w? Possibly multiple captures of sysrq+m and sysrq+w. After that we should be able to tell more...
Created attachment 98676 [details] A log with sysrq+t and a few sysrq+m and sysrq+w

Attaching requested log.

/Stefan
OK, here are the two currently running tasks. The second one doesn't look too interesting now, except if it's holding a lock needed by the first one. I'll comb through the other call traces to figure out who is holding which lock, and exactly what lock do_fork() is waiting for ... if it's waiting.

java R 85E63B08 0 3357 3288 3358 (NOTLB)
Call Trace:
[<c0126f5e>] do_fork [kernel] 0x4e (0xf246ff68)
[<c014b9ff>] sys_mprotect [kernel] 0x16f (0xf246ff8c)
[<c0109d09>] sys_clone [kernel] 0x49 (0xf246ffa0)

java R current 5024 3374 3357 3375 3373 (NOTLB)
Call Trace:
[<c01c1236>] receive_chars [kernel] 0x1d6 (0xf11e7f1c)
[<c01c17fa>] rs_interrupt_single [kernel] 0x12a (0xf11e7f4c)
[<c010d879>] handle_IRQ_event [kernel] 0x69 (0xf11e7f78)
[<c010dab9>] do_IRQ [kernel] 0xb9 (0xf11e7f98)
[<c010da00>] do_IRQ [kernel] 0x0 (0xf11e7fbc)

Unfortunately there wasn't anything useful in the sysrq+w output. Could you please try sysrq+p?
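Picking the running tasks out of a long sysrq+t capture by hand is tedious, so a small filter can help. The sketch below is one possible way to do it, not a tool used in this bug. It assumes that each task header has the field order seen in the two entries quoted above (command, state, PC or "current", free stack, PID, parent PID), which is an inference from those entries rather than a documented format, and the names `TASK_RE`, `FRAME_RE` and `running_tasks` are made up for illustration.

```python
import re

# Assumed header layout: comm, state, PC (8 hex digits or "current"),
# free stack, pid, parent pid. Inferred from the two quoted entries.
TASK_RE = re.compile(
    r"(?<!\S)(?P<comm>\S+)\s+(?P<state>[RSDZTW])\s+"
    r"(?P<pc>current|[0-9A-Fa-f]{8})\s+"
    r"(?P<free>\d+)\s+(?P<pid>\d+)\s+(?P<parent>\d+)"
)

# One call-trace entry, e.g. "[<c0126f5e>] do_fork [kernel] 0x4e (0xf246ff68)".
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-fA-F]{8})>\]\s+(?P<symbol>\S+)\s+\[(?P<module>[^\]]+)\]"
)


def running_tasks(log):
    """Return the tasks in state R, each with the symbols from its call trace."""
    headers = list(TASK_RE.finditer(log))
    tasks = []
    for i, header in enumerate(headers):
        if header["state"] != "R":
            continue
        # A task's trace runs from its header up to the next task header.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(log)
        frames = [f["symbol"] for f in FRAME_RE.finditer(log, header.end(), end)]
        tasks.append(
            {
                "comm": header["comm"],
                "pid": int(header["pid"]),
                "parent": int(header["parent"]),
                "frames": frames,
            }
        )
    return tasks
```

Because both patterns are matched against the whole text rather than line by line, the function also copes with captures whose line breaks were lost in transit.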
Created attachment 99028 [details] sysrq + t m m m m w w w p p p p
Didn't mean to close this one, reopening.
Can someone determine if this is still a problem in the latest RHEL3 kernel? This bug is very old and there have been lots of fixes and updates included in the kernel since RHEL3-U1.

Larry Woodman
This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed. For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.