Bug 231276 - Java application Hangs for few seconds (~12 sec)
Java application Hangs for few seconds (~12 sec)
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: java-1.5.0-ibm (Show other bugs)
4.0
i386 Linux
medium Severity urgent
: ---
: ---
Assigned To: Thomas Fitzsimmons
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2007-03-07 04:59 EST by Neeta Gupta
Modified: 2007-11-30 17:07 EST (History)
2 users (show)

See Also:
Fixed In Version: 1.5.0.3-1jpp.3.el4
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-03-15 16:53:13 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
GC log outout for the test (4.32 MB, text/plain)
2007-03-08 04:27 EST, Neeta Gupta
no flags Details
strace output (3.71 MB, application/x-zip-compressed)
2007-03-09 07:48 EST, Neeta Gupta
no flags Details
This file contains filter for mprotect calls from complete strace output. (628.71 KB, application/octet-stream)
2007-03-30 05:27 EDT, Neeta Gupta
no flags Details

  None (edit)
Description Neeta Gupta 2007-03-07 04:59:19 EST
Description of problem:
We are have developed a Java application server. We have noticed that on Linux 
OS our application hangs for few seconds (~12 seconds ). After this  short 
hang period, application continue to work as usual. 

Our application is a multi threaded server application. We noticed that 
randomly the whole java application hangs, i.e. all threads including GC 
thread for several seconds.

We have got GC logging enabled but there is no evidence of GC even taking that 
long.

The same application works fine on Solaris and Windows machines.

Will you please suggest if there are any known issues or any tools available 
to trace down the cause of this hangs issue.

We have tried running gdb and catching bt output below:
-----------------------------------------------------------
0x0097b7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#0  0x0097b7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00be9b16 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/tls/libpthread.so.0
#2  0x00bea15e in pthread_cond_wait@GLIBC_2.0 () from /lib/tls/libpthread.so.0
#3  0xb7a3b8de in os::Linux::safe_cond_wait ()
   from /usr/jdk1.5.0_07/jre/lib/i386/server/libjvm.so
#4  0xb7a235d4 in Monitor::wait ()
   from /usr/jdk1.5.0_07/jre/lib/i386/server/libjvm.so
#5  0xb7adf0fd in Threads::destroy_vm ()
   from /usr/jdk1.5.0_07/jre/lib/i386/server/libjvm.so
#6  0xb78a2abe in jni_DestroyJavaVM ()
   from /usr/jdk1.5.0_07/jre/lib/i386/server/libjvm.so
#7  0x0804960d in ?? ()
#8  0xb7bb13c0 in jni_NativeInterface ()
   from /usr/jdk1.5.0_07/jre/lib/i386/server/libjvm.so
#9  0x0805d424 in ?? ()
#10 0x8d107b02 in ?? ()
#11 0x0805d414 in ?? ()
#12 0xbfffd794 in ?? ()
#13 0x00000fff in ?? ()
#14 0x08059008 in ?? ()
#15 0x00000000 in ?? ()
----------------------------------------------------------- 

We have tried running application with
LD_ASSUME_KERNEL=2.4.1
and
LD_ASSUME_KERNEL=2.4.19

We found that with, LD_ASSUME_KERNEL=2.4.1 the hang issue occur less 
frequently but eventually we hit this problem.


We have noticed following links mentioning an issue, with Java application 
hangs on some Linux kernels.

http://evanjones.ca/java-linux.html
http://www.mail-archive.com/java-linux@java.blackdown.org/msg15589.html
http://www-306.ibm.com/software/data/db2/linux/validate/rhel3faq.html
http://forum.java.sun.com/thread.jspa?threadID=580350            
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4970650

Will you please let us know,
- if it is a known issue 
- and and if it has been fixed then in which release?
- is there a work around?

Please reply as soon as possible, we provide a product wiith high 
availability, which means applition not responding for more >10 seconds is not 
expectable to our customers.
 

Version-Release number of selected component (if applicable):
Linux 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 EDT 2006 i686 i686 i386 
GNU/Linux

How reproducible:
With our SIP A/S application it can be reproduced in ~1 Hrs, with following GC 
settings:

/usr/jdk1.5.0_07/bin/java -server -XX:NewSize=32m -XX:MaxNewSize=32m -
XX:+DisableExplicitGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -
Xloggc:log_gc.log -Xms512m -Xmx512m 

Steps to Reproduce:
At the moment I do not have a small test application to help you with 
replicating this issue. We have noticed this issue while testing our 
commercial application.

please let us know, if I can provide any further detail on this issue, we 
might also be able to provide you access to our application.
  
Actual results:
Java process hangs for ~12 seconds

Expected results:
Java application should not hang for no aparent reason.


Additional info:
Linux gbcdff1273m.eu.ubiquity.net 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:27:17 
EDT 2006 i686 i686 i386 GNU/Linux
Red Hat Enterprise Linux ES release 4 (Nahant Update 3)
Comment 1 Neeta Gupta 2007-03-08 04:24:50 EST
I would also like to mention that we mintored GC logs closely and have noticed 
for the pause duration (~12 sec), there was no GC activity being logged. 
Whereas with the GC setting we were using we were expecting some GC activity 
going almost every second.

Below is the extract from GC log for the duration we noticed pause, the 
complete GC log is attached:


33290.753: [GC 33290.753: [ParNew: 32619K->0K(32704K), 0.0559830 secs] 305364K-
>275688K(524224K), 0.0564280 secs]
33302.175: [GC 33302.176: [ParNew: 32640K->0K(32704K), 0.0793950 secs] 308328K-
>278783K(524224K), 0.0804620 secs]
Comment 2 Neeta Gupta 2007-03-08 04:28:04 EST
Created attachment 149556 [details]
GC log outout for the test

The attache logs contains GC output, you will see that GC activity is being
logged almost every second, but between time stamp 3290.753 and 33302.175 there
is nothing logged, which make us think that even GC thread is not being
executed when application Hang.
Comment 3 Neeta Gupta 2007-03-09 07:48:49 EST
Created attachment 149678 [details]
strace output

I also run strace, the total output is > 1 GB.

However attached is the output from when my script identified that java process
is not responding for last 4 seconds.

I used following command from the script:
 
strace -p$PID -f -ttt -o strace.log

I have no experience in parsing strace file, but I am concern that following
line apparse to be the last line for process 23491.

The java command was running with 23490 PID, and looking at other lines for
process 23490, I assume that 23491 is immediate child thread of process 23490.

In the whole 1G output, I do not see “futex resumed” call since following
line printed. 

If required, I can provide you access to FTP server to obtain the give you
remaining output of strace command. 

23491 1173433909.823682 futex(0x900f9c0c, FUTEX_WAIT, 1, NULL <unfinished ...>
Comment 4 Neeta Gupta 2007-03-12 18:02:20 EDT
As we continue to test our application on diferrent RedHat versions, we 
noticed that our RHEL 4 test machine was using Using Update 3. 

Whenn we upgraded our box for Update 4, we noticed that test is runnign with 
no hang issue for last three days..

We are running other tests in parllel to verify our results, but will you 
please advise us on changes done in Update 4, what exactly do you think should 
have helped us in the issue we were seeing.



 
Comment 5 Neeta Gupta 2007-03-15 16:07:14 EDT
Hi,

As I mentioned earlier, we have noticed that the test we run on RHEL AS4 
Update 4 is not showing this issue.


As we are working closely with our client, we need to understand the actual 
cause and we need to be sure that on what specific platform issue does not 
exist.

 
From experience of other Java/Redhat users encountering same problem, would 
you recommend RHEL AS4 Update 4 or any other LINUX installation as a confirm 
solution? 


PLEASE COMMENT ASAP.

Thanks,
Neeta



Comment 6 Thomas Fitzsimmons 2007-03-15 16:53:13 EDT
(In reply to comment #5)
> Hi,
> 
> As I mentioned earlier, we have noticed that the test we run on RHEL AS4 
> Update 4 is not showing this issue.
> 
> 
> As we are working closely with our client, we need to understand the actual 
> cause and we need to be sure that on what specific platform issue does not 
> exist.
> 
>  
> From experience of other Java/Redhat users encountering same problem, would 
> you recommend RHEL AS4 Update 4 or any other LINUX installation as a confirm 
> solution? 

Yes, use the latest java-1.5.0-ibm packages from RHN.
Comment 7 Thomas Fitzsimmons 2007-03-15 17:01:45 EDT
(In reply to comment #6)
> (In reply to comment #5)
> > From experience of other Java/Redhat users encountering same problem, would 
> > you recommend RHEL AS4 Update 4 or any other LINUX installation as a confirm 
> > solution? 
> 
> Yes, use the latest java-1.5.0-ibm packages from RHN.
> 

And ensure that your system is up-to-date.  The fact that this was fixed by
updating from RHEL-4.3 to RHEL-4.4 suggests that this was not caused by the JVM
alone, but by some interaction between it and the underlying system.

I'd suggest re-testing your application against RHEL-4.5 Beta, and re-opening
this bug if the problem re-appears.
Comment 8 Shawn OReilly 2007-03-16 08:44:37 EDT
Hi Thomas,

For the recommandation to use java-1.5.0-ibm from the RHN; is there a known
problem (or previous bug) with the Sun JVM?

Cheers,

Shawn
Comment 9 Shawn OReilly 2007-03-16 08:45:03 EDT
Hi Thomas,

For the recommendation to use java-1.5.0-ibm from the RHN; is there a known
problem (or previous bug) with the Sun JVM?

Cheers,

Shawn
Comment 10 Neeta Gupta 2007-03-16 09:41:34 EDT
I apologise for the incorrect component associated with this bug, at the time 
I raised that defect, I did not know what component to select, and bugzilla 
didn't allow me to raise a defect without component. Hence I end up selecting 
IBM java.

We are not using IBM java, we are using SUN JDK (build 1.5.0_07-b03).

Comment 11 Thomas Fitzsimmons 2007-03-16 10:28:16 EDT
(In reply to comment #9)
> Hi Thomas,
> 
> For the recommendation to use java-1.5.0-ibm from the RHN; is there a known
> problem (or previous bug) with the Sun JVM?

I'm not aware of one.  We currently only ship IBM and BEA's JVMs on the RHEL
supplementary discs.
Comment 12 Russell Doty 2007-03-23 12:26:02 EDT
Neeta,

Unfortunately, we do not have a support agreement in place with Sun for their
JVM. This makes it distinctly challenging to resolve issues like this one.

We do have support agreements in place with IBM and BEA for their JVMs. This
gives us a mechanism for addressing problems and for escalating problems if they
prove to be in the JVM.

We would really need a reproducer to determine what the underlying problem is.
Since one of our recommendations is to see if the problem exists on the latest
shipping version of RHEL, it sounds like you are already making progress.

Is there an issue with using RHEL 4.4? Based on your problem description, it
sounds like stress testing on RHEL 4.4 would be the logical next step.
Comment 13 Neeta Gupta 2007-03-30 05:23:33 EDT
Hi,

Thanks for your recommendation,

We did see failures with RHEL 4.4 on multiprocessor machines. However it seems 
to be more stable on single processor with HT enabled.

While trying to narrow down the cause of the issue I have filter the 
strace.log for all messages for "mprotect" call. I noticed following block of 
messages, where an mprotect call is being made at '1173434782.998688' but not 
ressumed until at least 1173434791.370265, this is an ~8 seconds gape. Would 
you think that cause java hang issue?

Attached is the file containing complete file filtering mprotact calls from 
the strace output.


23494 1173434782.927869 <... mprotect resumed> ) = 0
23494 1173434782.998688 mprotect(0xb7f84000, 4096, PROT_READ) = 0
23494 1173434785.680847 mprotect(0xb7f83000, 4096, PROT_READ) = 0
23494 1173434785.680966 mprotect(0xb7f83000, 4096, 
PROT_READ|PROT_WRITE|PROT_EXEC) = 0
23494 1173434785.681088 mprotect(0xb7f84000, 4096, PROT_NONE) = 0
23494 1173434785.736099 mprotect(0xb7f84000, 4096, PROT_READ) = 0
23494 1173434788.649178 mprotect(0xb7f83000, 4096, PROT_READ) = 0
23494 1173434788.649297 mprotect(0xb7f83000, 4096, 
PROT_READ|PROT_WRITE|PROT_EXEC) = 0
23494 1173434788.649423 mprotect(0xb7f84000, 4096, PROT_NONE) = 0
23494 1173434788.706982 mprotect(0xb7f84000, 4096, PROT_READ) = 0
23494 1173434791.370106 mprotect(0xb7f83000, 4096, PROT_READ <unfinished ...>
23494 1173434791.370265 <... mprotect resumed> ) = 0



Regards,
Neeta




Comment 14 Neeta Gupta 2007-03-30 05:27:34 EDT
Created attachment 151261 [details]
This file contains filter for mprotect calls from complete strace output.

Note You need to log in before you can comment on or make changes to this bug.