Bug 166292
Summary: | NFS client, NLM, java 1.4.2_08, readlock hang on futex | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 3 | Reporter: | mark meierjohann <mark> |
Component: | kernel | Assignee: | Red Hat Kernel Manager <kernel-mgr> |
Status: | CLOSED WONTFIX | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 3.0 | CC: | andy.fletcher, emmerson.steven, jsquires, mingo, petrides, sarah.e.thompson, sputhenp, steved, supermike, terryj |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2007-10-19 18:55:47 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 170417, 345301 |
Description
mark meierjohann
2005-08-18 19:57:40 UTC
this may be related to 141282, I'm seeing this problem still in kernel 2.4.21-32.0.1.ELsmp http://seer.support.veritas.com/docs/277241.htm Process 3759 attached - interrupt to quit futex(0xb65a5844, FUTEX_WAIT, 2, NULL and also something similar to the above, on a different server with a java thread (no nfs involved) --- SIGUSR2 (User defined signal 2) @ 0 (0) --- <... rt_sigsuspend resumed> ) = -1 EINTR (Interrupted system call) rt_sigreturn(0x791fe28c) = -1 EINTR (Interrupted system call) rt_sigreturn(0x800) = 1098 --- SIGILL (Illegal instruction) @ 0 (0) --- rt_sigreturn(0x800) = 1098 futex(0x805b174, FUTEX_WAIT, 2, NULL) = 0 futex(0x88b309c, 0x4 /* FUTEX_??? */, 1) = 0 futex(0x805b174, FUTEX_WAKE, 1) = 1 futex(0x805b1d4, FUTEX_WAIT, 2, NULL) = 0 futex(0x805b1d4, FUTEX_WAKE, 1) = 0 --- SIGUSR2 (User defined signal 2) @ 0 (0) --- rt_sigprocmask(SIG_BLOCK, NULL, [QUIT USR2], 8) = 0 rt_sigsuspend([QUIT] <unfinished ...> --- SIGUSR2 (User defined signal 2) @ 0 (0) --- <... rt_sigsuspend resumed> ) = -1 EINTR (Interrupted system call) rt_sigreturn(0x791fe28c) = -1 EINTR (Interrupted system call) rt_sigreturn(0x800) = 1098 --- SIGILL (Illegal instruction) @ 0 (0) --- rt_sigreturn(0x800) = 1098 futex(0x805b174, FUTEX_WAIT, 2, NULL) = 0 futex(0x805b174, FUTEX_WAKE, 1) = 1 futex(0x805b1d4, FUTEX_WAIT, 2, NULL) = 0 futex(0x805b1d4, FUTEX_WAKE, 1) = 0 also please see Bug 153722 and Bug 145021 appears the same issue is in the 2.6 kernel, as the problem has been reported in FC4, I've just seen it on 2.6.12-1.1447_FC4; range of apps now reported with same issue include veritas netbackup, java, up2date, yum, evolution Just found another case where I can reproduce it, this time on the latest kernel from update6 2.4.21-37.ELsmp running the java installer for oracle 9i: /tmp/OraInstall2005-10-04_09-10-14AM/jre/bin/i386/native_threads/java Process 5649 attached - interrupt to quit futex(0xa2b6bbac, FUTEX_WAIT, 1211, NULL "out of the box" install for oracle9i on unmodified rhel3 update6 once again I got around this by reverting to linuxthreads (LD_ASSUME_KERNEL=2.4.19) it works fine when not using nptl. I ran into this issue again with oracle application server 10g (9.0.4) on rhel3u6 2.4.21-37.ELsmp on 3 servers, all default installs of os and oas. Java process hangs on futex_wait; the version shipped with oracle is: java version "1.4.2_02" Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_02-b03) Java HotSpot(TM) Client VM (build 1.4.2_02-b03, mixed mode) It happens exactly the same way each time, on 3 separate servers and I can reproduce it at will. We have meanwhile switched to NFS over tcp and are seeing the same problem. This bug seems to be somewhere in the backports of the 2.6 features to the 2.4 kernel (nptl), etc. We will now test a plain vanilla kernel.org kernel and see if that helps us. Since this is a large enterprise customer we will have to re-open the calls we had open, before. Did you get anywhere with the vanilla kernel? I also ran into this problem with Unidata's ldm. I tried 2.4 and 2.6 kernels, every flavor of OS. The upstream ldm server would serve data for awhile, and then the child process' would report as "defunct" and the downstream servers couldn't reconnect, stopping all ldm communication. When doing an strace on the pid I see the futex wait. Sometimes I could reproduce it in a day, other times it would take weeks, but always the same result. I put in the ld_assume_kernel env. variable and it seems much more stable. I also ran into this problem with Unidata's ldm. I tried 2.4 and 2.6 kernels, every flavor of OS. The upstream ldm server would serve data for awhile, and then the child process' would report as "defunct" and the downstream servers couldn't reconnect, stopping all ldm communication. When doing an strace on the pid I see the futex wait. Sometimes I could reproduce it in a day, other times it would take weeks, but always the same result. I put in the ld_assume_kernel env. variable and it seems much more stable. I was seeing the problem with Apache+PHP. I disabled nptl and updated everything to the very latest packages from RHN and this appears to have solved the problem, at least for me and at least for the moment (1 week+ and counting). (In reply to comment #9) I wish to expand on Sarah Thompson's information. The Unidata LDM is coded to to the _XOPEN_SOURCE=500 standard and makes extensive use of read and write locking (via fcntl()) of an mmap()ed file. Fedora is the only known operating-system on which the LDM exhibits this problem. Regards, Steve Emmerson LDM Developer University Corporation for Atmospheric Research I find this problem any time I run any version of java (tried a variety of 1.4.2 32-bit versions as well as 1.5.0_03 64-bit) in cron or batch; strangely, java works fine interactively. This is on AS3 update 6, kernel 2.4.21-37EL on x86_64. For example, to reproduce: jsquires@bobo:~$ batch at> /opt/jdk1.4.2_09/bin/java -version at> <EOT> job 4 at 2005-11-08 14:15 <wait> jsquires@bobo:~$ ps x | grep java 4035 ? SN 0:00 /opt/jdk1.4.2_09/bin/java -version jsquires@bobo:~$ strace -p 4035 Process 4035 attached - interrupt to quit [ Process PID=4035 runs in 32 bit mode. ] futex(0x80a9b84, FUTEX_WAIT, 1, NULL <unfinished ...> Process 4035 detached Using LD_ASSUME_KERNEL=2.4.1 fixes this for me, but it would be nice to find out what the root of the problem is. Mark, Is this still an issue? P. Prarit, I can't speak for Mark, but at least one LDM user has recently encountered this bug. I've given him this URL and asked if I can post his email address. Steve Emmerson Prarit, James M. Pelagatti <jamie.edu> is the one who encountered this bug. You may contact him. He doesn't encounter it often. Here's his "uname -a": Linux llwxldm2 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux --Steve Emmerson This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet that criteria, it is now being closed. For more information of the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you. |