From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050721 CentOS/1.0.6-1.4.1.centos4 Firefox/1.0.6 Description of problem: A java based program heavily using file locking over NFS (NFS Server is RH AS 2.1, Veritas Cluster 2.2) hangs regularly (not allways) during nfs read and/or write locks. The NFS Client (udp) is running RedHat EL3, kernel-smp-2.4.21-32.0.1.EL nfs-utils-1.0.6-33EL, iptables (no rejects in log, all rejects logged), SUN j2re1.4.2_08 (reproducable with IBM JRE, from RedHat EL Extras), bonding-1.0.4t-1. Hardware: HP Proliant DL580G3, Intel e100 dual port and e1000 dual port (e1000-6.0.60-1 e100-3.1.4q-1) A strace of the java software shows that the futex call sometimes just hangs for 10sec, 20sec, 30secs or even up to 60secs. The higher the nfs-load the more often it hangs. 13:59:36 gettimeofday({1124366376, 393387}, NULL) = 0 13:59:36 futex(0x807dbc4, FUTEX_WAIT, 5203, {0, 544848000}) = -1 ETIMEDOUT (Connection timed out) 13:59:36 futex(0x805bf14, FUTEX_WAKE, 1) = 0 13:59:36 gettimeofday({1124366376, 952476}, NULL) = 0 > 13:59:36 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0 13:59:46 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0 13:59:46 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0 13:59:46 gettimeofday({1124366386, 977241}, NULL) = 0 During the hanging state only the following part of the marked line is displayed: > 13:59:36 futex(0x809248c, FUTEX_WAIT, 2, NULL A simultaneous tcpdump on the NFS client shows that _all_ NLM Lock requests are answered within max. 300msecs, typically within 1-5 msecs. Thus, the nfs server does not seem to be the problem. The NFS mount options are (rw,rsize=8192,wsize=8192,hard,intr,addr=10.x.x.x). The problem occurs more often if there is more traffic. It can be reproduced on two seperate, but equivalent environments (pre-live and live). Version-Release number of selected component (if applicable): kernel-smp-2.4.21-32.0.1.EL nfs-utils-1.0.6-33EL How reproducible: Sometimes Steps to Reproduce: 1. heavily use file locking from two nfs clients to the same nfs server, on the same filesystem 2. measure the time required for a lock, 3. strace your program Actual Results: as above Additional info: Test program output [ 7ms] Read: 154854:nodeB - the time is: Thu Aug 18 13:59:35 CEST 2005 [ 19ms] Aquired write-lock. [ 12ms] Released lock. [ 150ms] written line. [ 87ms] ReadLock [ 10365ms] Read: 154856:nodeB - the time is: Thu Aug 18 13:59:36 CEST 2005 [ 169ms] ReadLock [ 184ms] Read: 154857:nodeB - the time is: Thu Aug 18 13:59:46 CEST 2005 [ 4ms] Aquired write-lock. [ 8ms] Released lock. [ 10038ms] written line. [ 120ms] ReadLock [ 129ms] Read: 154858:nodeB - the time is: Thu Aug 18 13:59:46 CEST 2005 [ 176ms] ReadLock [ 188ms] Read: 154859:nodeB - the time is: Thu Aug 18 13:59:47 CEST 2005 [ 9ms] Aquired write-lock. [ 21ms] Released lock. [ 32ms] written line. [ 6ms] ReadLock [ 11ms] Read: 154860:nodeB - the time is: Thu Aug 18 13:59:47 CEST 2005 strace: 13:59:36 gettimeofday({1124366376, 393235}, NULL) = 0 13:59:36 clock_gettime(0, 0xbfffbbac) = -1 ENOSYS (Function not implemented) 13:59:36 gettimeofday({1124366376, 393387}, NULL) = 0 13:59:36 futex(0x807dbc4, FUTEX_WAIT, 5203, {0, 544848000}) = -1 ETIMEDOUT (Conn ection timed out) 13:59:36 futex(0x805bf14, FUTEX_WAKE, 1) = 0 13:59:36 gettimeofday({1124366376, 952476}, NULL) = 0 13:59:36 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0 13:59:46 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0 13:59:46 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0 13:59:46 gettimeofday({1124366386, 977241}, NULL) = 0 13:59:46 _llseek(3, 0, [8870654], SEEK_CUR) = 0 13:59:46 _llseek(3, 0, [8870654], SEEK_END) = 0 13:59:46 _llseek(3, 8870654, [8870654], SEEK_SET) = 0 13:59:46 fcntl64(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=8870654, len =2138612993}) = 0
this may be related to 141282, I'm seeing this problem still in kernel 2.4.21-32.0.1.ELsmp http://seer.support.veritas.com/docs/277241.htm Process 3759 attached - interrupt to quit futex(0xb65a5844, FUTEX_WAIT, 2, NULL and also something similar to the above, on a different server with a java thread (no nfs involved) --- SIGUSR2 (User defined signal 2) @ 0 (0) --- <... rt_sigsuspend resumed> ) = -1 EINTR (Interrupted system call) rt_sigreturn(0x791fe28c) = -1 EINTR (Interrupted system call) rt_sigreturn(0x800) = 1098 --- SIGILL (Illegal instruction) @ 0 (0) --- rt_sigreturn(0x800) = 1098 futex(0x805b174, FUTEX_WAIT, 2, NULL) = 0 futex(0x88b309c, 0x4 /* FUTEX_??? */, 1) = 0 futex(0x805b174, FUTEX_WAKE, 1) = 1 futex(0x805b1d4, FUTEX_WAIT, 2, NULL) = 0 futex(0x805b1d4, FUTEX_WAKE, 1) = 0 --- SIGUSR2 (User defined signal 2) @ 0 (0) --- rt_sigprocmask(SIG_BLOCK, NULL, [QUIT USR2], 8) = 0 rt_sigsuspend([QUIT] <unfinished ...> --- SIGUSR2 (User defined signal 2) @ 0 (0) --- <... rt_sigsuspend resumed> ) = -1 EINTR (Interrupted system call) rt_sigreturn(0x791fe28c) = -1 EINTR (Interrupted system call) rt_sigreturn(0x800) = 1098 --- SIGILL (Illegal instruction) @ 0 (0) --- rt_sigreturn(0x800) = 1098 futex(0x805b174, FUTEX_WAIT, 2, NULL) = 0 futex(0x805b174, FUTEX_WAKE, 1) = 1 futex(0x805b1d4, FUTEX_WAIT, 2, NULL) = 0 futex(0x805b1d4, FUTEX_WAKE, 1) = 0
also please see Bug 153722 and Bug 145021 appears the same issue is in the 2.6 kernel, as the problem has been reported in FC4, I've just seen it on 2.6.12-1.1447_FC4; range of apps now reported with same issue include veritas netbackup, java, up2date, yum, evolution
Just found another case where I can reproduce it, this time on the latest kernel from update6 2.4.21-37.ELsmp running the java installer for oracle 9i: /tmp/OraInstall2005-10-04_09-10-14AM/jre/bin/i386/native_threads/java Process 5649 attached - interrupt to quit futex(0xa2b6bbac, FUTEX_WAIT, 1211, NULL "out of the box" install for oracle9i on unmodified rhel3 update6 once again I got around this by reverting to linuxthreads (LD_ASSUME_KERNEL=2.4.19) it works fine when not using nptl.
I ran into this issue again with oracle application server 10g (9.0.4) on rhel3u6 2.4.21-37.ELsmp on 3 servers, all default installs of os and oas. Java process hangs on futex_wait; the version shipped with oracle is: java version "1.4.2_02" Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_02-b03) Java HotSpot(TM) Client VM (build 1.4.2_02-b03, mixed mode) It happens exactly the same way each time, on 3 separate servers and I can reproduce it at will.
We have meanwhile switched to NFS over tcp and are seeing the same problem. This bug seems to be somewhere in the backports of the 2.6 features to the 2.4 kernel (nptl), etc. We will now test a plain vanilla kernel.org kernel and see if that helps us. Since this is a large enterprise customer we will have to re-open the calls we had open, before.
Did you get anywhere with the vanilla kernel?
I also ran into this problem with Unidata's ldm. I tried 2.4 and 2.6 kernels, every flavor of OS. The upstream ldm server would serve data for awhile, and then the child process' would report as "defunct" and the downstream servers couldn't reconnect, stopping all ldm communication. When doing an strace on the pid I see the futex wait. Sometimes I could reproduce it in a day, other times it would take weeks, but always the same result. I put in the ld_assume_kernel env. variable and it seems much more stable.
I was seeing the problem with Apache+PHP. I disabled nptl and updated everything to the very latest packages from RHN and this appears to have solved the problem, at least for me and at least for the moment (1 week+ and counting).
(In reply to comment #9) I wish to expand on Sarah Thompson's information. The Unidata LDM is coded to to the _XOPEN_SOURCE=500 standard and makes extensive use of read and write locking (via fcntl()) of an mmap()ed file. Fedora is the only known operating-system on which the LDM exhibits this problem. Regards, Steve Emmerson LDM Developer University Corporation for Atmospheric Research
I find this problem any time I run any version of java (tried a variety of 1.4.2 32-bit versions as well as 1.5.0_03 64-bit) in cron or batch; strangely, java works fine interactively. This is on AS3 update 6, kernel 2.4.21-37EL on x86_64. For example, to reproduce: jsquires@bobo:~$ batch at> /opt/jdk1.4.2_09/bin/java -version at> <EOT> job 4 at 2005-11-08 14:15 <wait> jsquires@bobo:~$ ps x | grep java 4035 ? SN 0:00 /opt/jdk1.4.2_09/bin/java -version jsquires@bobo:~$ strace -p 4035 Process 4035 attached - interrupt to quit [ Process PID=4035 runs in 32 bit mode. ] futex(0x80a9b84, FUTEX_WAIT, 1, NULL <unfinished ...> Process 4035 detached Using LD_ASSUME_KERNEL=2.4.1 fixes this for me, but it would be nice to find out what the root of the problem is.
Mark, Is this still an issue? P.
Prarit, I can't speak for Mark, but at least one LDM user has recently encountered this bug. I've given him this URL and asked if I can post his email address. Steve Emmerson
Prarit, James M. Pelagatti <jamie.edu> is the one who encountered this bug. You may contact him. He doesn't encounter it often. Here's his "uname -a": Linux llwxldm2 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux --Steve Emmerson
This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet that criteria, it is now being closed. For more information of the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.