Bringing this RHEL3 BZ forward to RHEL4, since the problem still exists +++ This bug was initially created as a clone of Bug #166292 +++ From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050721 CentOS/1.0.6-1.4.1.centos4 Firefox/1.0.6 Description of problem: A java based program heavily using file locking over NFS (NFS Server is RH AS 2.1, Veritas Cluster 2.2) hangs regularly (not allways) during nfs read and/or write locks. The NFS Client (udp) is running RedHat EL3, kernel-smp-2.4.21-32.0.1.EL nfs-utils-1.0.6-33EL, iptables (no rejects in log, all rejects logged), SUN j2re1.4.2_08 (reproducable with IBM JRE, from RedHat EL Extras), bonding-1.0.4t-1. Hardware: HP Proliant DL580G3, Intel e100 dual port and e1000 dual port (e1000-6.0.60-1 e100-3.1.4q-1) A strace of the java software shows that the futex call sometimes just hangs for 10sec, 20sec, 30secs or even up to 60secs. The higher the nfs-load the more often it hangs. 13:59:36 gettimeofday({1124366376, 393387}, NULL) = 0 13:59:36 futex(0x807dbc4, FUTEX_WAIT, 5203, {0, 544848000}) = -1 ETIMEDOUT (Connection timed out) 13:59:36 futex(0x805bf14, FUTEX_WAKE, 1) = 0 13:59:36 gettimeofday({1124366376, 952476}, NULL) = 0 > 13:59:36 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0 13:59:46 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0 13:59:46 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0 13:59:46 gettimeofday({1124366386, 977241}, NULL) = 0 During the hanging state only the following part of the marked line is displayed: > 13:59:36 futex(0x809248c, FUTEX_WAIT, 2, NULL A simultaneous tcpdump on the NFS client shows that _all_ NLM Lock requests are answered within max. 300msecs, typically within 1-5 msecs. Thus, the nfs server does not seem to be the problem. The NFS mount options are (rw,rsize=8192,wsize=8192,hard,intr,addr=10.x.x.x). The problem occurs more often if there is more traffic. It can be reproduced on two seperate, but equivalent environments (pre-live and live). Version-Release number of selected component (if applicable): kernel-smp-2.4.21-32.0.1.EL nfs-utils-1.0.6-33EL How reproducible: Sometimes Steps to Reproduce: 1. heavily use file locking from two nfs clients to the same nfs server, on the same filesystem 2. measure the time required for a lock, 3. strace your program Actual Results: as above Additional info: Test program output [ 7ms] Read: 154854:nodeB - the time is: Thu Aug 18 13:59:35 CEST 2005 [ 19ms] Aquired write-lock. [ 12ms] Released lock. [ 150ms] written line. [ 87ms] ReadLock [ 10365ms] Read: 154856:nodeB - the time is: Thu Aug 18 13:59:36 CEST 2005 [ 169ms] ReadLock [ 184ms] Read: 154857:nodeB - the time is: Thu Aug 18 13:59:46 CEST 2005 [ 4ms] Aquired write-lock. [ 8ms] Released lock. [ 10038ms] written line. [ 120ms] ReadLock [ 129ms] Read: 154858:nodeB - the time is: Thu Aug 18 13:59:46 CEST 2005 [ 176ms] ReadLock [ 188ms] Read: 154859:nodeB - the time is: Thu Aug 18 13:59:47 CEST 2005 [ 9ms] Aquired write-lock. [ 21ms] Released lock. [ 32ms] written line. [ 6ms] ReadLock [ 11ms] Read: 154860:nodeB - the time is: Thu Aug 18 13:59:47 CEST 2005 strace: 13:59:36 gettimeofday({1124366376, 393235}, NULL) = 0 13:59:36 clock_gettime(0, 0xbfffbbac) = -1 ENOSYS (Function not implemented) 13:59:36 gettimeofday({1124366376, 393387}, NULL) = 0 13:59:36 futex(0x807dbc4, FUTEX_WAIT, 5203, {0, 544848000}) = -1 ETIMEDOUT (Conn ection timed out) 13:59:36 futex(0x805bf14, FUTEX_WAKE, 1) = 0 13:59:36 gettimeofday({1124366376, 952476}, NULL) = 0 13:59:36 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0 13:59:46 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0 13:59:46 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0 13:59:46 gettimeofday({1124366386, 977241}, NULL) = 0 13:59:46 _llseek(3, 0, [8870654], SEEK_CUR) = 0 13:59:46 _llseek(3, 0, [8870654], SEEK_END) = 0 13:59:46 _llseek(3, 8870654, [8870654], SEEK_SET) = 0 13:59:46 fcntl64(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=8870654, len =2138612993}) = 0 -- Additional comment from supermike on 2005-09-12 12:31 EST -- this may be related to 141282, I'm seeing this problem still in kernel 2.4.21-32.0.1.ELsmp http://seer.support.veritas.com/docs/277241.htm Process 3759 attached - interrupt to quit futex(0xb65a5844, FUTEX_WAIT, 2, NULL and also something similar to the above, on a different server with a java thread (no nfs involved) --- SIGUSR2 (User defined signal 2) @ 0 (0) --- <... rt_sigsuspend resumed> ) = -1 EINTR (Interrupted system call) rt_sigreturn(0x791fe28c) = -1 EINTR (Interrupted system call) rt_sigreturn(0x800) = 1098 --- SIGILL (Illegal instruction) @ 0 (0) --- rt_sigreturn(0x800) = 1098 futex(0x805b174, FUTEX_WAIT, 2, NULL) = 0 futex(0x88b309c, 0x4 /* FUTEX_??? */, 1) = 0 futex(0x805b174, FUTEX_WAKE, 1) = 1 futex(0x805b1d4, FUTEX_WAIT, 2, NULL) = 0 futex(0x805b1d4, FUTEX_WAKE, 1) = 0 --- SIGUSR2 (User defined signal 2) @ 0 (0) --- rt_sigprocmask(SIG_BLOCK, NULL, [QUIT USR2], 8) = 0 rt_sigsuspend([QUIT] <unfinished ...> --- SIGUSR2 (User defined signal 2) @ 0 (0) --- <... rt_sigsuspend resumed> ) = -1 EINTR (Interrupted system call) rt_sigreturn(0x791fe28c) = -1 EINTR (Interrupted system call) rt_sigreturn(0x800) = 1098 --- SIGILL (Illegal instruction) @ 0 (0) --- rt_sigreturn(0x800) = 1098 futex(0x805b174, FUTEX_WAIT, 2, NULL) = 0 futex(0x805b174, FUTEX_WAKE, 1) = 1 futex(0x805b1d4, FUTEX_WAIT, 2, NULL) = 0 futex(0x805b1d4, FUTEX_WAKE, 1) = 0 -- Additional comment from supermike on 2005-09-14 15:01 EST -- also please see Bug 153722 and Bug 145021 appears the same issue is in the 2.6 kernel, as the problem has been reported in FC4, I've just seen it on 2.6.12-1.1447_FC4; range of apps now reported with same issue include veritas netbackup, java, up2date, yum, evolution -- Additional comment from supermike on 2005-10-04 12:27 EST -- Just found another case where I can reproduce it, this time on the latest kernel from update6 2.4.21-37.ELsmp running the java installer for oracle 9i: /tmp/OraInstall2005-10-04_09-10-14AM/jre/bin/i386/native_threads/java Process 5649 attached - interrupt to quit futex(0xa2b6bbac, FUTEX_WAIT, 1211, NULL "out of the box" install for oracle9i on unmodified rhel3 update6 once again I got around this by reverting to linuxthreads (LD_ASSUME_KERNEL=2.4.19) it works fine when not using nptl. -- Additional comment from tao on 2005-10-04 16:18 EST -- From User-Agent: XML-RPC I don't see any evidence that the patch provided by the developers in bug 141282 was proven to fix the issue. I'll go ahead and send this up, since it does seem like the same problem. This event sent from IssueTracker by streeter issue 80399 -- Additional comment from supermike on 2005-10-07 11:34 EST -- I ran into this issue again with oracle application server 10g (9.0.4) on rhel3u6 2.4.21-37.ELsmp on 3 servers, all default installs of os and oas. Java process hangs on futex_wait; the version shipped with oracle is: java version "1.4.2_02" Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_02-b03) Java HotSpot(TM) Client VM (build 1.4.2_02-b03, mixed mode) It happens exactly the same way each time, on 3 separate servers and I can reproduce it at will. -- Additional comment from mark.meierjohann on 2005-10-18 10:08 EST -- We have meanwhile switched to NFS over tcp and are seeing the same problem. This bug seems to be somewhere in the backports of the 2.6 features to the 2.4 kernel (nptl), etc. We will now test a plain vanilla kernel.org kernel and see if that helps us. Since this is a large enterprise customer we will have to re-open the calls we had open, before. -- Additional comment from steved on 2005-10-28 05:58 EST -- Ingo, It really seem like this is a dup of bz141282 but they are claiming the bug still exists.... do you think it still is a bug with futexs or something else like NLM.... -- Additional comment from sarah.e.thompson on 2005-11-01 15:00 EST -- I also ran into this problem with Unidata's ldm. I tried 2.4 and 2.6 kernels, every flavor of OS. The upstream ldm server would serve data for awhile, and then the child process' would report as "defunct" and the downstream servers couldn't reconnect, stopping all ldm communication. When doing an strace on the pid I see the futex wait. Sometimes I could reproduce it in a day, other times it would take weeks, but always the same result. I put in the ld_assume_kernel env. variable and it seems much more stable. -- Additional comment from sarah.e.thompson on 2005-11-01 15:00 EST -- I also ran into this problem with Unidata's ldm. I tried 2.4 and 2.6 kernels, every flavor of OS. The upstream ldm server would serve data for awhile, and then the child process' would report as "defunct" and the downstream servers couldn't reconnect, stopping all ldm communication. When doing an strace on the pid I see the futex wait. Sometimes I could reproduce it in a day, other times it would take weeks, but always the same result. I put in the ld_assume_kernel env. variable and it seems much more stable. -- Additional comment from andy.fletcher on 2005-11-01 15:38 EST -- I was seeing the problem with Apache+PHP. I disabled nptl and updated everything to the very latest packages from RHN and this appears to have solved the problem, at least for me and at least for the moment (1 week+ and counting). -- Additional comment from emmerson on 2005-11-02 12:14 EST -- (In reply to comment #9) I wish to expand on Sarah Thompson's information. The Unidata LDM is coded to to the _XOPEN_SOURCE=500 standard and makes extensive use of read and write locking (via fcntl()) of an mmap()ed file. Fedora is the only known operating-system on which the LDM exhibits this problem. Regards, Steve Emmerson LDM Developer University Corporation for Atmospheric Research -- Additional comment from jsquires on 2005-11-08 11:49 EST -- I find this problem any time I run any version of java (tried a variety of 1.4.2 32-bit versions as well as 1.5.0_03 64-bit) in cron or batch; strangely, java works fine interactively. This is on AS3 update 6, kernel 2.4.21-37EL on x86_64. For example, to reproduce: jsquires@bobo:~$ batch at> /opt/jdk1.4.2_09/bin/java -version at> <EOT> job 4 at 2005-11-08 14:15 <wait> jsquires@bobo:~$ ps x | grep java 4035 ? SN 0:00 /opt/jdk1.4.2_09/bin/java -version jsquires@bobo:~$ strace -p 4035 Process 4035 attached - interrupt to quit [ Process PID=4035 runs in 32 bit mode. ] futex(0x80a9b84, FUTEX_WAIT, 1, NULL <unfinished ...> Process 4035 detached Using LD_ASSUME_KERNEL=2.4.1 fixes this for me, but it would be nice to find out what the root of the problem is. -- Additional comment from tao on 2006-06-06 14:17 EST -- Follow on to yesterday's call... when we originally opened this case, we had a reproducer. Turned out to be poor programming (malloc in a signal handler) that was causing the problem so we closed it. Problem is that we're now hitting this w/ Veritas netbackup and the signature looks the same (but, of course, we can't confirm w/o code). perl5.6 -le 'print $$; $SIG{CHLD}="ignore"; system "false"' Hangs about 1 in 10 times (the typo, using ignore rather than IGNORE is intentional) becuase 5.6 tries to alloc a string for the error message in the signal handler. 5.8 does not do this. Could probably hack up a C example pretty quickly if you have a hard time w/ this example. This event sent from IssueTracker by streeter issue 81271 -- Additional comment from tao on 2006-06-06 14:17 EST -- File uploaded: sysrq.out This event sent from IssueTracker by streeter issue 81271 it_file 62490 -- Additional comment from tao on 2006-06-06 14:17 EST -- Attached, the perl backtrace is as follows (including the zombie which inevitably occurs): perl S 00000001 3620 1448 1388 1449 (NOTLB) Call Trace: [<c013562c>] schedule_timeout [kernel] 0xbc (0xdbcbfe58) [<c013d910>] futex_wait [kernel] 0x390 (0xdbcbfe90) [<c013cca0>] futex_vcache_callback [kernel] 0x0 (0xdbcbfea0) [<c0144668>] handle_mm_fault [kernel] 0x288 (0xdbcbfeb8) [<c013cca0>] futex_vcache_callback [kernel] 0x0 (0xdbcbfef4) [<c010c01e>] do_signal [kernel] 0x8e (0xdbcbff20) [<c013ceb7>] do_futex [kernel] 0x157 (0xdbcbff58) [<c0128137>] __mmdrop [kernel] 0x47 (0xdbcbff6c) [<c013cf99>] sys_futex [kernel] 0xb9 (0xdbcbff88) [<c02af06f>] no_timing [kernel] 0x7 (0xdbcbffc0) false Z DBCBE000 8 1449 1448 (L-TLB) Call Trace: [<c012dc20>] do_exit [kernel] 0x370 (0xdbaedf90) [<c012ddcb>] do_group_exit [kernel] 0x8b (0xdbaedfac) [<c02af06f>] no_timing [kernel] 0x7 (0xdbaedfc0) Internal Status set to 'Waiting on SEG' This event sent from IssueTracker by streeter issue 81271 -- Additional comment from streeter on 2006-06-06 14:20 EST -- customer has a simple perl script reproducer in comment 14. sysrq-t while hung attached in comment 15. -- Additional comment from staubach on 2006-06-16 15:54 EST -- This is not an NFS bug. It is something to do with the futex support. I also have a reproducer too, if another one would be desired. -- Additional comment from emmerson on 2007-09-12 15:36 EST -- Prarit, I can't speak for Mark, but at least one LDM user has recently encountered this bug. I've given him this URL and asked if I can post his email address. Steve Emmerson -- Additional comment from fhirtz on 2007-09-12 15:41 EST -- We have a reproduction case which shows it on RHEL3 and RHEL4 (though less frequent there), so it's still existent. -- Additional comment from emmerson on 2007-09-12 19:06 EST -- Prarit, James M. Pelagatti <jamie.edu> is the one who encountered this bug. You may contact him. He doesn't encounter it often. Here's his "uname -a": Linux llwxldm2 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux --Steve Emmerson
Several different issues are mixed together in this BZ and related ITs. The original report was about file locking over NFS (using NLM). In some cases it takes tens of seconds to obtain the lock. It could be caused by a race condition where NLM_GRANTED_MSG from the server arrives before the client prepares for it. That would be bug 432855. A patched kernel for testing is available from Jeff Layton's page: http://people.redhat.com/jlayton/ Could someone with the long NFS locking delays test it? Then we have some futex-related reports here, e.g. the hangs with LDM. This could be explained by bug 217067 which was closed in November. Could an LDM user retest the futex problem with a current RHEL4 kernel? I've been working on the NFS locking bug before bug 432855 came to my attention today. During that, I wrote a testcase for NLM which shows another bug. When a client waiting for NLM_GRANTED_MSG is interrupted by a signal, the file stays locked forever. The bug is present in RHEL5 and current upstream. I suspect this code in do_setlk() is harmful: if (status == -EINTR || status == -ERESTARTSYS) { do_vfs_lock(filp, fl); }
The patch for bug 432855 is now present in current RHEL4 development builds: http://people.redhat.com/vgoyal/rhel4/ It should fix the 30 seconds NFS locking delays. Please test. The futex hang is something else completely and should have been in a separate BZ. Possibly fixed already in kernel-2.6.9-67.EL. I'm discussing the problem of files remaining locked forever on NFS on upstream mailing list. This will be another BZ. Devel NAK: this bug is either a duplicate of 432855 or 217067 (then it doesn't really matter) or it's not (then I won't be able to provide a fix in 4.7).
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.