Bug 166292

Summary:	NFS client, NLM, java 1.4.2_08, readlock hang on futex
Product:	Red Hat Enterprise Linux 3	Reporter:	mark meierjohann <mark>
Component:	kernel	Assignee:	Red Hat Kernel Manager <kernel-mgr>
Status:	CLOSED WONTFIX	QA Contact:	Brian Brock <bbrock>
Severity:	high	Docs Contact:
Priority:	medium
Version:	3.0	CC:	andy.fletcher, emmerson.steven, jsquires, mingo, petrides, sarah.e.thompson, sputhenp, steved, supermike, terryj
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2007-10-19 18:55:47 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	170417, 345301

Description mark meierjohann 2005-08-18 19:57:40 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050721 CentOS/1.0.6-1.4.1.centos4 Firefox/1.0.6

Description of problem:
A java based program heavily using file locking over NFS (NFS Server is RH AS 2.1, Veritas Cluster 2.2) hangs regularly (not allways) during nfs read and/or write locks. 

The NFS Client (udp) is running RedHat EL3, kernel-smp-2.4.21-32.0.1.EL nfs-utils-1.0.6-33EL, iptables (no rejects in log, all rejects logged), SUN j2re1.4.2_08 (reproducable with IBM JRE, from RedHat EL Extras), bonding-1.0.4t-1. 
Hardware: HP Proliant DL580G3, Intel e100 dual port and e1000 dual port (e1000-6.0.60-1 e100-3.1.4q-1)


A strace of the java software shows that the futex call sometimes just hangs for 10sec, 20sec, 30secs or even up to 60secs. The higher the nfs-load the more often it hangs. 

  13:59:36 gettimeofday({1124366376, 393387}, NULL) = 0
  13:59:36 futex(0x807dbc4, FUTEX_WAIT, 5203, {0, 544848000}) = -1 ETIMEDOUT   
    (Connection timed out)
  13:59:36 futex(0x805bf14, FUTEX_WAKE, 1) = 0
  13:59:36 gettimeofday({1124366376, 952476}, NULL) = 0
> 13:59:36 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0    
  13:59:46 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0
  13:59:46 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0
  13:59:46 gettimeofday({1124366386, 977241}, NULL) = 0

During the hanging state only the following part of the marked line is displayed: 
> 13:59:36 futex(0x809248c, FUTEX_WAIT, 2, NULL

A simultaneous tcpdump on the NFS client shows that _all_ NLM Lock requests are answered within max. 300msecs, typically within 1-5 msecs. Thus, the nfs server does not seem to be the problem.

The NFS mount options are (rw,rsize=8192,wsize=8192,hard,intr,addr=10.x.x.x).

The problem occurs more often if there is more traffic. It can be reproduced on two seperate, but equivalent environments (pre-live and live).

Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-32.0.1.EL nfs-utils-1.0.6-33EL

How reproducible:
Sometimes

Steps to Reproduce:
1. heavily use file locking from two nfs clients to the same nfs server, on the same filesystem
2. measure the time required for a lock,
3. strace your program
  

Actual Results:  as above

Additional info:

Test program output
[      7ms] Read: 154854:nodeB - the time is: Thu Aug 18 13:59:35 CEST 2005
[     19ms] Aquired write-lock.
[     12ms] Released lock.
[    150ms] written line.
[     87ms] ReadLock
[  10365ms] Read: 154856:nodeB - the time is: Thu Aug 18 13:59:36 CEST 2005
[    169ms] ReadLock
[    184ms] Read: 154857:nodeB - the time is: Thu Aug 18 13:59:46 CEST 2005
[      4ms] Aquired write-lock.
[      8ms] Released lock.
[  10038ms] written line.
[    120ms] ReadLock
[    129ms] Read: 154858:nodeB - the time is: Thu Aug 18 13:59:46 CEST 2005
[    176ms] ReadLock
[    188ms] Read: 154859:nodeB - the time is: Thu Aug 18 13:59:47 CEST 2005
[      9ms] Aquired write-lock.
[     21ms] Released lock.
[     32ms] written line.
[      6ms] ReadLock
[     11ms] Read: 154860:nodeB - the time is: Thu Aug 18 13:59:47 CEST 2005

strace:
13:59:36 gettimeofday({1124366376, 393235}, NULL) = 0
13:59:36 clock_gettime(0, 0xbfffbbac)   = -1 ENOSYS (Function not implemented)
13:59:36 gettimeofday({1124366376, 393387}, NULL) = 0
13:59:36 futex(0x807dbc4, FUTEX_WAIT, 5203, {0, 544848000}) = -1 ETIMEDOUT (Conn
ection timed out)
13:59:36 futex(0x805bf14, FUTEX_WAKE, 1) = 0
13:59:36 gettimeofday({1124366376, 952476}, NULL) = 0
13:59:36 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0
13:59:46 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0
13:59:46 futex(0x809248c, FUTEX_WAIT, 2, NULL) = 0
13:59:46 gettimeofday({1124366386, 977241}, NULL) = 0
13:59:46 _llseek(3, 0, [8870654], SEEK_CUR) = 0
13:59:46 _llseek(3, 0, [8870654], SEEK_END) = 0
13:59:46 _llseek(3, 8870654, [8870654], SEEK_SET) = 0
13:59:46 fcntl64(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=8870654, len
=2138612993}) = 0

Comment 1 mike 2005-09-12 16:31:54 UTC

this may be related to 141282, I'm seeing this problem still in kernel
2.4.21-32.0.1.ELsmp
http://seer.support.veritas.com/docs/277241.htm

Process 3759 attached - interrupt to quit
futex(0xb65a5844, FUTEX_WAIT, 2, NULL

and also something similar to the above, on a different server with a java
thread (no nfs involved)

--- SIGUSR2 (User defined signal 2) @ 0 (0) ---
<... rt_sigsuspend resumed> )           = -1 EINTR (Interrupted system call)
rt_sigreturn(0x791fe28c)                = -1 EINTR (Interrupted system call)
rt_sigreturn(0x800)                     = 1098
--- SIGILL (Illegal instruction) @ 0 (0) ---
rt_sigreturn(0x800)                     = 1098
futex(0x805b174, FUTEX_WAIT, 2, NULL)   = 0
futex(0x88b309c, 0x4 /* FUTEX_??? */, 1) = 0
futex(0x805b174, FUTEX_WAKE, 1)         = 1
futex(0x805b1d4, FUTEX_WAIT, 2, NULL)   = 0
futex(0x805b1d4, FUTEX_WAKE, 1)         = 0
--- SIGUSR2 (User defined signal 2) @ 0 (0) ---
rt_sigprocmask(SIG_BLOCK, NULL, [QUIT USR2], 8) = 0
rt_sigsuspend([QUIT] <unfinished ...>
--- SIGUSR2 (User defined signal 2) @ 0 (0) ---
<... rt_sigsuspend resumed> )           = -1 EINTR (Interrupted system call)
rt_sigreturn(0x791fe28c)                = -1 EINTR (Interrupted system call)
rt_sigreturn(0x800)                     = 1098
--- SIGILL (Illegal instruction) @ 0 (0) ---
rt_sigreturn(0x800)                     = 1098
futex(0x805b174, FUTEX_WAIT, 2, NULL)   = 0
futex(0x805b174, FUTEX_WAKE, 1)         = 1
futex(0x805b1d4, FUTEX_WAIT, 2, NULL)   = 0
futex(0x805b1d4, FUTEX_WAKE, 1)         = 0

Comment 2 mike 2005-09-14 19:01:20 UTC

also please see Bug 153722 and Bug 145021
appears the same issue is in the 2.6 kernel, as the problem has been reported in
FC4, I've just seen it on 2.6.12-1.1447_FC4; range of apps now reported with
same issue include veritas netbackup, java, up2date, yum, evolution

Comment 3 mike 2005-10-04 16:27:30 UTC

Just found another case where I can reproduce it, this time on the latest kernel
from update6 2.4.21-37.ELsmp
 
running the java installer for oracle 9i:
 
/tmp/OraInstall2005-10-04_09-10-14AM/jre/bin/i386/native_threads/java
 
Process 5649 attached - interrupt to quit
futex(0xa2b6bbac, FUTEX_WAIT, 1211, NULL
 
"out of the box" install for oracle9i on unmodified rhel3 update6 

once again I got around this by reverting to linuxthreads (LD_ASSUME_KERNEL=2.4.19)
it works fine when not using nptl.

Comment 5 mike 2005-10-07 15:34:49 UTC

I ran into this issue again with oracle application server 10g (9.0.4) on rhel3u6
2.4.21-37.ELsmp on 3 servers, all default installs of os and oas.  Java process
hangs on futex_wait; the version shipped with oracle is:
java version "1.4.2_02"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_02-b03)
Java HotSpot(TM) Client VM (build 1.4.2_02-b03, mixed mode)
 
It happens exactly the same way each time, on 3 separate servers and I can
reproduce it at will.

Comment 6 mark meierjohann 2005-10-18 14:08:50 UTC

We have meanwhile switched to NFS over tcp and are seeing the same problem. This
bug seems to be somewhere in the backports of the 2.6 features to the 2.4 kernel
(nptl), etc.

We will now test a plain vanilla kernel.org kernel and see if that helps us.

Since this is a large enterprise customer we will have to re-open the calls we
had open, before.

Comment 7 Andy Fletcher 2005-10-22 00:16:48 UTC

Did you get anywhere with the vanilla kernel?

Comment 9 Sarah Thompson 2005-11-01 20:00:12 UTC

I also ran into this problem with Unidata's ldm.  I tried 2.4 and 2.6 kernels,
every flavor of OS.  The upstream ldm server would serve data for awhile, and
then the child process' would report as "defunct" and the downstream servers
couldn't reconnect, stopping all ldm communication.  When doing an strace on the
pid I see the futex wait.  Sometimes I could reproduce it in a day, other times
it would take weeks, but always the same result.  I put in the ld_assume_kernel
env. variable and it seems much more stable.

Comment 10 Sarah Thompson 2005-11-01 20:00:25 UTC

I also ran into this problem with Unidata's ldm.  I tried 2.4 and 2.6 kernels,
every flavor of OS.  The upstream ldm server would serve data for awhile, and
then the child process' would report as "defunct" and the downstream servers
couldn't reconnect, stopping all ldm communication.  When doing an strace on the
pid I see the futex wait.  Sometimes I could reproduce it in a day, other times
it would take weeks, but always the same result.  I put in the ld_assume_kernel
env. variable and it seems much more stable.

Comment 11 Andy Fletcher 2005-11-01 20:38:05 UTC

I was seeing the problem with Apache+PHP. I disabled nptl and updated everything
to the very latest packages from RHN and this appears to have solved the
problem, at least for me and at least for the moment (1 week+ and counting).

Comment 12 Steven Emmerson 2005-11-02 17:14:29 UTC

(In reply to comment #9)

I wish to expand on Sarah Thompson's information.  The Unidata LDM is coded to
to the _XOPEN_SOURCE=500 standard and makes extensive use of read and write
locking (via fcntl()) of an mmap()ed file.  Fedora is the only known
operating-system on which the LDM exhibits this problem.

Regards,
Steve Emmerson
LDM Developer
University Corporation for Atmospheric Research

Comment 13 Julian Squires 2005-11-08 16:49:44 UTC

I find this problem any time I run any version of java (tried a variety of 1.4.2
32-bit versions as well as 1.5.0_03 64-bit) in cron or batch; strangely, java
works fine interactively.  This is on AS3 update 6, kernel 2.4.21-37EL on x86_64.

For example, to reproduce:

jsquires@bobo:~$ batch
at> /opt/jdk1.4.2_09/bin/java -version
at> <EOT>
job 4 at 2005-11-08 14:15
<wait>
jsquires@bobo:~$ ps x | grep java
 4035 ?        SN     0:00 /opt/jdk1.4.2_09/bin/java -version
jsquires@bobo:~$ strace -p 4035
Process 4035 attached - interrupt to quit
[ Process PID=4035 runs in 32 bit mode. ]
futex(0x80a9b84, FUTEX_WAIT, 1, NULL <unfinished ...>
Process 4035 detached

Using LD_ASSUME_KERNEL=2.4.1 fixes this for me, but it would be nice to find out
what the root of the problem is.

Comment 22 Prarit Bhargava 2007-09-12 17:59:51 UTC

Mark,

Is this still an issue?

P.

Comment 23 Steven Emmerson 2007-09-12 19:36:24 UTC

Prarit,

I can't speak for Mark, but at least one LDM user has recently encountered this
bug.  I've given him this URL and asked if I can post his email address.

Steve Emmerson

Comment 25 Steven Emmerson 2007-09-12 23:06:14 UTC

Prarit,

James M. Pelagatti <jamie.edu> is the one who encountered this bug.  You
may contact him.  He doesn't encounter it often.  Here's his "uname -a":

 Linux llwxldm2 2.6.9-55.ELsmp #1 SMP Fri Apr 20 16:36:54 EDT 2007 x86_64
x86_64 x86_64 GNU/Linux

--Steve Emmerson

Comment 26 RHEL Program Management 2007-10-19 18:55:47 UTC

This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet that criteria, it is now being closed.
 
For more information of the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.