+++ This bug was initially created as a clone of Bug #183565 +++

Description of problem:
A process takes fcntl() locks on an NFS-mounted filesystem. The process is killed or terminates abnormally. The lock is not released and remains on the NFS server, associated with the now-dead PID.

Version-Release number of selected component (if applicable):
RHEL3 U6 - 2.4.21-37.ELsmp

How reproducible:
With the attached code it can be reliably reproduced in our environment within 1-2 tries.

Steps to Reproduce:
1. Compile ldisp1.c and lockit.c from the attached tarball, and place the compiled binaries and locktest.ksh in an NFS-mounted directory. In our case, the NFS server is a NetApp filer. Mount options are: rw,nosuid,v3,rsize=8192,wsize=8192,hard,intr,tcp,lock,noacl,noac
2. Run locktest.ksh in the same directory. Let it run for a few seconds, then interrupt it with ctrl-c.
3. Run ldisp on the resulting LTEST file.

Actual results:
Over half the time, ldisp will report that a lock is still held on the file, associated with a dead PID.

Expected results:
It should not be possible for a process to die without releasing locks that Linux is aware of.

Additional info:
Additional data in IT 89052

-- Additional comment from imcleod on 2006-03-01 19:34 EST --
Created an attachment (id=125508)
Source for duplicator

-- Additional comment from steve.russell on 2006-03-02 12:25 EST --
Minor correction: the mount option is 'nointr' rather than 'intr'.

-- Additional comment from steved on 2006-03-16 13:46 EST --
Created an attachment (id=126239)
proposed patch

It appears the mechanism which enqueues interrupted lock requests fails to *always* queue up an unlock request. This patch should ensure that all interrupted lock requests are immediately followed by an async unlock request.

-- Additional comment from imcleod on 2006-03-23 09:21 EST --
We've tested the attached patch in our environment and can still generate stale locks with our duplication script.
Tcpdump and process output can be found in the associated issue tracker ticket.

-- Additional comment from petrides on 2006-04-06 23:12 EST --
A fix for this problem has just been committed to the RHEL3 U8 patch pool this evening (in kernel version 2.4.21-40.6.EL).

-- Additional comment from jlayton on 2006-06-05 10:08 EST --
Some small issues with the reproducer. First, "v3" isn't a valid option, so my option string looks like this:

rw,nosuid,nfsvers=3,rsize=8192,wsize=8192,hard,nointr,tcp,lock,noacl,noac

Also, there were a number of errors in the ldisp1 reproducer program. I'll attach an updated version here in a bit.

-- Additional comment from jlayton on 2006-06-05 10:11 EST --
Created an attachment (id=130502)
updated ldisp1 program -- builds with fewer warnings with -Wall now

-- Additional comment from jlayton on 2006-06-05 10:33 EST --
The other catch is that simply hitting ^C while running the locktest.ksh script does not seem to be sufficient. From looking over the code, I presume what they really want is this:

1) run locktest.ksh
2) wait until the ldisp1 output actually starts
3) pkill lockit2

That should kill off all of the lockit2 processes, and ldisp should start just printing out lines with '...' with nothing between them. I was able to reproduce the issue after a few attempts with the above method:

...
Lock:1 - Offset 100, Len 1, Pid 1438
...

# ps -ef | grep 1438
root 1452 1031 0 10:41 pts/0 00:00:00 grep 1438
# uname -a
Linux tatooine.rdu.redhat.com 2.4.21-43.ELsmp #1 SMP Sun May 14 17:25:07 EDT 2006 i686 i686 i386 GNU/Linux

-- Additional comment from jlayton on 2006-06-05 11:23 EST --
Building a kernel out of a recent git pull now to see if the problem still exists there.

-- Additional comment from jlayton on 2006-06-05 12:39 EST --
Upstream kernel's still building, but in looking over the code, I see that locks_unlock_delete calls f_op->lock but doesn't check the return code.
Not sure if this is significant, but I may stick some debug code in there to print something if it's not 0, and try to reproduce the problem.

-- Additional comment from jlayton on 2006-06-05 13:26 EST --
This bug exists upstream as well -- at least as late as 2.4.32. I've not tested later. Going to see now about building some test kernels to see if we can trap what's actually happening.

-- Additional comment from jlayton on 2006-06-05 14:00 EST --
I instrumented that piece and it apparently still returns zero there, so that's not it. Going to drill down farther into the RPC layer...

-- Additional comment from jlayton on 2006-06-06 08:05 EST --
Did some network captures yesterday. It's pretty clear that at some point we're locking the file but never unlocking it. So this is definitely a client-side problem (but we were already pretty sure of that). I then modified the nfs_lock dprintk to show the cmd and PID, kicked off the test, and then did this in another shell:

# echo 32767 > /proc/sys/sunrpc/nfs_debug ; pkill lockit2; sleep 1; echo 0 > /proc/sys/sunrpc/nfs_debug

I then saw this lock being held:

...
Lock:1 - Offset 200, Len 1, Pid 3026
...
confirmed on the server:

[root@barsoom ~]# cat /proc/locks
1: POSIX ADVISORY WRITE 3026 fd:04:507952 200 200

Log messages for this PID:

# grep 3026 /mnt/barsoom/89052/nfslock1.log
Jun 6 08:07:31 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=3026)
Jun 6 08:07:31 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=5, pid=3026)
Jun 6 08:07:32 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=6, pid=3026)
Jun 6 08:07:32 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=5, pid=3026)
Jun 6 08:07:32 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=3026)
Jun 6 08:07:32 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=5, pid=3026)
Jun 6 08:07:32 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=6, pid=3026)
Jun 6 08:07:32 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=5, pid=3026)
Jun 6 08:07:33 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=3026)
Jun 6 08:07:33 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=5, pid=3026)
Jun 6 08:07:33 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=6, pid=3026)
Jun 6 08:07:33 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=5, pid=3026)
Jun 6 08:07:33 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=3026)
Jun 6 08:07:33 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=5, pid=3026)
Jun 6 08:07:33 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=6, pid=3026)
Jun 6 08:07:34 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=5, pid=3026)
Jun 6 08:07:34 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=3026)
Jun 6 08:07:34 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=5, pid=3026)
Jun 6 08:07:34 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=6, pid=3026)
Jun 6 08:07:34 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=5, pid=3026)
Jun 6 08:07:34 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=3026)
Jun 6 08:07:34 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=5, pid=3026)
Jun 6 08:07:35 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=6, pid=3026)
Jun 6 08:07:35 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=2, fl=1, r=100:100, cmd=6, pid=3026)

-- Additional comment from jlayton on 2006-06-06 08:15 EST --
For reference:

cmd:
  F_GETLK == 5
  F_SETLK == 6

fl == fl->flags:
  FL_POSIX == 1

t (type):
  F_WRLCK == 1
  F_UNLCK == 2

So it looks like the last log message here was an unlock command for the range that's still locked. The nfs_lock function is getting called, but the operation isn't making it out to the wire for some reason.

-- Additional comment from jlayton on 2006-06-06 09:02 EST --
Actually, I may have been wrong before.
I did another test and got 2 stale locks:

1: POSIX ADVISORY WRITE 3479 fd:04:507952 100 100
2: POSIX ADVISORY WRITE 3472 fd:04:507952 200 200

I also added some sleeps before and after the pkill command to allow the logs to catch up:

Jun 6 09:03:10 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=2, fl=1, r=200:200, cmd=6, pid=3472)
Jun 6 09:03:10 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=6, pid=3472)
Jun 6 09:03:10 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=5, pid=3472)
Jun 6 09:03:10 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=3472)
Jun 6 09:03:10 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=5, pid=3472)
Jun 6 09:03:10 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=6, pid=3472)
Jun 6 09:03:10 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=5, pid=3472)
-------------------
Jun 6 09:03:10 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=2, fl=1, r=100:100, cmd=6, pid=3479)
Jun 6 09:03:10 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=3479)
Jun 6 09:03:10 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=2, fl=1, r=200:200, cmd=6, pid=3479)
Jun 6 09:03:10 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=6, pid=3479)
Jun 6 09:03:10 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=5, pid=3479)
Jun 6 09:03:10 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=3479)
Jun 6 09:03:10 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=5, pid=3479)
Jun 6 09:03:10 tatooine kernel: NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=100:100, cmd=6, pid=3479)
----------------------

So the last F_SETLK commands here for the ranges with the stale locks are lock commands. In this case, it looks like we attempted to lock the file but the unlock attempt never occurred.
Unfortunately, I don't have info here about whether the lock attempts were successful. Going to see about instrumenting that now...

-- Additional comment from jlayton on 2006-06-06 09:29 EST --
Modified nfs_lock so that the dprintk shows the return code.

Here we get a successful lock:
Jun 6 09:33:36 tatooine kernel: NFS: nfs_lock(f= 9/507955, t=1, fl=1, r=200:200, cmd=6, pid=5866) = 0

...and a successful unlock:
Jun 6 09:33:36 tatooine kernel: NFS: nfs_lock(f= 9/507955, t=2, fl=1, r=200:200, cmd=6, pid=5866) = 0

A lock attempt with an EAGAIN:
Jun 6 09:33:36 tatooine kernel: NFS: nfs_lock(f= 9/507955, t=1, fl=1, r=100:100, cmd=6, pid=5866) = -11

A GETLK which shows it as clear:
Jun 6 09:33:37 tatooine kernel: NFS: nfs_lock(f= 9/507955, t=1, fl=1, r=100:100, cmd=5, pid=5866) = 0

A successful lock attempt:
Jun 6 09:33:37 tatooine kernel: NFS: nfs_lock(f= 9/507955, t=1, fl=1, r=200:200, cmd=6, pid=5866) = 0

Then this strange beast: a GETLK, but with type=2 (F_UNLCK). Does this mean it's attempting to see if it can successfully unlock the file?
Jun 6 09:33:37 tatooine kernel: NFS: nfs_lock(f= 9/507955, t=2, fl=1, r=200:200, cmd=5, pid=5866) = 0

That's the last command for this PID.

-- Additional comment from jlayton on 2006-06-06 12:13 EST --
Ok, instrumented locks_unlock_delete to show when it was being called and by what PID. After pkilling all lockit2's I got this repeated:

Lock:1 - Offset 200, Len 1, Pid 2023

but the debug message is from a different PID. So again, for some reason we're not even getting this far when freeing all of a process's locks:

locks_unlock_delete: 2018 releasing 9/507952, r=100:100

-- Additional comment from jlayton on 2006-06-06 13:34 EST --
Added some debug code to locks_remove_posix.
This time, PID 1184 had the stale lock:

locks_remove_posix: pid=1172 fl_flags=1, fl_owner=d16df920, owner=d16df920
locks_remove_posix: pid=1185 fl_flags=1, fl_owner=d38e68c0, owner=d1120e20
locks_remove_posix: pid=1185 fl_flags=1, fl_owner=d16df5e0, owner=d1120e20
locks_remove_posix: pid=1187 fl_flags=1, fl_owner=d38e68c0, owner=d16dfac0
locks_remove_posix: pid=1187 fl_flags=1, fl_owner=d16df5e0, owner=d16dfac0
locks_remove_posix: pid=1184 fl_flags=1, fl_owner=d38e68c0, owner=d16df100
locks_remove_posix: pid=1184 fl_flags=1, fl_owner=d16df5e0, owner=d16df100
locks_remove_posix: pid=1178 fl_flags=1, fl_owner=d38e68c0, owner=d16df5e0
locks_remove_posix: pid=1181 fl_flags=1, fl_owner=d38e68c0, owner=d16df780
locks_remove_posix: pid=1190 fl_flags=1, fl_owner=d38e68c0, owner=d1120ae0
locks_remove_posix: pid=1180 fl_flags=1, fl_owner=d38e68c0, owner=d16df2a0
locks_remove_posix: pid=1186 fl_flags=1, fl_owner=d38e68c0, owner=d1120c80
locks_remove_posix: pid=1188 fl_flags=1, fl_owner=d38e68c0, owner=d16dfc60

The issue here seems to be that fl->fl_owner and owner do not match. This is preventing locks_unlock_delete from getting called. I'm not clear yet on where these values ultimately come from.

-- Additional comment from jlayton on 2006-06-06 14:39 EST --
This time, the stale lock was this one:

1: POSIX ADVISORY WRITE 1224 fd:04:507952 200 200

Here are the debug messages from that period (some others precede them). So we don't match this particular lock here:

locks_remove_posix: pid=1224 fl_flags=1, fl_owner=d3ffb520, owner=d3ffb6c0
lock_remove_posix: non-matching lock 1224 9/507952, r=100:100
locks_remove_posix: pid=1215 fl_flags=1, fl_owner=d3ffb520, owner=d359e6e0
lock_remove_posix: non-matching lock 1215 9/507952, r=100:100
locks_remove_posix: pid=1223 fl_flags=1, fl_owner=d3ffb520, owner=d1265600
lock_remove_posix: non-matching lock 1223 9/507952, r=100:100

But a little later, the fl_owner for this lock range changes? Maybe it was dropped and reallocated?
locks_remove_posix: pid=1216 fl_flags=1, fl_owner=d1592440, owner=d1265c80
lock_remove_posix: non-matching lock 1216 9/507952, r=100:100
locks_remove_posix: pid=1219 fl_flags=1, fl_owner=d1592440, owner=d1592440
locks_unlock_delete: 1219 releasing 9/507952, r=100:100

Perhaps we're hitting a race where we allocate and get granted a lock just after the lock lists for the process have been cleaned out?

-- Additional comment from jlayton on 2006-06-06 16:33 EST --
Even more strange: it almost looks like the NFS locking is allowing collisions here. Here are a number of nfs_lock commands and their results for PIDs 1314 and 1318:

# dmesg -s 131072 | grep 200:200 | grep 'cmd=6' | egrep '(1314|1318)'
NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=1314) = 0
NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=1318) = 0
NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=1318) = 0
NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=1318) = 0
NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=1318) = 0
NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=1318) = 0
NFS: nfs_lock(f= 9/507952, t=2, fl=1, r=200:200, cmd=6, pid=1314) = 0

Early on, a couple of other processes are also contending for the locks, but toward the end, neither of these processes is ever denied a SETLK. 1314 gets the lock, but then 1318 is also granted it, and 1318 never releases it. My understanding is that the locks in these programs are exclusive, so it doesn't seem like this should be allowed...

-- Additional comment from petrides on 2006-06-06 16:55 EST --
SteveD, should I put this bug into FAILS_QA state or bounce it back into ASSIGNED state? It looks like the problem isn't fully resolved.

-- Additional comment from jlayton on 2006-06-14 12:56 EST --
I've not been able to reproduce this on FC5 as of yet, but the only FC5 box I have at the moment is UP, so that could also be a factor.
Going to see if I can reproduce this on RHEL3/RHEL4 with UP kernels.

-- Additional comment from jlayton on 2006-06-14 13:48 EST --
I was still able to reproduce this on a RHEL3 UP kernel; going to check FC5 SMP in a bit.

-- Additional comment from jlayton on 2006-06-14 14:05 EST --
After many tries, I was also able to reproduce this on FC5 SMP:

Linux dantu.rdu.redhat.com 2.6.16-1.2122_FC5 #1 SMP Sun May 21 15:01:10 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

My guess is that it may also be reproducible on FC5 UP, but it might just take a bit longer (I'll try again to see if I can make it happen there too). So this bug also probably exists upstream, at least as of fairly recently.

-- Additional comment from agud on 2006-06-14 18:37 EST --
I tried tracing calls to nlmclnt_proc from do_setlk. The test shows:

Lock:1 - Offset 100, Len 1, Pid 28066
...

Modified fs/nfs/file.c:nfs_lock():

grep "nfs_lock" /var/log/messages | grep 28066
Jun 14 17:42:55 pro1 kernel: NFS: nfs_lock(f= a/10506580, t=1, fl=1, r=100:100, cmd=5, pid=28066)
Jun 14 17:42:56 pro1 kernel: NFS: nfs_lock(f= a/10506580, t=1, fl=1, r=200:200, cmd=6, pid=28066)
Jun 14 17:42:57 pro1 kernel: NFS: nfs_lock(f= a/10506580, t=1, fl=1, r=200:200, cmd=5, pid=28066)
Jun 14 17:42:59 pro1 kernel: NFS: nfs_lock(f= a/10506580, t=1, fl=1, r=100:100, cmd=6, pid=28066)
Jun 14 17:43:00 pro1 kernel: NFS: nfs_lock(f= a/10506580, t=1, fl=1, r=100:100, cmd=5, pid=28066)
Jun 14 17:43:02 pro1 kernel: NFS: nfs_lock(f= a/10506580, t=1, fl=1, r=200:200, cmd=6, pid=28066)
Jun 14 17:43:03 pro1 kernel: NFS: nfs_lock(f= a/10506580, t=2, fl=1, r=200:200, cmd=6, pid=28066)
Jun 14 17:43:04 pro1 kernel: NFS: nfs_lock(f= a/10506580, t=1, fl=1, r=100:100, cmd=6, pid=28066)
Jun 14 17:43:05 pro1 kernel: NFS: nfs_lock(f= a/10506580, t=1, fl=1, r=100:100, cmd=5, pid=28066)
Jun 14 17:43:06 pro1 kernel: NFS: nfs_lock(f= a/10506580, t=1, fl=1, r=200:200, cmd=6, pid=28066)
Jun 14 17:43:07 pro1 kernel: NFS: nfs_lock(f= a/10506580, t=1, fl=1, r=200:200, cmd=5, pid=28066)

And modified fs/lockd/clntproc.c:nlmclnt_proc():

grep "clnt proc returns" /var/log/messages | grep 28066
Jun 14 17:42:55 pro1 kernel: lockd: clnt proc returns -11, cmd=6, pid=28066
Jun 14 17:42:57 pro1 kernel: lockd: clnt proc returns -11, cmd=6, pid=28066
Jun 14 17:43:00 pro1 kernel: lockd: clnt proc returns -11, cmd=6, pid=28066
Jun 14 17:43:03 pro1 kernel: lockd: clnt proc returns 0, cmd=6, pid=28066
Jun 14 17:43:04 pro1 kernel: lockd: clnt proc returns 0, cmd=6, pid=28066
Jun 14 17:43:05 pro1 kernel: lockd: clnt proc returns 0, cmd=6, pid=28066
Jun 14 17:43:06 pro1 kernel: lockd: clnt proc returns 0, cmd=5, pid=28066
Jun 14 17:43:06 pro1 kernel: lockd: clnt proc returns 0, cmd=5, pid=28066
Jun 14 17:43:06 pro1 kernel: lockd: clnt proc returns 0, cmd=5, pid=28066
Jun 14 17:43:07 pro1 kernel: lockd: clnt proc returns 0, cmd=5, pid=28066
Jun 14 17:43:07 pro1 kernel: lockd: clnt proc returns -11, cmd=6, pid=28066
Jun 14 17:43:08 pro1 kernel: lockd: clnt proc returns 0, cmd=5, pid=28066
Jun 14 17:43:08 pro1 kernel: lockd: clnt proc returns 0, cmd=5, pid=28066
Jun 14 17:43:08 pro1 kernel: lockd: clnt proc returns 0, cmd=5, pid=28066
Jun 14 17:43:08 pro1 kernel: lockd: clnt proc returns 0, cmd=5, pid=28066
Jun 14 17:43:08 pro1 kernel: lockd: clnt proc returns 0, cmd=5, pid=28066
Jun 14 17:43:09 pro1 kernel: lockd: clnt proc returns 0, cmd=5, pid=28066
Jun 14 17:43:10 pro1 kernel: lockd: clnt proc returns 0, cmd=5, pid=28066

Here, after process 28066 gets the lock on 100:100, it releases the lock. Later it tries to get the lock on 200:200. But there are no unlock attempts whatsoever. I'm putting some debug statements right at locks_remove_posix to see when it actually initiates the unlocking, and also trying to trace where the values in the owner field come from.

-- Additional comment from petrides on 2006-06-14 18:55 EST --
Redoing lost reassignment from SteveD to Amit on Mon, 12 Jun 2006 10:44:24 -0400.
-- Additional comment from petrides on 2006-06-14 18:57 EST --
Reapplying lost comment from jlayton on 2006-06-13 10:02 EST: [note that attachment has been lost]

For this patch and run:

Lock:1 - Offset 100, Len 1, Pid 1588

The relevant log messages are here:

close_files: closing all files for pid=1588 files_struct=db9e5280
locks_remove_posix: pid=-610396224 *before=00000000
locks_remove_posix: pid=1588 fl_owner=db6a5d00 owner=db9e5280 range=100:100
nlmclnt_proc: pid=1592 type=2 r=100:100 status=0
close_files: closing all files for pid=1592 files_struct=db6a5d00
nlmclnt_proc: pid=1598 type=1 r=200:200 status=0
close_files: closing all files for pid=1598 files_struct=dc333780
locks_remove_posix: pid=-610394304 *before=00000000
locks_remove_posix: pid=1598 fl_owner=dc333780 owner=dc333780 range=200:200
nlmclnt_proc: pid=1598 type=2 r=200:200 status=0
close_files: closing all files for pid=1590 files_struct=db9e5080
close_files: closing all files for pid=1597 files_struct=db6a5900
close_files: closing all files for pid=1606 files_struct=db6a5900

So 1592 released the lock around the same time that locks_remove_posix was being called for 1588. The interesting bit is that I did a network capture during this run as well. This packet comes in right near the end, and no more LOCK/UNLOCK calls are made after this one:

387 21.229495 10.10.1.9 -> 172.16.57.196 NLM V4 LOCK Call FH:0xa7c09108 svid:1598 pos:0-0 ; V4 UNLOCK Call FH:0xa7c09108 svid:1592 pos:0-0 ; V4 LOCK Call FH:0xa7c09108 svid:1588 pos:0-0

So we have 3 different calls in this one packet: a LOCK call for PID 1598 (which is for l_offset 200), then an UNLOCK call for l_offset 100 from PID 1592, and then a LOCK call for l_offset 100 from 1588. The server is granting this last lock request, but I think that by the time the response gets back to the client, the process has exited. I wonder if the issue is just timing with the polling loop for locks.
Perhaps we need to disable the polling loop if the process is PF_EXITING?

-- Additional comment from petrides on 2006-06-14 18:59 EST --
Reapplying lost comment from jlayton on 2006-06-13 12:21 EST:

Actually, I think that by the time the process is PF_EXITING, we don't hit the polling loop again anyway. I think what's happening is that the RPC lock call in the polling loop gets issued just before the process gets hit with the signal. The call goes out on the wire before the process can cancel it, but by the time the response gets back, the process has already exited. That said, it looks like (from Steve's patch above) we should be able to handle this situation, so I'm not clear on where it's falling down.

-- Additional comment from jlayton on 2006-06-15 07:47 EST --
Actually, I have a (somewhat vague) hunch about what this problem might be. While bugzilla was down yesterday, I had a situation similar to one I had posted above when I saw this:

# dmesg -s 131072 | grep 200:200 | grep 'cmd=6' | egrep '(1314|1318)'
NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=1314) = 0
NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=1318) = 0
NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=1318) = 0
NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=1318) = 0
NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=1318) = 0
NFS: nfs_lock(f= 9/507952, t=1, fl=1, r=200:200, cmd=6, pid=1318) = 0
NFS: nfs_lock(f= 9/507952, t=2, fl=1, r=200:200, cmd=6, pid=1314) = 0

After seeing something like the above, when locks_remove_posix was called, it seemed to have no knowledge of a lock on 200:200. The issue here is that my debug patches at this point were being applied after the call into the NLM client had returned. This info makes it look like the server is granting the lock to 1318 while 1314 still holds it, but network captures show that this isn't the case.
What's happening here is something like this:

1314 requests the lock and gets it
1314 sends an UNLOCK to the server and then goes to sleep
the server sends a successful unlock reply to the client, but while it's in transit...
1318 sends a lock request
1318 gets the lock
1314 then gets the unlock reply

Now, all of this is legit from the standpoint of the server. What I'm not clear on is whether this out-of-order set of operations, from the standpoint of the client processes, is maybe causing the file_lock lists to look wrong. That is, from the standpoint of the two processes above, the last operation to happen was the unlock by 1314. Might this make the kernel think that the above file was actually unlocked at the time the processes are all exiting? Again, this is just speculation, but it has me thinking that the problem is not in the client/server NLM piece, but in how the file_lock lists are being handled.

-- Additional comment from agud on 2006-06-15 22:06 EST --
Created an attachment (id=131020)
debug patch 1

-- Additional comment from agud on 2006-06-15 22:07 EST --
I tried to see if the last unlock is really called. With the attached debug patch, it seems that the unlock for the buggy process's lock is never called by fs/locks.c:locks_unlock_delete(). The test output shows:

...
Lock:1 - Offset 100, Len 1, Pid 5026
...
# grep 5026 messages-06-15 | tail -n 20
Jun 15 21:02:06 pro1 kernel: lockd: clnt proc returns 0, cmd=6, pid=5026
Jun 15 21:02:06 pro1 kernel: NFS: END nfs_lock(t=1, r=200:200, cmd=6, pid=5026) = 0
Jun 15 21:02:06 pro1 kernel: NFS: nfs_lock(f= a/10506580, t=2, fl=1, r=200:200, cmd=6, pid=5026)
Jun 15 21:02:06 pro1 kernel: NFS: do_unlk(f= a/10506580, t=2, fl=1, r=200:200, cmd=6, pid=5026)
Jun 15 21:02:06 pro1 kernel: lockd: clnt proc returns 0, cmd=5, pid=5026
Jun 15 21:02:06 pro1 kernel: NFS: END nfs_lock(t=1, r=200:200, cmd=5, pid=5026) = 0
Jun 15 21:02:07 pro1 kernel: lockd: clnt proc returns 0, cmd=5, pid=5026
Jun 15 21:02:07 pro1 kernel: NFS: END nfs_lock(t=1, r=200:200, cmd=5, pid=5026) = 0
Jun 15 21:02:07 pro1 kernel: lockd: clnt proc returns 0, cmd=6, pid=5026
Jun 15 21:02:07 pro1 kernel: NFS: END nfs_lock(t=2, r=200:200, cmd=6, pid=5026) = 0
Jun 15 21:02:07 pro1 kernel: NFS: posix_lock_file: caller(5026, dc97a900, 200:200, 2) before(5026, dc97a900, 200:200, 1)
Jun 15 21:02:07 pro1 kernel: NFS: nfs_lock(f= a/10506580, t=1, fl=1, r=100:100, cmd=6, pid=5026)
Jun 15 21:02:08 pro1 kernel: lockd: clnt proc returns 0, cmd=6, pid=5026
Jun 15 21:02:08 pro1 kernel: NFS: END nfs_lock(t=1, r=100:100, cmd=6, pid=5026) = 0
Jun 15 21:02:08 pro1 kernel: NFS: nfs_lock(f= a/10506580, t=1, fl=1, r=100:100, cmd=5, pid=5026)
Jun 15 21:02:08 pro1 kernel: NFS: do_getlk(f= a/10506580, t=1, fl=1, r=100:100, cmd=5, pid=5026)
Jun 15 21:02:08 pro1 kernel: lockd: clnt proc returns 0, cmd=5, pid=5026
Jun 15 21:02:08 pro1 kernel: NFS: END nfs_lock(t=1, r=100:100, cmd=5, pid=5026) = 0
Jun 15 21:02:08 pro1 kernel: lockd: clnt proc returns -4, cmd=5, pid=5026
Jun 15 21:02:08 pro1 kernel: NFS: END nfs_lock(t=1, r=100:100, cmd=5, pid=5026) = -4

Here 5026 never sends an unlock for the range 100:100. I think the file_lock update occurs after the process is terminated; I don't know whether that's possible, at least in certain scenarios. The TCP dump confirms that there was no unlock call.
# tethereal -r tcpdump1 | grep V4 | egrep '(NLM|NFS)' | tail -n 20
88231 18.659381 10.0.0.4 -> 10.0.0.1 NLM V4 LOCK Call FH:0x1d98c2f6 svid:5027 pos:0-0
88232 18.659418 10.0.0.4 -> 10.0.0.1 NLM V4 LOCK Call FH:0x1d98c2f6 svid:5029 pos:0-0
88233 18.659444 10.0.0.4 -> 10.0.0.1 NLM V4 UNLOCK Call FH:0x1d98c2f6 svid:5028 pos:0-0
88234 18.659518 10.0.0.4 -> 10.0.0.1 NLM V4 TEST Call FH:0x1d98c2f6 svid:5032 pos:0-0
88235 18.659534 10.0.0.1 -> 10.0.0.4 NLM V4 TEST Reply (Call In 88228)
88236 18.659553 10.0.0.4 -> 10.0.0.1 NLM V4 LOCK Call FH:0x1d98c2f6 svid:5030 pos:0-0
88238 18.659697 10.0.0.1 -> 10.0.0.4 NLM V4 LOCK Reply (Call In 88231) ; V4 LOCK Reply (Call In 88232) NLM_DENIED ; V4 UNLOCK Reply (Call In 88233)
88240 18.659864 10.0.0.4 -> 10.0.0.1 NLM V4 UNLOCK Call FH:0x1d98c2f6 svid:5027 pos:0-0
88241 18.659930 10.0.0.4 -> 10.0.0.1 NLM V4 TEST Call FH:0x1d98c2f6 svid:5029 pos:0-0
88242 18.659977 10.0.0.4 -> 10.0.0.1 NLM V4 LOCK Call FH:0x1d98c2f6 svid:5028 pos:0-0
88243 18.660000 10.0.0.4 -> 10.0.0.1 NLM V4 LOCK Call FH:0x1d98c2f6 svid:5026 pos:0-0
88244 18.660016 10.0.0.1 -> 10.0.0.4 NLM V4 TEST Reply (Call In 88234); V4 LOCK Reply (Call In 88236)
88246 18.660151 10.0.0.4 -> 10.0.0.1 NLM V4 UNLOCK Call FH:0x1d98c2f6 svid:5030 pos:0-0
88247 18.660172 10.0.0.4 -> 10.0.0.1 NLM V4 LOCK Call FH:0x1d98c2f6 svid:5024 pos:0-0
88249 18.660307 10.0.0.1 -> 10.0.0.4 NLM V4 UNLOCK Reply (Call In 88240) ; V4 TEST Reply (Call In 88241) ; V4 LOCK Reply (Call In 88242) ; V4 LOCK Reply (Call In 88243) NLM_DENIED
88250 18.660342 10.0.0.4 -> 10.0.0.1 NLM V4 LOCK Call FH:0x1d98c2f6 svid:5022 pos:0-0
88252 18.660433 10.0.0.4 -> 10.0.0.1 NLM V4 LOCK Call FH:0x1d98c2f6 svid:5027 pos:0-0
88253 18.660483 10.0.0.1 -> 10.0.0.4 NLM V4 UNLOCK Reply (Call In 88246) ; V4 LOCK Reply (Call In 88247) NLM_DENIED
88254 18.660491 10.0.0.4 -> 10.0.0.1 NLM V4 UNLOCK Call FH:0x1d98c2f6 svid:5028 pos:0-0
88255 18.660592 10.0.0.4 -> 10.0.0.1 NLM V4 TEST Call FH:0x1d98c2f6 svid:5026 pos:0-0

I'm now trying to
get the debug output from the place where the file_lock is updated.

-- Additional comment from agud on 2006-06-16 11:56 EST --
A strange thing I found in yesterday's syslog:

Jun 15 21:02:08 pro1 kernel: NFS: nfs_lock(f= a/10506580, t=1, fl=1, r=100:100, cmd=6, pid=5032)
Jun 15 21:02:09 pro1 kernel: lockd: clnt proc returns -4, cmd=6, pid=5032
Jun 15 21:02:09 pro1 kernel: lockd: clnt proc returns 0, cmd=6, pid=5032
Jun 15 21:02:09 pro1 kernel: NFS: END nfs_lock(t=2, r=100:100, cmd=6, pid=5032) = 0

5032 is the last one to unlock 200:200, but its type 't' apparently changed within the function, from 1 to 2.

-- Additional comment from agud on 2006-06-16 16:15 EST --
With another debug patch for fs/locks.c, it's very clear that the victim process's lock is not registered in file_lock before the locks are cleaned up.

...
Lock:1 - Offset 200, Len 1, Pid 1680
...

$ grep "cmd=6" messages-06-16-1 | grep 1680 | grep "t=1" | grep END
NFS: END nfs_lock(t=1, r=100:100, cmd=6, pid=1680) = -11
NFS: END nfs_lock(t=1, r=200:200, cmd=6, pid=1680) = 0
NFS: END nfs_lock(t=1, r=200:200, cmd=6, pid=1680) = -11
NFS: END nfs_lock(t=1, r=100:100, cmd=6, pid=1680) = -11
NFS: END nfs_lock(t=1, r=200:200, cmd=6, pid=1680) = 0

$ grep locks_insert_lock messages-06-16-1 | grep 1680
NFS: locks_insert_lock: fl(1680, df82e680, 200:200, 1)

1680 takes 2 locks on 200:200, but only 1 is registered, and that's the first one. The second one is not! Whereas for any other process, e.g. 1677:

$ grep locks_insert_lock messages-06-16-1 | grep 1677
NFS: locks_insert_lock: fl(1677, dd48ed80, 200:200, 1)
NFS: locks_insert_lock: fl(1677, dd48ed80, 100:100, 1)
NFS: locks_insert_lock: fl(1677, dd48ed80, 200:200, 1)

$ grep "cmd=6" messages-06-16-1 | grep 1677 | grep "t=1" | grep END
NFS: END nfs_lock(t=1, r=200:200, cmd=6, pid=1677) = -11
NFS: END nfs_lock(t=1, r=200:200, cmd=6, pid=1677) = -11
NFS: END nfs_lock(t=1, r=200:200, cmd=6, pid=1677) = 0
NFS: END nfs_lock(t=1, r=100:100, cmd=6, pid=1677) = 0
NFS: END nfs_lock(t=1, r=200:200, cmd=6, pid=1677) = 0
NFS: END nfs_lock(t=1, r=100:100, cmd=6, pid=1677) = -11

3 = 3! Now, since the lock is seen after the process dies (with ldisp), I guess the lock gets registered in file_lock right when the process is dying, but after the locks are cleaned up. The confusing part is when exactly that happens, and whether it really happens. ldisp shows that file_lock has the lock entry for the process, but my logs don't show when exactly that entry is made.

-- Additional comment from jlayton on 2006-06-17 22:28 EST --
It's not clear to me that ldisp displaying this lock actually means that there is a file_lock entry on the client. From what I can tell, that info comes from the server. I've given this some more thought... Another possible scenario for this situation might be this:

The process requests a lock and the server grants it, but before the reply makes it back to the client, the process is killed. So it never calls locks_insert_lock to put it in the list, but the server considers the lock granted.

If this is the case, some possible fixes might be:

1) have lockd send an NLM_DENIED when it receives an NLM_GRANTED message for a process that is PF_EXITING.

...or, perhaps a more sure way...

2) have the process add the lock to the file_lock list before the request ever goes out on the wire. If it gets denied, then the process could just remove it.
With this, locks_remove_posix should know about the lock and queue up an UNLOCK request for it. I *think* this would be OK, though we might end up with some spurious UNLOCK requests if (for instance) the process was killed while waiting for a lock response that was going to be a DENIED. The only catch here is that I'm not sure how we could do this without affecting the code for locks on local filesystems.

-- Additional comment from agud on 2006-06-17 23:06 EST --

>> Process requests lock and server grants it, but, before the reply makes it back
>> to the client, the process is killed. So, it never calls locks_insert_lock to
>> put it in the list, but the server considers the lock as granted.

ldisp showing the lock surely means that it's hanging on inode->i_flock. But I'm not sure it's really added to file_lock_list at any point. Jeff, I'm thinking of a similar work-around to your (2) above, with a variation: all locks would be added to inode->i_flock and to file_lock_list as soon as they are issued, but before they are accepted. They would be marked NOT_GRANTED. Whenever they are actually granted, the flag would be cleared. While cleaning up, an UNLOCK would be initiated for these locks even if the flag is still set. With this I think locks on other filesystems would work OK too. The other thing I want to look at next is whether the lock entry is really added to file_lock_list, and at what point it is added to inode->i_flock.

-- Additional comment from jlayton on 2006-06-18 09:17 EST --

Created an attachment (id=131118)
patch to test theory in comment #42

Here's a small patch to test the theory in comment #42. I added a new PF_ flag that gets set just before entering the loop to request the lock. We then clear it as we exit the nlmclnt_lock function, and throw a printk if the flag is still set near the end of locks_remove_posix.
I'm not certain that there won't be false positives from this, or that we might not end up with locks_remove_posix getting called from lockd somehow, but if there is a dangling lock due to the situation in comment #42, then I think this will catch it. Building a kernel with this now -- I'll plan to test it either later today or tomorrow...

-- Additional comment from jlayton on 2006-06-18 11:23 EST --

Well, I reproduced it twice on RHEL4 and didn't see the message I thought I would, but I did see this -- once for each time that I reproduced it:

nlmclnt_lock: VFS is out of sync with lock manager!

-- Additional comment from agud on 2006-06-18 22:27 EST --

Created an attachment (id=131125)
debug patch 2

With the attached patch, I think it's clear now that on the client side the lock is not recorded on the file_lock_list. So the problem could possibly be resolved by comments #42 or #43.

-- Additional comment from agud on 2006-06-20 09:53 EST --

Created an attachment (id=131196)
debug patch 3

When clearing the locks (locks_remove_posix), I was trying to send UNLOCK requests for the requests that are hanging off fl_block of the file_lock and whose processes are PF_EXITING. But fl_block is empty (and rightly so, because we are not issuing F_SETLKW, but either F_SETLK or F_UNLCK -- realized that later).

-- Additional comment from agud on 2006-06-21 09:35 EST --

Created an attachment (id=131281)
fix 1

Super Quick Fix. I'm sure there can be some better way to fix this. The problem is the fcntl/fclose race, which is handled, but the handling doesn't work. After returning from posix_lock_file in fcntl_setlk, an error due to lock conflicts detected in posix_lock_file did not cause the recently acquired lock to be revoked. I changed the code and I haven't seen the bug for many (> 10) runs now. This is on RHEL3 on pro1 mounting gfiler.

-- Additional comment from staubach on 2006-06-21 10:01 EST --

"but doesn't work"? How doesn't it work and why?
Changing the code to release a lock which was just acquired does not seem to make sense. That's what this "Super Quick Fix" seems to do. Also, does it reintroduce the fcntl/close race?

-- Additional comment from agud on 2006-06-21 10:38 EST --

My understanding is that, for the case at hand, the lock is granted by the server and nfs_lock returns successfully. But posix_lock_file detects a conflict, returns -EAGAIN, and does not record the lock granted by the server (in inode->i_flock). This lock goes unrecorded. So my intention was: if posix_lock_file fails for some reason, release the lock which was just acquired (posix_lock_file was executed only because nfs_lock granted the lock). The current code doesn't handle this case. The error returned by posix_lock_file is not conveyed to the server, which is OK for F_UNLCK, but not OK otherwise. Even if we check in posix_lock_conflict to see whether the conflicting process is PF_EXITING, I'm not sure how we would record the acquired lock without letting filp_close clean up inode->i_flock.

-- Additional comment from staubach on 2006-06-21 10:51 EST --

Why would the server grant a lock, which apparently did not conflict with other locks, but then the client reject the lock due to a conflict?

-- Additional comment from agud on 2006-06-21 16:09 EST --

Created an attachment (id=131319)
debug messages

Bug resolution.

Cause: A race between two processes, P1 and P2, updating the i_flock field (I'd call it an unlock/lock race and not necessarily an fcntl/close race). This sequence is the reason for the bug:

---
P1 requests the server for the lock
Server grants the lock
P1 sends a release request for the lock
Server grants the release request...
...but before this release response reaches P1 and i_flock is updated, P2 gets going...
P2 requests the server for the same lock
Server grants the lock
P2 tries to update the i_flock field, but finds conflicts
BUG: the error goes undetected
P1 runs and updates i_flock
---

The bug is that the conflict is detected but the server is not sent the unlock request, so the server thinks it has granted the lock (that's the reason why ldisp shows the lock being held by a non-existent process).

Resolution: Handle the error returned by posix_lock_file and send an unlock if posix_lock_file returns an error, whatever the cause (no memory for lock structures, locks not POSIX at all, or conflicts).

Now the question is: this is really a VFS bug, so why is it not seen on other filesystems? My answer would be that most filesystems DON'T implement their own lock f_op operation, so it doesn't matter that the filesystem-specific unlock is not issued when lock conflicts occur.

The primary reason for the bug is:
- call fs->lock(LOCK)
- check for conflicts
- AND _don't_ call fs->lock(UNLOCK) if conflicts are present

I'd like to test for this bug on any filesystem which implements its own lock f_op operation. For now, the fix works for RHEL3; I will need to see if it works on other kernels. AG

-- Additional comment from staubach on 2006-06-21 16:15 EST --

This seems like a serialization sort of error, and one that the system should handle transparently. Sending an unlock to the server when an error is received seems like a brute-force way of handling this; there should be a better way. I don't think that the server should be sent the unlock request, because the lock should exist. The conflict is only in the client's accounting of the data structures.

-- Additional comment from agud on 2006-06-21 18:41 EST --

Well, that is what the comment and the code (which I removed) intend to do -- send the unlock on conflict. At most we can put the acquired lock in the fl_block list, to be woken up later when i_flock is cleaned up.
I'm trying to find a catch for this and why this was not done in the first place. I'm also looking at how this could be done in the 2.6 kernel, where things have changed a bit.

-- Additional comment from agud on 2006-06-21 20:57 EST --

I think the fix is apt and appropriate: for blocking calls the code hangs the process off fl_block, and for non-blocking ones we have no option but to return the error code -- BUT only after releasing the lock that was just acquired.

-- Additional comment from staubach on 2006-06-22 08:40 EST --

Without some more information, I don't agree. The lock was granted by the server, so clearly there is no conflict, despite whatever the client is showing in this situation. Releasing a lock which was granted, and returning an error to the application, just seems wrong. I'd like to see the real problem addressed and not a symptom.

-- Additional comment from agud on 2006-06-23 09:54 EST --

> Releasing a lock which was granted and returning an error to the application
> just seems wrong.

The lock has to be released for the following reasons:
- it's a non-blocking lock
- we cannot record the lock anywhere else
- the lock we acquired has conflicts, and resolving those conflicts may take cycles the process doesn't want to wait for
- we cannot wait until i_flock is cleared (which may well be due to a bug in some other part of the system and not necessarily a lock/unlock race)
- and since we cannot wait, the server has to be informed

If we are not to send the unlock request, the question is: how are we going to resolve the conflict (update i_flock) without blocking the process?

-- Additional comment from staubach on 2006-06-23 10:04 EST --

I might suggest checking for the conflict ahead of time in order to avoid this situation? Once again, this feels to me like addressing a symptom of a synchronization problem on the client. It would be better to correct the checking and synchronization in order to avoid this situation in the first place.
-- Additional comment from steved on 2006-06-26 07:03 EST --

Created an attachment (id=131521)
proposed patch

I think the following patch more directly addresses the problem without affecting other race conditions. Whenever the filesystem allocates the lock but it cannot be recorded (for whatever reason), the lock has to be cleaned up. Better synchronization probably would help in this case and should be pursued in the RHEL4 world (assuming this bug exists there), but a lock that is allocated and not recorded is a very bad thing regardless of whether it's a synchronization issue or not... So it's not clear that fixing the synchronization would be a "catch all" the way I hope this patch is...

-- Additional comment from imcleod on 2006-06-29 10:56 EST --

I've compiled a 2.4.21-37.EL SMP kernel with the patch from comment #59 and re-run our tests. I'm afraid I am still able to consistently reproduce stale locks.

-- Additional comment from agud on 2006-06-29 12:24 EST --

I tried hard on 2.4.21-44.EL-ag SMP + steved's fix again, but I could not reproduce the bug. I'm going to try on 2.4.21-37.EL SMP.
I gave another go at reproducing this again today. I'm unable to do so on any recent RHEL4 kernel. I'm going to close this as NOTABUG and presume that my earlier testing that showed this bug in RHEL4 was a mistake, or that we have inadvertently fixed it with some other patch since that testing.