Description of Problem:
If the kernel crashes (or someone pulls the power cord) while an fcntl() lock is held on a file in an NFS-mounted directory, the lock does not appear to be released when the machine comes back up, as it should be. (I'm not sure whether this is a kernel bug; I don't know who actually releases the locks on restart or where things go wrong.)

How Reproducible:
1. Create a file in an NFS-mounted directory.
2. Run a program that locks it and then goes to sleep holding the lock (see the sketch below).
3. Crash the system.
4. On restart, try to re-lock the file. Note that there's a stale lock.
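A minimal sketch of the kind of "lock and sleep" program described above (illustrative only; this is not the test app attached later in this bug). It takes an exclusive fcntl() lock on the given file and then sleeps while holding it:

    /* lockhold.c -- illustrative reproducer: grab an fcntl() write lock and sleep */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct flock fl;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <file-on-nfs-mount>\n", argv[0]);
            return 1;
        }

        fd = open(argv[1], O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        memset(&fl, 0, sizeof(fl));
        fl.l_type = F_WRLCK;    /* exclusive lock...        */
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;           /* ...over the whole file   */

        /* Non-blocking request: after the crash and reboot, this is where
         * the stale lock shows up as an EACCES/EAGAIN failure. */
        if (fcntl(fd, F_SETLK, &fl) < 0) {
            perror("fcntl(F_SETLK)");
            return 1;
        }

        printf("lock acquired on %s; sleeping -- crash the machine now\n", argv[1]);
        sleep(3600);    /* hold the lock while the client is power-cycled */
        return 0;
    }

Compile with "gcc -o lockhold lockhold.c", run it against a file on the NFS mount, power-cycle the client while it sleeps, then run it again after the reboot.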
Lots of people using Rawhide are encountering this because GNOME now requires NFS locking to work correctly. If possible I'd like the next release to ship with a fix; can I encourage someone to bump this issue up in the priority queue, so I don't have to dup this bug over and over after the next release ships? ;-) Trond and msw have both hit this bug on their workstations, for example.
This is interesting in several ways:
1) Not all NFS servers implement fcntl() locking.
2) We cannot fix all NFS servers, only the Red Hat Linux ones (if at all).
3) Locking over a stateless protocol is always "interesting".
NFS locking is not a stateless protocol; the locking is entirely separate from NFS. The lock protocol relies on local persistent state to make sure that, on reboot, the client tells the server that it has entered a clean new boot: the server should drop the client's locks at that point. (If the client comes up with a different IP address, this obviously fails.) While the client is rebooting, the lock remains: there is no timeout. If the server reboots, the same notification mechanism is used to tell the clients, so that they can retake their lost locks.

I tested all of this on some 7.1/7.2 boxes about 3 or 4 months ago. You need the latest nfs-utils for 7.0/7.1, but 7.2 should work OK. So, is this entirely repeatable for you? If so, I'll have a dig to see why.
arjanv: yeah, it's not perfect, but I tried using userspace lock hacks and they are just broken. I need a lock that is per-user, not per-machine; locking in the home directory using kernel locks is the best I can come up with.

sct: tons of people are reporting this to me, always after a hard crash. I haven't tried doing it over and over to see if it only fails sometimes, but it happens a lot. It's happened twice that I know of with 7.2-ish clients and devserv on the server side: once when devserv was 7.0/7.1 and once after our upgrade to 7.2. (By 7.2-ish I mean people tend to install random Rawhide bits around here.)

You say "If the client comes up with a different IP address, this obviously fails." I wonder if the problem is DHCP machines, as all of the machines here are, including the test boxes. I hadn't thought about that before; it may well explain the whole thing. Which puts me back at square one. I can throw up a dialog on login that says "these locks are kind of stale-looking, do you want to delete them?", but I'm worried people will just click "yes" and corrupt their configuration. Plus I'm not sure how to establish "stale-looking"; maybe empty output from fuser, I don't know.

OK, I should test with fixed IP addresses and see if that resolves the problem. Leaving needinfo.
DHCP ought to keep the IP address across a reboot if the lease is set to a reasonable length. If the IP changes, NFS simply has no way to deal with it. It's also possible that the nfs-utils are broken again: the chroot patch is very fragile, and if anything in the chroot or libc breaks the resolver, rpc.statd (the mechanism for notifying hosts of reboots) will die.
Created attachment 47621 [details] test app for debugging this
Using the attached test program, I mounted my home directory and ran "testlock testlockfile", which showed an IP address and then went to sleep holding the lock. I hit the power button, and after coming back up ran "testlock testlockfile" again; it reported the same IP address, but the lock attempt failed with EAGAIN.
What version of nfs-utils? What version of the kernel? Is "rpc.statd" running? What are the IP addresses of the host and client, and what do the contents of /var/lib/nfs/statd/ look like on each once you have acquired the lock?
(Client is a beta2 system.)

Client has nfs-utils-0.3.3-2 and kernel-2.4.18-0.1, and rpc.statd _is_ running on the client.
Client IP is 172.16.59.219; server IP is 172.16.52.28.

Client /var/lib/nfs/statd while holding the lock:

    sm/172.16.58.1
    sm.bak/
    state

sm.bak is empty; the file "state" contains "^A^@^@^@" (actual control characters, i.e. 4 bytes).

Server /var/lib/nfs/statd while holding the lock:

    [root@devserv root]# ls -R /var/lib/nfs/statd
    /var/lib/nfs/statd:
    etc  sm  sm.bak  state

    /var/lib/nfs/statd/etc:
    resolv.conf

    /var/lib/nfs/statd/sm:
    172.16.56.104  172.16.56.48  172.16.56.72  172.16.56.89  172.16.57.6
    172.16.56.125  172.16.56.66  172.16.56.77  172.16.57.17  172.16.59.215
    172.16.56.46   172.16.56.70  172.16.56.80  172.16.57.4   172.16.59.219

    /var/lib/nfs/statd/sm.bak:
    [root@devserv root]#

After hitting the power switch, booting, and reattempting to lock the file, the client has exactly the same /var/lib/nfs/statd contents it had prior to cutting power.

Server after the client has come back and reattempted the lock (but failed to get it):

    [root@devserv root]# ls -R /var/lib/nfs/statd
    /var/lib/nfs/statd:
    etc  sm  sm.bak  state

    /var/lib/nfs/statd/etc:
    resolv.conf

    /var/lib/nfs/statd/sm:
    172.16.56.104  172.16.56.48  172.16.56.80  172.16.57.4   172.16.59.215
    172.16.56.125  172.16.56.66  172.16.56.89  172.16.57.6   172.16.59.219
    172.16.56.46   172.16.56.72  172.16.57.17  172.16.58.2

    /var/lib/nfs/statd/sm.bak:
    [root@devserv root]#

Client IP is still 172.16.59.219 when it comes back up post-crash.
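For what it's worth, the 4-byte "state" file appears to hold rpc.statd's NSM state counter; the "^A^@^@^@" above would be the value 1 on a little-endian client. A small illustrative sketch to print it, assuming the file really is just a single host-byte-order 32-bit integer (an assumption, not something verified in this bug):

    /* readstate.c -- illustrative: dump the number stored in statd's state file */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        FILE *f = fopen("/var/lib/nfs/statd/state", "rb");
        uint32_t state;

        if (!f) {
            perror("fopen");
            return 1;
        }
        if (fread(&state, sizeof(state), 1, f) != 1) {
            fprintf(stderr, "short read\n");
            fclose(f);
            return 1;
        }
        fclose(f);

        /* this counter is what peers use to detect that the host rebooted */
        printf("NSM state number: %u\n", (unsigned)state);
        return 0;
    }

If statd is working, this number should normally change across reboots; a value that never moves after a power cycle would be another hint that the reboot notification isn't happening.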
So there's a big problem here. From the above:

    client IP is 172.16.59.219
    server IP is 172.16.52.28

    Client /var/lib/nfs/statd while holding the lock:
    sm/172.16.58.1

So the client has obtained a lock, but has not set up a monitor notification for the server's IP address --- there should be a 172.16.52.28 file in statd/sm for that. No wonder the client isn't telling the server about its stale locks after a reboot. Is there anything in the client log about rpc.statd failing to monitor the server?
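A quick way to check for that condition would be a small helper like the following (a hypothetical diagnostic sketch, not part of nfs-utils): it scans statd's sm/ directory for an entry matching the server's address.

    /* checkmon.c -- illustrative: is a given host listed in statd's sm/ directory?
     * Usage: checkmon <server-ip>   (e.g. checkmon 172.16.52.28) */
    #include <stdio.h>
    #include <string.h>
    #include <dirent.h>

    int main(int argc, char **argv)
    {
        const char *dirpath = "/var/lib/nfs/statd/sm";
        DIR *dir;
        struct dirent *de;
        int found = 0;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <server-ip>\n", argv[0]);
            return 2;
        }

        dir = opendir(dirpath);
        if (!dir) {
            perror(dirpath);
            return 2;
        }

        while ((de = readdir(dir)) != NULL) {
            if (strcmp(de->d_name, argv[1]) == 0) {
                found = 1;
                break;
            }
        }
        closedir(dir);

        /* if the server is not listed, it will never be told about our reboot */
        printf("%s is %smonitored by statd\n", argv[1], found ? "" : "NOT ");
        return found ? 0 : 1;
    }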
I don't see anything in /var/log/messages; just the "Version 0.3.3 starting" message.
Is there a reason not to push to nfs-utils 1.0, or the VERY recently released 1.0.1? Might they not correct some of these problems?
They have not been tested. Upstream changes are often as likely to introduce new problems as to fix old ones. Auditing the new version for specific changes which look important might be useful, though.
Here is the changelog for the nfs-utils packages from the SourceForge CVS viewer: http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/nfs/nfs-utils/ChangeLog?rev=1.170&content-type=text/vnd.viewcvs-markup Most of the changes don't look invasive enough to horribly break anything, but installing the new version and then running a Connectathon or fsx test against it might be worth doing.
*** Bug 64757 has been marked as a duplicate of this bug. ***
*** This bug has been marked as a duplicate of 76065 ***