Bug 59245 - Hard kernel crash results in stuck NFS locks
Summary: Hard kernel crash results in stuck NFS locks
Keywords:
Status: CLOSED DUPLICATE of bug 76065
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: nfs-utils
Version: 7.2
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Stephen Tweedie
QA Contact: Brian Brock
URL:
Whiteboard:
Duplicates: 64757
Depends On:
Blocks: 67218
Reported: 2002-02-03 20:20 UTC by Havoc Pennington
Modified: 2014-01-21 22:48 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2002-07-12 21:00:15 UTC
Embargoed:


Attachments:
test app for debugging this (2.03 KB, text/plain), 2002-03-06 17:24 UTC, Havoc Pennington

Description Havoc Pennington 2002-02-03 20:20:26 UTC
Description of Problem:

If the kernel crashes (or someone pulls the power cord) while an fcntl()
lock is held on a file in an NFS directory, the lock does not appear to be
released when the machine comes back up, as it is supposed to be.

(I'm not sure whether this is a kernel bug; I don't know what actually releases
the locks on restart or where things go wrong.)

How Reproducible:

Create a file in an NFS-mounted directory; run a program that locks it
then goes to sleep holding the lock; crash system; on restart, try to re-lock
the file. Note that there's a stale lock.
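
The "locks it then goes to sleep" step can be sketched roughly as follows. This is not the program used for the report; the function name hold_lock and the file name are made up for illustration, and it assumes Python's fcntl.lockf(), which wraps the fcntl() record-locking calls.

```python
import fcntl
import os
import time


def hold_lock(path, hold_seconds):
    """Take an exclusive fcntl()-style lock on path and hold it for hold_seconds."""
    # LOCK_EX needs a descriptor opened for writing.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Blocking request: waits until the lock is granted.
        fcntl.lockf(fd, fcntl.LOCK_EX)
        print(f"holding lock on {path} (pid {os.getpid()})")
        # Crash or power-cycle the machine during this sleep to reproduce.
        time.sleep(hold_seconds)
    finally:
        # Closing the descriptor releases the lock on a clean exit.
        os.close(fd)
    return True
```

Calling hold_lock("testlockfile", 3600) from inside the NFS-mounted directory leaves an hour in which to crash the machine.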

Comment 1 Havoc Pennington 2002-03-03 22:35:02 UTC
Lots of people using Rawhide are encountering this because GNOME now requires
NFS locking to work correctly. If possible, I would like the next release to ship
with a fix; can I encourage someone to bump this issue up in the priority queue,
so I don't have to dup this bug over and over after the next release ships? ;-)
Trond and msw have both hit this bug on their workstations, for example.

Comment 2 Arjan van de Ven 2002-03-04 10:19:55 UTC
This is interesting in several ways:
1) Not all NFS servers implement fcntl locking
2) We cannot fix all NFS servers, only the Red Hat Linux ones (if at all)
3) locking over a stateless protocol is always "interesting"


Comment 3 Stephen Tweedie 2002-03-04 11:04:13 UTC
NFS locking is not a stateless protocol.  The locking is entirely separate from
NFS.  The lock protocol relies on local persistent state to make sure that on
reboot, the clients tell the server that they have entered a clean new boot: the
server should drop the client's locks at that point.  (If the client comes up
with a different IP address, this obviously fails.)

While the client is rebooting, the lock remains: there is no timeout.  

If the server reboots, the same notification mechanism is used to tell the
clients of the fact, so that clients can retake their lost locks.

I tested all of this on some 7.1/7.2 boxes about 3 or 4 months ago.  You need
the latest nfs-utils for 7.0/7.1, but 7.2 should work OK.  So, is this entirely
repeatable for you?  I'll have a dig to see why if so.
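
To make the notification scheme above concrete, here is a toy model of it. It is only a sketch: ToyLockServer, notify_reboot and crash_and_relock are invented names, and nothing here reflects how rpc.statd or lockd are actually implemented. It only shows why a missing notification, or a notification from a new address, leaves the old lock in place.

```python
class ToyLockServer:
    """Toy lock server: remembers which (address, owner) holds each file."""

    def __init__(self):
        self.locks = {}  # file name -> (client address, owner id)

    def lock(self, client_addr, owner, filename):
        holder = self.locks.get(filename)
        if holder is not None and holder != (client_addr, owner):
            return False  # conflicting lock; there is no timeout
        self.locks[filename] = (client_addr, owner)
        return True

    def notify_reboot(self, client_addr):
        """A client reports a fresh boot: drop all locks held from its address."""
        self.locks = {
            name: holder
            for name, holder in self.locks.items()
            if holder[0] != client_addr
        }


def crash_and_relock(old_addr, new_addr, notify):
    """Lock, 'crash', optionally notify from new_addr, then try to lock again."""
    server = ToyLockServer()
    server.lock(old_addr, "pid-before-crash", "testlockfile")
    # The client crashes here; the server still holds its lock.
    if notify:
        server.notify_reboot(new_addr)
    return server.lock(new_addr, "pid-after-crash", "testlockfile")
```

In this model the relock succeeds only when the rebooted client notifies the server from the same address it held the lock under.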

Comment 4 Havoc Pennington 2002-03-04 13:36:56 UTC
arjanv: yeah, not perfect, but I tried using userspace lock hacks, and they 
are just broken. I need a lock that is per-user, not one that's per-machine;
locking in the home dir using kernel locks is the best I can come up with.

sct: tons of people are reporting this to me, always post-hard-crash. I haven't
tried doing it over and over to see if it only fails sometimes, but it happens
a lot. It's happened twice that I know of with 7.2-ish clients and devserv on the
server side, once when devserv was 7.0/7.1 and once after our upgrade to 7.2. (By
7.2-ish I mean people tend to install random rawhide bits around here.)

You say "If the client comes up with a different IP address, this obviously
fails." - I wonder if the problem is DHCP machines, as all of them here are,
including the test boxes. I hadn't thought about that before - it may well
explain the whole thing. Which puts me back at square one.

I can throw up a dialog on login that says "these locks are kind of
stale-looking do you want to delete them?" but I'm worried people will just
click "yes" and corrupt their configuration. Plus I'm not sure how to establish
"stale-looking," maybe empty output from fuser, dunno.

OK, I should test with fixed IP addresses and see if that resolves the problem.
Leaving needinfo.

Comment 5 Stephen Tweedie 2002-03-04 13:53:40 UTC
DHCP ought to keep the IP address over reboot if the lease is set to be a
reasonable length.  If the IP changes, NFS has simply no way to deal with this.

It's also possible that nfs-utils is broken again.  The chroot patch is very
fragile, and if anything in the chroot or libc breaks the resolver, rpc.statd
(the mechanism for notifying hosts of reboots) will die.

Comment 6 Havoc Pennington 2002-03-06 17:24:15 UTC
Created attachment 47621
test app for debugging this

Comment 7 Havoc Pennington 2002-03-06 17:26:20 UTC
Using the attached test program, I mounted my home directory, 
ran "testlock testlockfile" which showed an IP address and 
then went to sleep holding a lock; I hit the power button, 
and after coming back up ran "testlock testlockfile" again, getting 
the same IP address, but the lock failed on EAGAIN.
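
For readers without the attachment, the post-reboot check can be approximated like this. It is a sketch under assumptions, not the attached test app: try_lock and local_address are illustrative names, and the lock is requested through Python's fcntl.lockf() in non-blocking mode so that a conflicting lock shows up as EAGAIN (or EACCES on some systems) instead of a hang.

```python
import errno
import fcntl
import os
import socket


def local_address():
    """Best-effort lookup of this host's address; 'unknown' if it cannot resolve."""
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        return "unknown"


def try_lock(path):
    """Try a non-blocking exclusive lock; return (acquired, errno_name)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True, None
    except OSError as exc:
        # A lock held by someone else is reported as EAGAIN or EACCES.
        if exc.errno in (errno.EAGAIN, errno.EACCES):
            return False, errno.errorcode[exc.errno]
        raise
    finally:
        # The probe does not keep the lock: closing the descriptor drops it.
        os.close(fd)
```

Because these locks are owned per process, the probe has to run in a different process (here, after a reboot) from the one that took the original lock.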

Comment 8 Stephen Tweedie 2002-03-06 18:06:09 UTC
What version of nfs-utils?  What version of the kernel?  Is "rpc.statd" running?
What are the IP addresses of host and client, and what do the contents of
/var/lib/nfs/statd/ look like on each once you have acquired the lock?

Comment 9 Havoc Pennington 2002-03-07 23:01:04 UTC
(Client is a beta2 system)

client has nfs-utils-0.3.3-2
client has kernel-2.4.18-0.1

rpc.statd _is_ running on client

client IP is 172.16.59.219
server IP is 172.16.52.28

Client /var/lib/nfs/statd while holding the lock:
 sm/172.16.58.1
 sm.bak/
 state

sm.bak is empty, file "state" contains "^A^@^@^@" (actual control chars, i.e. 4
bytes)
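
As a side note on those 4 bytes, they can be read as an integer. The sketch below assumes the file holds a single 4-byte little-endian integer (the client is i386) and that it is the status monitor's state number, which by NSM convention is odd while the host is up; decode_state and host_is_up are illustrative names.

```python
import struct


def decode_state(raw):
    """Interpret 4 raw bytes as a little-endian signed integer."""
    (state,) = struct.unpack("<i", raw)
    return state


def host_is_up(state):
    """Assumed NSM convention: odd state number means up, even means down."""
    return state % 2 == 1
```

Under those assumptions, the bytes ^A^@^@^@ decode to state number 1.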

Server /var/lib/nfs/statd while holding lock:
 
[root@devserv root]# ls -R /var/lib/nfs/statd
/var/lib/nfs/statd:
etc  sm  sm.bak  state

/var/lib/nfs/statd/etc:
resolv.conf

/var/lib/nfs/statd/sm:
172.16.56.104  172.16.56.48  172.16.56.72  172.16.56.89  172.16.57.6
172.16.56.125  172.16.56.66  172.16.56.77  172.16.57.17  172.16.59.215
172.16.56.46   172.16.56.70  172.16.56.80  172.16.57.4   172.16.59.219

/var/lib/nfs/statd/sm.bak:
[root@devserv root]#

Client after hitting power switch, booting, and reattempting to lock
the file has exactly the same /var/lib/nfs/statd contents that it had
prior to cutting power.

Server after client has come back and reattempted the lock (but failed to get
the lock):

[root@devserv root]# ls -R /var/lib/nfs/statd
/var/lib/nfs/statd:
etc  sm  sm.bak  state

/var/lib/nfs/statd/etc:
resolv.conf

/var/lib/nfs/statd/sm:
172.16.56.104  172.16.56.48  172.16.56.80  172.16.57.4  172.16.59.215
172.16.56.125  172.16.56.66  172.16.56.89  172.16.57.6  172.16.59.219
172.16.56.46   172.16.56.72  172.16.57.17  172.16.58.2

/var/lib/nfs/statd/sm.bak:
[root@devserv root]#

Client IP is still 172.16.59.219 when it comes back up post-crash.


Comment 10 Stephen Tweedie 2002-03-08 17:54:00 UTC
So there's a big problem here.

client IP is 172.16.59.219
server IP is 172.16.52.28

Client /var/lib/nfs/statd while holding the lock:
 sm/172.16.58.1

So, the client has obtained a lock, but has not set up a monitor notification to
the server IP address --- there should be a 172.16.52.28 IP file in statd/sm for
that.  So no wonder the client isn't telling the server about its stale locks
after a reboot.

Is there anything in the client log about rpc.statd failing to monitor the server?
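
The check on statd/sm described above can be scripted along these lines. This is a sketch only: it assumes monitored peers show up as files named by IP address under the sm subdirectory, as in the listings in this report, and monitored_peers and is_monitoring are made-up helper names.

```python
import os


def monitored_peers(statd_dir):
    """Return the sorted entry names found in <statd_dir>/sm."""
    return sorted(os.listdir(os.path.join(statd_dir, "sm")))


def is_monitoring(statd_dir, peer):
    """True if there is an sm entry named after the given peer address."""
    return peer in monitored_peers(statd_dir)
```

On the client in this report, is_monitoring("/var/lib/nfs/statd", "172.16.52.28") would be expected to return False, since the only entry listed under sm is 172.16.58.1.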

Comment 11 Havoc Pennington 2002-03-08 19:32:22 UTC
I don't see anything in /var/log/messages; just 
the "Version 0.3.3 starting" message.

Comment 12 Seth Vidal 2002-05-13 14:56:51 UTC
Is there a reason not to move to nfs-utils 1.0 or the VERY recently released 1.0.1?

Might they not correct some of these problems?



Comment 13 Stephen Tweedie 2002-05-13 15:52:10 UTC
They have not been tested.  Upstream changes are often as likely to introduce
new problems as to fix old ones.  Auditing the new version for specific changes
which look important might be useful, though.

Comment 14 Seth Vidal 2002-05-18 02:03:37 UTC
Here is the changelog for the nfs-utils packages from the sourceforge cvsview
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/nfs/nfs-utils/ChangeLog?rev=1.170&content-type=text/vnd.viewcvs-markup


Most of the changes don't look risky enough to horribly break anything, but
installing it and then running a Connectathon or fsx test against it might be
worth doing.


Comment 15 Havoc Pennington 2002-07-06 23:00:41 UTC
*** Bug 64757 has been marked as a duplicate of this bug. ***

Comment 16 Stephen Tweedie 2002-11-11 22:19:55 UTC

*** This bug has been marked as a duplicate of 76065 ***

