Bug 146773 - [Symantec RHEL4.6 bug] lockd does not enter grace period if nfsd restarts on machine acting as both nfs server and nfs client
Summary: [Symantec RHEL4.6 bug] lockd does not enter grace period if nfsd restarts on ...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Wendy Cheng
QA Contact: Ben Levenson
URL:
Whiteboard:
Depends On: 245197
Blocks: 176344 198694 199586 217201 234251 245198
 
Reported: 2005-02-01 15:53 UTC by kiran mehta
Modified: 2018-10-19 20:57 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-08-10 18:16:00 UTC
Target Upstream Version:
Embargoed:



Description kiran mehta 2005-02-01 15:53:10 UTC
Description of problem:
Consider a system, system1, which acts as an NFS server.
It also acts as an NFS client for some other machines.
If nfsd on system1 is killed, lockd keeps running because
it is still required by the NFS client on system1.
If nfsd is later restarted on system1
(lockd is already running), lockd does not enter its
grace period.


Version-Release number of selected component (if applicable):
nfs-utils-1.0.6-46

How reproducible:
always


Steps to Reproduce:
1. Have a machine (system1) which acts as an NFS server.
   It also acts as a client of some other NFS server.
2. Allow clients of system1 to acquire some locks.
3. Kill nfsd on system1.
4. After some time (an interval greater than lockd's grace
   period) restart nfsd on system1.
   (As lockd is already running, it won't be restarted.)
   lockd does not enter its grace period. (A command sketch of these
   steps follows below.)
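
A minimal command-level sketch of these steps (the rpc.nfsd thread counts and the
90-second default grace period are assumptions; the report does not say exactly how
nfsd was killed):

  # on system1, which is both NFS server and NFS client
  rpc.nfsd 0        # stop all nfsd threads; lockd stays up because the local NFS client still needs it
  cat /proc/locks   # locks granted to system1's NFS clients are still listed here
  sleep 120         # wait longer than lockd's default 90-second grace period
  rpc.nfsd 8        # start nfsd threads again; lockd was never restarted, so it does not re-enter its grace period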
  
Actual results:
Lock reclaiming may not occur as expected, because lockd
does not enter its grace period.

Expected results:
When nfsd restarts, lockd should enter its grace period
irrespective of whether or not lockd itself restarts (i.e. it
may already be running).

Additional info:

Comment 5 Neil Horman 2005-03-30 21:32:53 UTC
How are you determining that lockd has entered its grace period?  Are you
looking for SM_NOTIFY rpc transactions on the network?  nfsd just implements the
stateless nfs protocol.  I think you need to kill rpc.statd and rpc.lockd to
send the NSM into its grace period, and have those SM_NOTIFY transactions sent.
Can you re-run your test by killing rpc.statd and rpc.lockd, to see if you get
the same results?  Thanks
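
One rough way to watch for those SM_NOTIFY transactions on the wire with tethereal
(a sketch; the display filter is an assumption, with 100024 being the standard
NSM/status RPC program number):

  tethereal -R 'rpc.program == 100024'   # show NSM traffic; SM_NOTIFY calls appear here when statd announces a restart
  service nfslock restart                # restarting statd via the nfslock service is one way to trigger the notifications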

Comment 6 kiran mehta 2005-03-31 04:35:33 UTC
1. It's not possible to kill lockd, since the same machine acts as both NFS
   client and NFS server.
2. lockd is used by both the NFS client and the NFS server.

3. I am looking at log messages (with full debugging on), tethereal output
   (for the network transactions) and the server-side /proc/locks file to
   check whether lockd is entering its grace period or not.


Ideally, lockd should enter its grace period whenever nfsd restarts.

Comment 7 Neil Horman 2005-04-04 15:41:18 UTC
How are you killing nfsd?  Are you running "service nfs stop"?  Are you sending
it a signal via kill manually?  If so, which signal?

I've done some research, and simply restarting nfsd is not necessarily enough
for lockd to enter a grace period.  Only if a filesystem is unexported will
lockd enter a grace period for locks held on that export.  Depending on how
nfsd is stopped or killed, those filesystems may or may not be unexported.
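
A quick way to see whether the exports are actually being torn down in each case
(a sketch; none of these commands come from the reporter):

  exportfs -v                # list what is currently exported, with options
  cat /proc/fs/nfs/exports   # the kernel's view of the export table
  rpc.nfsd 0                 # stopping the nfsd threads directly leaves the export table in place
  exportfs -au               # explicitly unexporting everything is the case where, per the above, lockd would enter a grace period for those exports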

Comment 11 kiran mehta 2005-04-20 12:55:35 UTC
This problem is no longer present.
It existed only on RHEL4 RC1.
It is fixed in RHEL4 GA.

Comment 14 Jay Turner 2005-06-21 15:14:29 UTC
QE ack.

Comment 22 Andrius Benokraitis 2006-06-20 18:16:11 UTC
Action for Symantec: Is this issue still exhibited in the RHEL4 U4 beta?

Comment 23 kiran mehta 2006-06-21 04:55:11 UTC
This problem was fixed in RHEL4 GA. Do you still want me to verify this on
RHEL4 U4? If yes, I will verify and post the result by Monday.

Comment 24 Andrius Benokraitis 2006-06-21 05:06:16 UTC
Wow, so why is this issue still open, perchance? Are there any actions for Red
Hat on this, or can this be closed? Thanks!

Comment 25 kiran mehta 2006-06-21 05:26:29 UTC
Yes, this can be closed (as per my testing on RHEL4 GA)

Comment 26 kiran mehta 2006-06-21 12:58:40 UTC
Sorry for the confusion. After going through issue 69397, I realized that
this problem reappeared on RHEL4 U2 GA. I will verify whether it still exists
and will send an update.


Comment 30 RHEL Program Management 2006-08-16 21:06:45 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this enhancement by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This enhancement is not yet committed for inclusion in an Update
release.

Comment 41 RHEL Program Management 2006-09-12 13:45:23 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request. 

Comment 59 Wendy Cheng 2007-07-03 18:21:48 UTC
Re-examining this bugzilla for the RHEL 4.6 update. Apparently somewhere in the
long discussions we had misunderstandings that mixed up NFS failover issues with
this bugzilla's original problem statement. Correct me if I'm wrong, but the
issue described here is that

"If nfsd is restarted, lockd does not get into grace period" (let's set aside
server or client issues for a moment).

One question we have to ask is: why does lockd need to get into a grace period
in this case? Note that lock states are stored under lockd's context.
NFSD is stateless. When NFS lock requests come in, their state is saved in
the lists managed by lockd, then routed directly to the VFS and filesystem layers
where POSIX locks are handled. NFSD plays *no* role in this operation other
than handing out the File Handle in the earlier open calls. *However*, the File
Handle is (supposedly) consistent (i.e., re-usable) across different NFSD
restarts (it should even stay usable across reboots .. [see
foot-note-1]). So between nfsd restarts, lockd should be able to
handle locks correctly without getting into a grace period.

So before we go ahead and mess with the current design and lockd flow, I have
to ask: what is broken from the application's point of view if we don't "fix"
this issue (if we have anything to fix at all)?

Note that Linux lockd has been having issues with various things. We would
like to fix them. However, I doubt the problem described in this bugzilla is
one of them.

In short, we would like to know the "exact" problem the application will have
if we keep this "design".

[foot-note-1] A File Handle could become stale after a reboot, e.g. if the device
hosting the filesystem comes up with a different device ID between reboots. That,
however, is a well-known Linux NFS issue with an "official" workaround,
namely the "fsid" export option. So it is really not relevant to this
bugzilla at all.
[foot-note-2] In the case where nfsd is stopped, someone could try to umount
the filesystem (mistakenly thinking this is a clean way to do business).
However, if any lock is still held by an NFS client, the filesystem cannot be
umounted due to the lock reference count. So nothing will be broken here.
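
A small sketch of how to check foot-note-2 on a live system (the export path
/export is illustrative, not taken from this bug):

  cat /proc/locks   # lockd-held locks show up here while an NFS client holds them
  umount /export    # per the note above, this should fail with "device is busy" as long as such a lock is held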

Comment 60 Wendy Cheng 2007-07-03 18:31:28 UTC
In short, when nfsd is down and lockd is up, it implies you are still able to
do locking operations (as long as the filesystem is mounted), but you can't do
file IO operations. This logic is consistent with the design principle that
lockd and nfsd are separate kernel modules.

Of course when we go to NFS v4 this statement will no longer be valid. But I
assume we're talking about NFS v3 here?

Comment 61 Wendy Cheng 2007-07-03 18:52:48 UTC
Well, clicked the mouse too soon. Note that "file IO" includes the file "open"
call. So if you have files opened already, you probably can still take file
locks. If an application tries to open new files, that will fail if nfsd is
not there. So the problem described in the original problem statement should
be labelled as "working as designed".

The real issue here is to understand what kinds of difficulties the
applications will have if we don't change this "design".
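
A quick way to confirm the state being described, i.e. nfsd gone but lockd still
serving lock traffic (a sketch; thread names as they normally appear in ps):

  ps ax | grep '\[nfsd\]'      # no nfsd kernel threads once nfsd is stopped
  ps ax | grep '\[lockd\]'     # the lockd kernel thread is still there, kept alive by the local NFS client
  rpcinfo -p | grep nlockmgr   # the NLM service is still registered with the portmapper, so remote lock requests keep working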

Comment 62 Steve Dickson 2007-07-04 11:02:34 UTC
> One question we have to ask is: why does lockd need to get into a grace
> period in this case?
So clients can reclaim their locks, which will happen when statd
sends out an "I've just rebooted" message (i.e. SM_NOTIFY).

> Note that lock states are stored under lockd's context. NFSD is stateless.
Not quite; the state is stored on disk under statd's control.
The context of that state is stored under lockd.
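
For reference, the on-disk state mentioned here is statd's notify list (a sketch;
the directory is the usual nfs-utils default, /var/lib/nfs/statd on Red Hat builds,
/var/lib/nfs on some others):

  ls /var/lib/nfs/statd/sm/      # one entry per monitored peer; statd walks these to send SM_NOTIFY after a reboot
  cat /var/lib/nfs/statd/state   # the NSM state number, bumped on each statd restart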


Comment 64 Wendy Cheng 2007-07-04 14:11:42 UTC
And yes, we do have patches that allow an admin to manually

1. force lockd to drop locks,
2. force statd to send SM_NOTIFY to selected clients,
3. force lockd to get into a grace period.

However, it would be better to understand why the situation described in this
bugzilla needs lockd to enter a grace period. IMO, Linux statd (or NFS statd in
general) is not a reliable (protocol) implementation. We need to handle the
churning of these daemons with care.

Comment 65 Peter Martuccelli 2007-07-25 20:49:03 UTC
The RHEL 4.6 development window has closed; targeting R4.7 for any code changes.

Comment 66 Andrius Benokraitis 2007-07-26 13:23:42 UTC
Anyone from Veritas care to comment on this? There have been few comments from
the reporters on this. Is this still an issue?

Comment 67 Wendy Cheng 2007-08-01 21:11:23 UTC
If there is no update from the reporter for one more week, I would like to close
this as NOTABUG.

Comment 68 Andrius Benokraitis 2007-08-01 21:17:28 UTC
Ram - not sure if Kiran is still around. Please note Comment #67...

Comment 69 Wendy Cheng 2007-08-10 18:16:00 UTC
Closing this as NOTABUG. Note that lockd and nfsd are separate components.
When nfsd dies, there is no reason for lockd to restart.

If anyone needs to alter this behavior, please do pass along the rationale and
we'll discuss the possibility.

Comment 70 Issue Tracker 2007-08-16 21:53:56 UTC
I had updated this thread on 08/05; I guess my comment wasn't committed. So,
I had a 2-system configuration which acts as both server and client,
configured with RHEL4 U4. As wcheng mentioned, killing nfsd didn't have any
effect on lockd, and the client on the other system was able to hold on to its
locks. But I came across another issue: when I rebooted the server which
issued a lock to the client, after the server came back the client couldn't
reclaim its locks.


This event sent from IssueTracker by gcase 
 issue 69397

Comment 71 Wendy Cheng 2007-08-17 01:19:37 UTC
Gary, 

These are very *different* issues, since "reboot" and "manually bringing
down nfsd" are different things. In the reboot case, lockd *restarts*. In
theory, upon lockd restart, the system gets into a default (tunable) 90-second
grace period to allow clients to reclaim their locks.

"The client couldn't reclaim its locks" is a very vague statement. I
don't have a RHEL4 system handy. If you can recreate the issue in the lab, I
could take a look (give me a call tomorrow to get instructions if you want
to try this). Or ask your customer to

1. Turn on NLM debugging on *both* machines - in the 2.6 kernel (RHEL5),
   it can be done by "echo 32767 > /proc/sys/sunrpc/nlm_debug". RHEL4
   probably has the same thing.
2. Re-create the issue and send the /var/log/messages files from *both*
   systems over.

If the claim is true, then it is certainly a bug and we'll try to fix it.
A separate bugzilla would be more proper.

Comment 72 Wendy Cheng 2007-08-17 02:00:54 UTC
Gary.. sorry... I was occupied by another issue. Please add (3) to the above
list:

3. Make sure statd is running upon reboot (ps -ef | grep statd). If not,
   do a manual "service nfslock restart" and watch for any abnormal messages
   in the /var/log/messages file.
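
Pulling comments #71 and #72 together, the data collection on each machine would
look roughly like this (a sketch; it assumes the RHEL4 sysctl path matches the
RHEL5 one named above):

  echo 32767 > /proc/sys/sunrpc/nlm_debug   # (1) turn on NLM debugging
  ps -ef | grep statd                       # (3) confirm rpc.statd came back after the reboot
  service nfslock restart                   # ...and restart it manually if it did not
  # (2) re-create the failure, then capture /var/log/messages from *both* systems
  grep -i lockd /var/log/messages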

Comment 73 Wendy Cheng 2007-08-22 21:13:38 UTC
Latest update from Gary ... 

Gary Case wrote:

>
> After vcslinux119 comes up
> --------------------------
> Aug 21 19:29:47 vcslinux119 nfs: Starting NFS services:  succeeded
> Aug 21 19:29:47 vcslinux119 nfs: rpc.rquotad startup succeeded
> Aug 21 19:29:47 vcslinux119 kernel: Installing knfsd (copyright (C) 1996
okir.de).
> Aug 21 19:29:47 vcslinux119 nfs: rpc.nfsd startup succeeded
> Aug 21 19:29:47 vcslinux119 nfs: rpc.mountd startup succeeded
> Aug 21 19:29:48 vcslinux119 rpcidmapd: rpc.idmapd -SIGHUP succeeded
>
> =============== <Logs of vcslinux120>
> Aug 21 19:35:29 vcslinux120 kernel: lockd: server returns status 7
> Aug 21 19:35:29 vcslinux120 kernel: lockd: failed to reclaim lock for pid 6920 (errno 0, status 7)
> Aug 21 19:35:29 vcslinux120 kernel: lockd: release host 10.212.102.42
>

We got a stale file handle (status 7) ... check the filesystem device name
(/dev/sd-(something)) and make sure it comes up with the same name across the
reboot (e.g., if it was /dev/sdb before the reboot, it is /dev/sdb after the
reboot too).

It is hard to guarantee this on RHEL4. One way to avoid the issue is to use
the "fsid" option in the NFS exports. For example:

[root@dhcp148 linux-touch]# cat /etc/exports
/server *(fsid=1,async,rw)
/mnt/lockd/exports *(fsid=2,async,rw)
/sfs1/sfs1      *(fsid=3,async,rw)

If the customer needs a short tutorial on "fsid" usage, let me know.

-- Wendy 
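
As a follow-up sketch of how the fsid change above would be applied (plain
exportfs usage; the fsid values are only examples and must be unique per export):

  vi /etc/exports   # add a unique fsid=<n> to each export line, as in the example above
  exportfs -ra      # re-export everything so the new options take effect
  exportfs -v       # verify the exports picked up the new options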

