Description of problem:
Consider a system, system1, which acts as an NFS server.
It also acts as an NFS client for some other machines.
If nfsd on system1 is killed, lockd still remains, as
it is required by the NFS client on system1.
After some time, if nfsd is restarted on system1
(lockd is already running), lockd does not enter the grace period.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Have a machine (system1) which acts as an NFS server.
It also acts as a client for some other NFS server.
2. Allow clients of system1 to get some locks.
3. Kill nfsd on system1.
4. After some time (this interval should be greater than the grace period
of lockd), restart nfsd on system1.
(As lockd is already running, it won't be restarted.)
lockd does not enter the grace period.
Lock reclaiming may not occur as expected, as lockd
does not enter the grace period.
When nfsd restarts, lockd should enter the grace period
irrespective of whether or not lockd restarts (i.e. it
could already be running).
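The reproduction steps above can be sketched as a shell sequence. This is only an illustration: the service names and the 90-second default grace period are assumptions based on a typical RHEL4 setup, and the commands need root on a live NFS server.

```shell
# On system1 (NFS server that is also an NFS client of another machine):

# 1. Clients of system1 take some locks, then stop only nfsd.
#    lockd stays up because the local NFS client still needs it.
service nfs stop          # stops nfsd/mountd, but not lockd

# 2. Wait longer than lockd's grace period (90 seconds by default).
sleep 120

# 3. Restart nfsd. lockd is already running, so it is not restarted
#    and never re-enters its grace period.
service nfs start

# Clients now have no grace window in which to reclaim their locks.
```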
How are you determining that the locks have entered their grace period? Are you
looking for SM_NOTIFY RPC transactions on the network? nfsd just implements the
stateless NFS protocol. I think you need to kill rpc.statd and rpc.lockd to
send the NSM into its grace period, and have those SM_NOTIFY transactions sent.
Can you re-run your test by killing rpc.statd and rpc.lockd, to see if you get
the same results? Thanks.
1. It's not possible to kill lockd, as the same machine acts as both NFS client and NFS
server.
2. lockd is used by both the NFS client and server.
3. I am looking at log messages (with full debugging on), tethereal output
(for network transactions) and the server-side /proc/locks file to check whether
lockd is entering the grace period or not.
Ideally, lockd should enter the grace period whenever nfsd restarts.
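The three checks described above can be sketched as follows. This is a sketch: the capture interface name is a placeholder, and the exact debug output format varies by kernel version.

```shell
# Watch NSM/NLM traffic on the wire (SM_NOTIFY, NLM reclaim calls).
# "eth0" is a placeholder for the server's network interface.
tethereal -i eth0 -f "port 111 or port 2049" &

# Server-side lock state: each lock an NFS client holds through
# lockd shows up as a POSIX lock entry in this file.
cat /proc/locks

# With full NLM debugging enabled, grace-period activity (e.g. lockd
# deferring non-reclaim requests) appears in the syslog.
tail -f /var/log/messages | grep -i lockd
```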
How are you killing nfsd? Are you running "service nfs stop"? Are you sending it
a signal via kill manually? If so, which signal?
I've done some research, and simply restarting nfsd is not necessarily enough
for lockd to enter a grace period. Only if a filesystem is unexported will
lockd enter a grace period for locks held on those exports. Depending on how
nfsd is stopped or killed, those filesystems may or may not be unexported.
This problem is no longer there.
It existed only on RHEL4 RC1.
It's fixed in RHEL4 GA.
Action for Symantec: Is this issue still exhibited in the RHEL4 U4 beta?
This problem was fixed in RHEL4 GA. Do you still want me to verify this on
RHEL4 U4? If yes, I will verify and post the result by Monday.
Wow, so why is this issue still open, then? Are there any actions for Red
Hat on this, or can this be closed? Thanks!
Yes, this can be closed (as per my testing on RHEL4 GA)
Sorry for the confusion. After going through issue 69397, I realized that
this problem reappeared on RHEL4 U2 GA. I will verify whether this still exists
and will send an update.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this enhancement by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This enhancement is not yet committed for inclusion in an Update
release.
Development Management has reviewed and declined this request. You may appeal
this decision by reopening this request.
Re-examine this bugzilla for the RHEL 4.6 update. Apparently, somewhere in the long
discussions, we had misunderstandings that mixed up NFS failover issues with
this bugzilla's original problem statement. Correct me if I'm wrong, but the
issue described here is that
"If nfsd is restarted, lockd does not get into grace period" (let's set aside
server or client issues for a moment).
One question we have to ask, that is, "why does lockd need to get into grace
period in this case"? Note that lock states are stored under lockd context.
NFSD is stateless. When NFS lock requests come in, their state is saved in
the lists managed by lockd, then directly routed to the VFS and filesystem layers
where POSIX locks are handled. NFSD plays *no* role in these operations other
than handing out the File Handle in the earlier open calls. *However*, the File
Handle is (supposedly) consistent (i.e., re-usable) across different NFSD
restarts (it should even stay usable across reboots .. [see
foot-note-1]). So between different nfsd restarts, lockd should be able to
handle locks correctly without getting into a grace period.
So before we go ahead and mess with the current design and lockd flow, I have
to ask: "what is broken, from the application's point of view, if we don't 'fix' this
issue" (if we have anything to fix at all)?
Note that Linux lockd has been having issues with various things. We would
like to fix them. However, I doubt the problem described in this bugzilla is
one of them.
In short, we would like to know the "exact" problem the application will have
if we keep this "design".
[foot-note-1] The File Handle could become stale after a reboot - e.g., the device
hosting the filesystem ends up with different device IDs between reboots. It
is, however, a well-known Linux NFS issue that has an "official" workaround -
that is, use the "fsid" export option. So this is really not relevant to this
bugzilla at all.
[foot-note-2] In the case where nfsd is stopped, someone could try to umount
the filesystem (mistakenly thinking this is a clean way to do business).
However, if there is any lock still held by an NFS client, the filesystem would
not be able to get umounted due to the lock reference count. So nothing will be
lost.
In short, when nfsd is down and lockd is up, it implies you are still able to
do locking operations (as long as the filesystem is mounted) but you can't do
file IO operations. This logic is consistent with the design principles where
lockd and nfsd are separate kernel modules.
Of course, when we go to NFS V4, this statement will no longer be valid. But I
assume we're talking about NFS V3 here?
Well, clicked the mouse too soon. Note that "file IO" includes the file "open" call.
So if you have files opened already, you probably can still do file locks.
If the applications try to open new files, that will fail if nfsd is not there.
So the problem described in the original problem statement should be labelled
as "working as designed".
The real issue here is to understand what kinds of difficulties the applications
will have if we don't change this "design".
> One question we have to ask, that is, "why does lockd need to get into grace
> period in this case" ?
So clients can reclaim their locks, which will happen when statd
sends out an "I've just rebooted" message (i.e. SM_NOTIFY).
> Note that lock states are stored under lockd context. NFSD is stateless.
Not quite; the state is stored on disk under statd's control.
The context of that state is stored under lockd.
And yes, we do have patches that allow an admin to manually
1. force lockd to drop locks
2. force statd to send SM_NOTIFY to selected clients
3. force lockd to get into grace period.
However, it would be better to understand why the situation described in this
bugzilla needs lockd to enter a grace period. IMO, Linux statd (or NFS statd in
general) is not a reliable (protocol) implementation. We need to handle the
churning of these daemons with care.
RHEL 4.6 development window has closed, target R4.7 for any code changes.
Anyone from Veritas care to comment on this? There have been few comments from
the reporters on this. Is this still an issue?
If there is no update from the reporter for one more week, I would like to close this as
a NON-BUG.
Ram - not sure if Kiran is still around. Please note Comment #67...
Close this as a NON-BUG. Note that lockd and nfsd are separate components.
When nfsd dies, there is no reason for lockd to restart.
If anyone needs to alter this behavior, please do pass along the rationale and
we'll discuss the possibility.
I had updated this thread on 08/05; I guess my comment
wasn't committed. So, I had a 2-system configuration which acts as both
server and client, configured with RHEL4 U4. As wcheng mentioned, killing
nfsd didn't have any effect on lockd, and the client on the other system was
able to hold on to its locks. But I came across another issue: when I
rebooted the server which issued a lock to the client, after the server
came back, the client couldn't reclaim its locks.
This event sent from IssueTracker by gcase
These are very *different* issues, since "reboot" and "manually bringing
down nfsd" are different things. In the reboot case, lockd *restarts*. In
theory, upon lockd restart, the system gets into a default (tunable)
90-second grace period to allow clients to reclaim their locks.
"The client couldn't reclaim its locks" is a very vague statement. I
don't have a RHEL4 system handy. If you can recreate the issue in the lab, I
could take a look (if you want to try this, give me a call tomorrow
for instructions). Or ask your customer to:
1. Turn on NLM debugging on *both* machines - in the 2.6 kernel (RHEL5),
it can be done by "echo 32767 > /proc/sys/sunrpc/nlm_debug". RHEL4
probably has the same thing.
2. Re-create the issue and send the /var/log/messages files from *both*
machines.
If the claim is true, then it is certainly a bug and we'll try to fix it.
A separate bugzilla would be more appropriate.
Gary.. sorry... I was occupied by another issue. Please add (3) to the list above:
3. Make sure statd is running after the reboot (ps -ef | grep statd). If not,
do a manual "service nfslock restart" and watch to see whether there
is any abnormal message in the /var/log/messages file.
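The three steps above can be sketched as a shell sequence. The nlm_debug sysctl path comes from the earlier comment; the log-collection step is a generic illustration.

```shell
# (1) Enable NLM debugging on *both* machines (2.6 kernel).
echo 32767 > /proc/sys/sunrpc/nlm_debug

# (2) Re-create the issue, then save the log from each machine
#     for comparison.
cp /var/log/messages /tmp/messages.$(hostname)

# (3) Confirm statd survived the reboot; if not, restart the lock
#     services and watch the log for abnormal messages.
ps -ef | grep '[s]tatd' || service nfslock restart
tail -n 50 /var/log/messages
```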
Latest update from Gary ...
Gary Case wrote:
> After vcslinux119 comes up
> Aug 21 19:29:47 vcslinux119 nfs: Starting NFS services: succeeded
> Aug 21 19:29:47 vcslinux119 nfs: rpc.rquotad startup succeeded
> Aug 21 19:29:47 vcslinux119 kernel: Installing knfsd (copyright (C) 1996
> Aug 21 19:29:47 vcslinux119 nfs: rpc.nfsd startup succeeded
> Aug 21 19:29:47 vcslinux119 nfs: rpc.mountd startup succeeded
> Aug 21 19:29:48 vcslinux119 rpcidmapd: rpc.idmapd -SIGHUP succeeded
> Aug 21 19:35:29 vcslinux120 kernel: lockd: server returns status 7
> Aug 21 19:35:29 vcslinux120 kernel: lockd: failed to reclaim lock for pid 6920 (errno 0, status 7)
> Aug 21 19:35:29 vcslinux120 kernel: lockd: release host 10.212.102.42
We got a stale file handle (status 7) ... check the filesystem device name
(/dev/sd-(something)) and make sure it comes up with the same name across
reboots (e.g., if before the reboot it is /dev/sdb, after the reboot it is /dev/sdb too).
It is hard to ensure this happens on RHEL4. One way to avoid this issue
is using the "fsid" option in the NFS export. For example:
[root@dhcp148 linux-touch]# cat /etc/exports
If the customer needs a short tutorial on "fsid" usage, let me know.
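As a minimal illustration of the "fsid" option (this is not the original output of the cat command above; the export path and client spec are placeholders):

```shell
# /etc/exports - pin a stable filesystem identity with fsid so that
# NFS file handles survive device renames across reboots.
# "/export/data" and the "*" client spec are placeholders.
/export/data  *(rw,sync,fsid=1)
```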