Bug 449903
| Summary: | nfsd not stopping for ha-linux heartbeat | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Tuomo Soini <tis> |
| Component: | nfs-utils | Assignee: | Steve Dickson <steved> |
| Status: | CLOSED WONTFIX | QA Contact: | yanfu,wang <yanwang> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 5.5 | CC: | bfields, fleite, jlayton, notting, sean, sprabhu, tis |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-01-21 20:25:18 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Description
Tuomo Soini
2008-06-04 08:19:42 UTC
Created attachment 308328
Patch which seems to fix stopping nfsd
hmm... ther

Changed to 5.4 because this bug is still there...

Changed to 5.5 because the bug is still present. The nfs init script is not working: it's not properly stopping the nfs service.

This bug is still present in CentOS 5.4. I ran into it today, and was lucky enough to get help with it.

Looking at fs/nfsd/nfssvc.c:nfsd()... SIGTERM seems to be ignored by nfsd, so it's not surprising that killproc -2 doesn't work. (I'm not sure why it ever has.) But: why not just use "rpc.nfsd 0" to shut down nfsd?

Look at the killproc init function in /etc/rc.d/init.d/functions: when the -2 argument is given, the rpc.nfsd will be killed with a SIGINT (2), which is one of the SHUTDOWN_SIGS nfsd looks for. SIGTERM is not one of the SHUTDOWN_SIGS, so I believe nfsd ignores it, which is the reason we added the -2. killproc is the "approved" way of bringing down daemons, and it also knows how to graphically show the status of whether the daemon was or was not brought down...

Doing a little experimenting, it turns out that doing the killproc w/out the -2 does work but takes significantly longer, because killproc will first kill all the processes with SIGTERM, which is ignored. Then killproc, after 3 seconds, noticing the processes are still alive, will kill them with a SIGKILL, which does bring them down. With killproc -2, all the processes come down immediately.

Apologies, Steve, I had TERM and INT backwards--you're correct! So something strange is going on. I wonder if the extra delay that you describe covers up some race which the reporter is seeing? Confirmed that "service nfs stop" works fine, but I haven't tried to set up heartbeat yet.

> There seems to be 2 bugs,
> 1) The init script returns saying the nfsd has been stopped successfully even
> when it has not stopped.
I was not able to reproduce this with the latest nfs-utils.
> 2) The nfsd is not responding to the killproc nfs -2, in Rhel 4.5 the init
> script did not use -2 and it worked OK
Again, in later releases of RHEL4, the killproc nfsd -2 is used to bring down the nfsd processes.

You don't seem to understand the problem. The problem is that if you define a signal for killproc, killproc will only send the requested signal and won't wait for the pid to vanish before exiting. When signal -2 is given to killproc, it will send the signal and exit. That does not mean nfsd was actually taken down, so if you run: /sbin/service nfsd stop ; /sbin/service nfsd status, you'll see nfsd still running because it did not exit yet! So it's vital to wait for nfsd to exit before exiting the script. killproc without arguments tries to make sure the program really did exit before killproc returns. That's the problem.

Oh, and there is special handling for pid file removal with killproc without a signal argument. If a signal is needed, it needs to be followed by a normal run of killproc without the extra argument, or you need to copy the killproc functionality into the initscript.

> /sbin/service nfsd stop ; /sbin/service nfsd status, you'll see nfsd still
> running because it did not exit yet!
If one of the nfsd processes is busy... this is normal and expected...
> Oh. and there is special handling for pid file removal with killproc without
> signal argument.
Right... killproc does a kill -SIGTERM (which is ignored), then does a kill -SIGKILL if the processes are still alive..
> If signal is needed it need to be followed by normal run of killproc without
> extra argument or you need to copy killproc functionality into initscript.
How would the script know the ha-linux heartbeat is even up and running? This type of functionality is only needed when the heartbeat is active, correct?

No. It's required functionality for any script. If the stop action finishes, the service must be fully stopped.

OK, makes sense. So it sounds like our choices are:
- Teach nfsd to stop ignoring SIGTERM.
- Teach killproc to send something other than SIGTERM initially, while still behaving as it does without a signal specified on the command line.
- First do a killall -2 nfsd, then a killproc? (Is that safe?)
Or maybe we should do an "rpc.nfsd 0", then a killproc?

> - Teach nfsd to stop ignoring SIGTERM.
Just because nfsd gets a signal does not mean the
process will die immediately. What happens if
nfsd is tied up in some type of I/O? The process
will hang until the I/O is over, even with
a SIGKILL, true?
I think we need something like this:
--- nfs.init 2009-04-24 15:26:27.000000000 -0400
+++ /tmp/nfs 2010-09-20 14:58:36.000000000 -0400
@@ -23,7 +23,21 @@
[ -z "$RQUOTAD" ] && RQUOTAD=`type -path rpc.rquotad`
RETVAL=0
-
+waittodie () {
+ pids=$1
+ alive=1
+
+ while [ -n "$pids" -a $alive -eq 1 ]
+ do
+ alive=0
+ for p in $pids
+ do
+ if checkpid $pid; then
+ alive=1
+ fi
+ done
+ done
+}
# See how we were called.
case "$1" in
start)
@@ -120,6 +134,8 @@ case "$1" in
echo
echo -n $"Shutting down NFS daemon: "
killproc nfsd -2
+ waittodie nfsd
+
echo
if [ -n "$RQUOTAD" -a "$RQUOTAD" != "no" ]; then
echo -n $"Shutting down NFS quotas: "
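
A note on the waittodie helper in the patch above: as posted, the inner test uses $pid while the loop variable is $p, and the call site passes the name nfsd rather than a list of pids. The sketch below is one possible corrected form, not the change that was proposed or shipped. It assumes checkpid behaves like the helper of the same name in /etc/rc.d/init.d/functions (it succeeds while a /proc/<pid> entry exists); the stand-in checkpid defined here is only there to keep the example self-contained, and the one-second pause between polls is an addition.

```shell
#!/bin/bash
# Stand-in for checkpid (assumed behaviour: succeed if any of the given
# pids still has an entry under /proc).
checkpid() {
    local i
    for i in "$@"; do
        if [ -n "$i" ] && [ -d "/proc/$i" ]; then
            return 0
        fi
    done
    return 1
}

# Block until none of the given pids is alive. Returns at once when no
# pids are given. There is no upper bound on the wait in this version.
waittodie() {
    local pids="$*" alive=1 p
    while [ -n "$pids" ] && [ "$alive" -eq 1 ]; do
        alive=0
        for p in $pids; do
            # Test the loop variable $p (the posted patch tested $pid).
            if checkpid "$p"; then
                alive=1
            fi
        done
        # Pause between polls instead of spinning on /proc.
        if [ "$alive" -eq 1 ]; then
            sleep 1
        fi
    done
    return 0
}
```

In an init script the pids would have to be looked up first, for example waittodie $(pidof nfsd); whether pidof reports the nfsd kernel threads on a given system is an assumption to verify.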
The right thing to do here is fix your scripts to run 'rpc.nfsd 0' and then wait for the nfsd threads to go down. Signal handling with kthreads is tricky business and is best avoided.

Looking at the definition of killproc, it appears to already do something like that. So, yeah, probably dropping the -2 and adding an 'rpc.nfsd 0' before it is the way to go.

The problem I see with using 'rpc.nfsd 0' is that we will be creating a race... 'rpc.nfsd 0 ; killproc nfsd' will cause killproc to report a failure since there are not any nfsd processes to kill... similar to 'service nfs stop ; service nfs stop'.

I don't understand the comment about "Signal handling with kthreads..." All waittodie does is use the standard init tools to ensure all the nfsd processes are gone...

Ok.. I just verified that doing 'rpc.nfsd 0 ; killproc nfsd' does indeed cause the script to fail, because killproc cannot find any nfsd process...

cc-ing Bill Nottingham.

Bill, is there anything in the init functions that would guarantee processes are killed before the init script moves on? Would you consider adding a function like waittodie() (see Comment 19) to /etc/rc.d/init.d/functions?

I stand corrected. I forgot that svc_set_num_threads still uses send_sig() to bring down the threads:

    /* destroy old threads */
    while (nrservs < 0 &&
           (task = choose_victim(serv, pool, &state)) != NULL) {
        send_sig(SIGINT, task, 1);
        nrservs++;
    }

...so sending a SIGINT from userspace should be fine. IIRC we kept using signals there so that we could bring down the threads in parallel. kthread_stop is synchronous, and we didn't want to wait on them to come down one at a time.

Still though, the most "future proof" method is probably to do a "rpc.nfsd 0" and just wait for the threads to come down using something like Steve's waittodie().

I'd make some changes to the waittodie function.
1. There must be some limit on how long to wait.
2. When the process is still running when the timeout happens, I'd add a
killproc nfsd -9 || :
run to make absolutely sure the process is getting stopped.

> 1. There must be some limit how long to wait for.
There is no guarantee on how long it can take for any pending IO to complete and then unblock a disk-sleeping nfsd. So this limit should be established accepting the fact that it may still be running (though likely to stop as soon as it can process signals).
> 2. When process is still running when timeout happens, I'd add
> killproc nfsd -9 || :
> run to make absolutely sure process is getting stopped.
The problem is that it doesn't make it absolutely sure. Again, waiting on some stalled IO can take long, and the process will not answer to kill -9.
Perhaps one should just assume that sometimes the service will not stop right away and work with that fact on the rest of the system.

nfsd is a kernel thread -- sending it a SIGKILL isn't any more of a "sure kill" than a SIGINT.

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.

(In reply to comment #26)
> > 1. There must be some limit how long to wait for.
>
> There is no guarantee on how long it can take for any pending IO to complete
> and then unblock a disk-sleeping nfsd. So this limit should be established
> accepting the fact it may still be running (though likely to stop as soon as it
> can process signals).
>
> > 2. When process is still running when timeout happens, I'd add
> > killproc nfsd -9 || :
> > run to make absolutely sure process is getting stopped.
>
> The problem is that it doesn't make it absolutely sure. Again, waiting on some
> stalled IO can take long and will not answer to kill -9.
>
> Perhaps one should just assume that sometimes the service will not stop right
> away and work with that fact on the rest of the system.
I agree with this... This is the way it has worked for a long time now... Making these types of changes this late in the release is not a good idea... IMHO..

Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
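
Several comments above converge on the same sequence: ask nfsd to stop, wait for the threads to go away, but only for a bounded time, and accept that a thread stuck in I/O may outlive the wait. A minimal sketch of such a bounded wait follows. wait_for_exit and its arguments are hypothetical names, not part of /etc/rc.d/init.d/functions, and the one-second poll interval is arbitrary.

```shell
#!/bin/bash
# Hypothetical helper: wait up to $1 seconds for the pids given in $2..
# to disappear from /proc.
# Returns 0 once none is left (or none was given), 1 if at least one is
# still present when the time limit is reached.
wait_for_exit() {
    local timeout=$1 waited=0 alive p
    shift
    while :; do
        alive=0
        for p in "$@"; do
            if [ -n "$p" ] && [ -d "/proc/$p" ]; then
                alive=1
            fi
        done
        if [ "$alive" -eq 0 ]; then
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Possible use in a stop action (assumption: pidof lists the nfsd threads):
#   pids=$(pidof nfsd)
#   rpc.nfsd 0
#   wait_for_exit 30 $pids || echo "nfsd still running after 30s"
```

Collecting the pids before calling rpc.nfsd 0 sidesteps the failure reported above, where killproc finds nothing left to kill, and an empty pid list counts as already stopped. A non-zero return only reports that something is still running; as noted in the thread, a thread blocked on I/O may not respond to any signal until that I/O completes.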