Bug 449903

Summary:

nfsd not stopping for ha-linux heartbeat

Product:

Red Hat Enterprise Linux 5

Reporter:

Tuomo Soini <tis>

Component:

nfs-utils

Assignee:

Steve Dickson <steved>

Status:

CLOSED WONTFIX

QA Contact:

yanfu,wang <yanwang>

Severity:

medium

Priority:

low

Version:

5.5

CC:

bfields, fleite, jlayton, notting, sean, sprabhu, tis

Hardware:

x86_64

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2011-01-21 20:25:18 UTC

Attachments:

Description	Flags
Patch which seems to fix stopping nfsd	none

Description Tuomo Soini 2008-06-04 08:19:42 UTC

+++ This bug was initially created as a clone of Bug #395511 +++

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.8)
Gecko/20071008 Firefox/2.0.0.8

Description of problem:
When I run /etc/init.d/nfs stop from the command line, nfsd starts and stops OK.

When I use ha-linux heartbeat to control nfs, nfsd does not stop when told
to by the init script.

There seem to be two bugs:
1) The init script returns saying that nfsd has been stopped successfully even
when it has not stopped.
2) nfsd is not responding to killproc nfsd -2. In RHEL 4.5 the init
script did not use -2 and it worked OK.

Version-Release number of selected component (if applicable):


How reproducible:
Always


Steps to Reproduce:
1. Start nfs using heartbeat
2. Stop nfs using heartbeat

Actual Results:


Expected Results:


Additional info:

-- Additional comment from tis on 2007-12-31 09:01 EST --
Actually, this same bug affects RHEL 4.6 too. The -2 kill option was added in the updated
nfs-utils package, and it is unable to stop nfsd, just like the 5.1 version.

This problem is not only with linux-ha heartbeat. service nfs stop will not stop
nfsd at all!

Comment 1 Tuomo Soini 2008-06-04 08:22:47 UTC

Created attachment 308328 [details]
Patch which seems to fix stopping nfsd

Comment 2 Steve Dickson 2008-06-05 19:49:04 UTC

hmm... ther

Comment 4 Tuomo Soini 2009-10-14 20:15:40 UTC

Changed to 5.4 because this bug is still there...

Comment 7 Tuomo Soini 2010-03-31 22:01:49 UTC

Changed to 5.5 because the bug is still present. The nfs init script is not working: it does not properly stop the nfs service.

Comment 8 densone 2010-03-31 22:25:28 UTC

This bug is still present in CentOS 5.4. I ran into it today, and was lucky enough to get help with it.

Comment 10 J. Bruce Fields 2010-09-20 14:51:55 UTC

Looking at fs/nfsd/nfssvc.c:nfsd()... SIGTERM seems to be ignored by nfsd, so it's not surprising that killproc -2 doesn't work.  (I'm not sure why it ever has.)

But: why not just use "rpc.nfsd 0" to shut down nfsd?

Comment 11 Steve Dickson 2010-09-20 16:46:17 UTC

Look at the killproc init function in /etc/rc.d/init.d/functions.
When the -2 argument is given, rpc.nfsd will be killed with a
SIGINT (2), which is one of the SHUTDOWN_SIGS nfsd looks for.

SIGTERM is not one of the SHUTDOWN_SIGS, so I believe nfsd
ignores it, which is the reason we added the -2.

killproc is the "approved" way of bringing down daemons,
and it also knows how to graphically show the status
of whether the daemon was or was not brought down...

Doing a little experimenting, it turns out that
calling killproc without the -2 does work but takes
significantly longer, because killproc will first kill
all the processes with SIGTERM, which is ignored. Then
killproc, after 3 seconds, noticing the processes are
still alive, will kill them with a SIGKILL, which does
bring them down.

While with killproc -2 all the processes come down immediately.
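
The two killproc modes described in this comment can be modelled roughly as follows. This is a simplified sketch, not the code from /etc/rc.d/init.d/functions: `model_killproc`, `is_alive` and the `delay` parameter are names made up for illustration, and `kill -0` stands in for the `checkpid` helper.

```shell
# Hypothetical stand-in for checkpid: true while the PID still exists.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

# model_killproc PID [SIGNAL] [DELAY]
#   With SIGNAL (for example -2): send it once and return the result of kill.
#   Without SIGNAL: send TERM, wait DELAY seconds, then KILL if still alive.
model_killproc() {
    local pid="$1" sig="$2" delay="${3:-3}"

    if [ -n "$sig" ]; then
        # Explicit signal: success only means the signal was delivered.
        kill "$sig" "$pid" 2>/dev/null
        return $?
    fi

    # Default mode: TERM first, escalate to KILL after the delay.
    kill -TERM "$pid" 2>/dev/null
    sleep "$delay"
    if is_alive "$pid"; then
        kill -KILL "$pid" 2>/dev/null
        sleep 1
    fi

    # Success here means the process is really gone.
    ! is_alive "$pid"
}
```

In this model, a process that ignores SIGTERM always costs the full delay in default mode, while the explicit-signal mode returns right away whether or not the process has exited yet.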

Comment 12 J. Bruce Fields 2010-09-20 17:11:15 UTC

Apologies, Steve, I had TERM and INT backwards--you're correct!

So something strange is going on.  I wonder if the extra delay that you describe covers up some race which the reporter is seeing?

Confirmed that "service nfs stop" works fine, but haven't tried to set up heartbeat yet.

Comment 13 Steve Dickson 2010-09-20 17:15:47 UTC

> There seem to be two bugs:
> 1) The init script returns saying that nfsd has been stopped successfully even
> when it has not stopped.
I was not able to reproduce this with the latest nfs-utils.

> 2) nfsd is not responding to killproc nfsd -2. In RHEL 4.5 the init
> script did not use -2 and it worked OK.
Again, in later releases of RHEL 4, killproc nfsd -2 is used to bring
down the nfsd processes.

Comment 14 Tuomo Soini 2010-09-20 17:47:06 UTC

You don't seem to understand the problem. The problem is that if you specify a signal for killproc, killproc will only send the requested signal and won't wait for the pid to vanish before exiting. So when signal -2 is given to killproc, it will send the signal and exit.

That does not mean nfsd was actually taken down, so if you run:

/sbin/service nfs stop ; /sbin/service nfs status, you'll see nfsd still running because it has not exited yet!

So it's vital to wait for nfsd to exit before exiting the script.

killproc without a signal argument tries to make sure the program really did exit before killproc returns.

That's the problem.
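
A small sketch of the race described here, using a stand-in process instead of nfsd. The stand-in and the variable names are made up for illustration. It is signalled with SIGTERM only because background jobs of a non-interactive shell ignore SIGINT; the init script itself sends SIGINT (-2).

```shell
# Stand-in daemon: after being signalled it needs about a second to wind down.
( trap 'sleep 1; exit 0' TERM; while :; do sleep 0.1; done ) &
pid=$!
sleep 0.5                # give the stand-in time to install its handler

# "stop": the signal is sent and the caller moves on immediately.
kill -TERM "$pid"
if kill -0 "$pid" 2>/dev/null; then
    after_signal="still running"
else
    after_signal="stopped"
fi

# What the stop action should do before returning: wait for the exit.
wait "$pid"
if kill -0 "$pid" 2>/dev/null; then
    after_wait="still running"
else
    after_wait="stopped"
fi

echo "right after the signal: $after_signal"
echo "after waiting:          $after_wait"
```

A status check issued right after the signal sees the process alive, which is what a cluster manager calling stop followed by status would observe.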

Comment 15 Tuomo Soini 2010-09-20 17:54:08 UTC

Oh, and there is special handling for pid file removal in killproc when no signal argument is given.

If a signal is needed, it needs to be followed by a normal run of killproc without the extra argument, or you need to copy the killproc functionality into the init script.

Comment 16 Steve Dickson 2010-09-20 18:12:05 UTC

> /sbin/service nfs stop ; /sbin/service nfs status, you'll see nfsd still
> running because it has not exited yet!
If one of the nfsd processes is busy... this is normal and expected...


> Oh, and there is special handling for pid file removal in killproc when no
> signal argument is given.
Right... killproc does a kill -SIGTERM (which is ignored), then does
a kill -SIGKILL if the processes are still alive.

> If a signal is needed, it needs to be followed by a normal run of killproc without
> the extra argument, or you need to copy the killproc functionality into the init script.
How would the script know that ha-linux heartbeat is even up
and running? This type of functionality is only needed when heartbeat is
active, correct?

Comment 17 Tuomo Soini 2010-09-20 18:17:34 UTC

No. It's required functionality for any script.

If the stop action finishes, the service must be fully stopped.

Comment 18 J. Bruce Fields 2010-09-20 18:32:29 UTC

OK, makes sense.

So it sounds like our choices are:

- Teach nfsd to stop ignoring SIGTERM.
- Teach killproc to send something other than SIGTERM initially, while still behaving as it does without a signal specified on the commandline.
- First do a killall -2 nfsd, then a killproc?  (Is that safe?)

Or maybe we should do an "rpc.nfsd 0", then a killproc?

Comment 19 Steve Dickson 2010-09-20 18:59:06 UTC

> - Teach nfsd to stop ignoring SIGTERM.
Just because nfsd gets a signal does not mean the
process will die immediately. What happens if
nfsd is tied up in some type of I/O? The process
will hang until the I/O is over, even with
a SIGKILL, true?

I think we need something like this:

--- nfs.init	2009-04-24 15:26:27.000000000 -0400
+++ /tmp/nfs	2010-09-20 14:58:36.000000000 -0400
@@ -23,7 +23,21 @@
 [ -z "$RQUOTAD" ] && RQUOTAD=`type -path rpc.rquotad`
 
 RETVAL=0
-
+waittodie () {
+	pids=$1
+	alive=1
+
+	while [ -n "$pids" -a $alive -eq 1 ]
+	do
+		alive=0
+		for p in $pids
+		do
+			if checkpid $p; then
+				alive=1
+			fi
+		done
+	done
+}
 # See how we were called.
 case "$1" in
   start)
@@ -120,6 +134,8 @@ case "$1" in
 	echo
 	echo -n $"Shutting down NFS daemon: "
 	killproc nfsd -2
+	waittodie "$(pidof nfsd)"
+
 	echo
 	if [ -n "$RQUOTAD" -a "$RQUOTAD" != "no" ]; then
 		echo -n $"Shutting down NFS quotas: "

Comment 20 Jeff Layton 2010-09-20 19:12:58 UTC

The right thing to do here is fix your scripts to run 'rpc.nfsd 0' and then wait for the nfsd threads to go down. Signal handling with kthreads is tricky business and is best avoided.

Comment 21 J. Bruce Fields 2010-09-20 19:18:18 UTC

Looking at the definition of killproc, it appears to already do something like that.

So, yeah, probably dropping the -2 and adding an 'rpc.nfsd 0' before it is the way to go.

Comment 22 Steve Dickson 2010-09-21 12:19:36 UTC

The problem I see with using 'rpc.nfsd 0' is that we will be creating
a race... 'rpc.nfsd 0 ; killproc nfsd' will cause killproc to
report a failure since there are no nfsd processes to kill,
similar to 'service nfs stop ; service nfs stop'.

I don't understand the comment about "Signal handling with kthreads..."
All waittodie does is use the standard init tools to
ensure all the nfsd processes are gone...

Comment 23 Steve Dickson 2010-09-22 13:51:26 UTC

OK... I just verified that doing 'rpc.nfsd 0 ; killproc nfsd' does
indeed cause the script to fail, because killproc cannot
find any nfsd processes...

cc-ing Bill Nottingham

Bill, is there anything in the init functions that would guarantee
processes are killed before the init script moves on?

Would you consider adding a function like waittodie() (see Comment 19)
to /etc/rc.d/init.d/functions?

Comment 24 Jeff Layton 2010-09-22 14:19:49 UTC

I stand corrected. I forgot that svc_set_num_threads still uses send_sig() to bring down the threads:

        /* destroy old threads */
        while (nrservs < 0 &&
               (task = choose_victim(serv, pool, &state)) != NULL) {
                send_sig(SIGINT, task, 1);
                nrservs++;
        }

...so sending a SIGINT from userspace should be fine.

IIRC we kept using signals there so that we could bring down the threads in parallel. kthread_stop is synchronous, and we didn't want to wait on them to come down one at a time.

Still though, the most "future proof" method is probably to do a "rpc.nfsd 0" and just wait for the threads to come down using something like Steve's waittodie().
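
A rough sketch of the "rpc.nfsd 0, then wait" approach. It assumes the nfsd filesystem is mounted and that /proc/fs/nfsd/threads reads 0 once the threads are down. `stop_nfsd_threads`, `RPC_NFSD` and `THREADS_FILE` are hypothetical names, the latter two existing only so the sketch can be tried without a real NFS server.

```shell
# Ask for zero nfsd threads, then poll the thread count until it reads 0.
# Returns 0 once the count is 0, non-zero if rpc.nfsd fails or time runs out.
stop_nfsd_threads() {
    local timeout="${1:-30}" waited=0
    local threads_file="${THREADS_FILE:-/proc/fs/nfsd/threads}"

    "${RPC_NFSD:-rpc.nfsd}" 0 || return 1

    while [ "$waited" -lt "$timeout" ]; do
        [ "$(cat "$threads_file" 2>/dev/null)" = "0" ] && return 0
        sleep 1
        waited=$((waited + 1))
    done

    # Timed out: report failure instead of claiming the service stopped.
    [ "$(cat "$threads_file" 2>/dev/null)" = "0" ]
}
```

Because nothing is looked up by process name here, there is no failure report when the threads are already gone by the time the check runs.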

Comment 25 Tuomo Soini 2010-09-22 14:39:55 UTC

I'd make some changes to the waittodie function.

1. There must be some limit on how long to wait.

2. If the process is still running when the timeout expires, I'd add a

killproc nfsd -9 || :

run to make absolutely sure the process gets stopped.
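
A minimal sketch of a bounded wait along these lines. `wait_for_pids` is a hypothetical name, and `kill -0` is used for the liveness test where the init functions would use `checkpid`.

```shell
# wait_for_pids TIMEOUT PID...
# Poll once per second until every PID is gone or TIMEOUT seconds have passed.
# Returns 0 if all PIDs exited, 1 if the timeout expired first.
wait_for_pids() {
    local timeout="$1" waited=0 p alive
    shift

    while :; do
        alive=0
        for p in "$@"; do
            kill -0 "$p" 2>/dev/null && alive=1
        done
        [ "$alive" -eq 0 ] && return 0
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}

# Possible use in the stop) branch of the init script (assumes the
# initscripts functions are sourced; the 10 second limit is arbitrary):
#   pids=$(pidof nfsd)
#   killproc nfsd -2
#   wait_for_pids 10 $pids || killproc nfsd -9 || :
```

The return value lets the caller tell "stopped" apart from "gave up waiting", so the script does not have to report success in the second case.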

Comment 26 Fabio Olive Leite 2010-09-22 18:04:54 UTC

> 1. There must be some limit on how long to wait.

There is no guarantee on how long it can take for any pending IO to complete and then unblock a disk-sleeping nfsd. So this limit should be established accepting the fact it may still be running (though likely to stop as soon as it can process signals).

> 2. If the process is still running when the timeout expires, I'd add a
> killproc nfsd -9 || :
> run to make absolutely sure the process gets stopped.

The problem is that it doesn't make it absolutely sure. Again, waiting on some stalled IO can take long and will not answer to kill -9.

Perhaps one should just assume that sometimes the service will not stop right away and work with that fact on the rest of the system.

Comment 27 Jeff Layton 2010-09-22 18:35:14 UTC

nfsd is a kernel thread -- sending it a SIGKILL isn't any more of a "sure kill" than a SIGINT.
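
To see whether a thread that will not die is sitting in uninterruptible sleep, its state can be read from /proc. A small sketch, with `state_from_status` and `is_uninterruptible` as made-up helper names:

```shell
# Print the one-letter state from a /proc/<pid>/status-style file.
# "D" means uninterruptible (disk) sleep.
state_from_status() {
    awk '/^State:/ { print $2; exit }' "$1"
}

# True if the given PID is currently in uninterruptible sleep.
is_uninterruptible() {
    [ "$(state_from_status "/proc/$1/status" 2>/dev/null)" = "D" ]
}

# Example (assumes pidof can see the nfsd threads on this system):
#   for p in $(pidof nfsd); do
#       is_uninterruptible "$p" && echo "nfsd $p is blocked in I/O"
#   done
```

A thread in state D will not act on any signal, SIGKILL included, until the I/O it is waiting on completes.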

Comment 28 RHEL Program Management 2011-01-11 20:07:05 UTC

This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 29 RHEL Program Management 2011-01-11 23:10:09 UTC

This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 30 Steve Dickson 2011-01-21 20:05:16 UTC

(In reply to comment #26)
> > 1. There must be some limit on how long to wait.
> 
> There is no guarantee on how long it can take for any pending IO to complete
> and then unblock a disk-sleeping nfsd. So this limit should be established
> accepting the fact it may still be running (though likely to stop as soon as it
> can process signals).
> 
> > 2. If the process is still running when the timeout expires, I'd add a
> > killproc nfsd -9 || :
> > run to make absolutely sure the process gets stopped.
> 
> The problem is that it doesn't make it absolutely sure. Again, waiting on some
> stalled IO can take long and will not answer to kill -9.
> 
> Perhaps one should just assume that sometimes the service will not stop right
> away and work with that fact on the rest of the system.
I agree with this... This is the way it has worked for a long
time now... Making these types of changes this late in the release
is not a good idea... IMHO.

Comment 31 RHEL Program Management 2011-01-21 20:25:18 UTC

Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.