Bug 525589
Description
Remi Broemeling
2009-09-24 23:26:57 UTC
Created attachment 362583 [details]
Debug syslog from boot1 covering 17:48:46 to 17:50:08.
Attaching the following files to this ticket. They represent a snapshot of logging/debugging that I've taken of what occurs when the corosync process on boot2 receives a QUIT signal (and then another one, one minute later).
The first QUIT signal was delivered at 17:49:00:
root@boot2:/tmp# date ; kill -QUIT 17620
Thu Sep 24 17:49:00 MDT 2009
The second QUIT signal was delivered at 17:50:00:
root@boot2:/tmp# date ; kill -QUIT 17620
Thu Sep 24 17:50:00 MDT 2009
Created attachment 362584 [details]
Debug syslog from boot2 covering 17:48:45 to 17:50:01.
Created attachment 362586 [details]
strace of attrd on boot2 from 17:48:45 to process death at 17:50:01.
Created attachment 362587 [details]
strace of cib on boot2 from 17:48:45 to process death at 17:50:07.
Created attachment 362588 [details]
strace of corosync on boot2 from 17:48:45 to process death at 17:50:08.
Created attachment 362589 [details]
strace of crmd on boot2 from 17:48:45 to process death at 17:50:00.
Created attachment 362590 [details]
strace of lrmd on boot2 from 17:48:45 to process death at 17:50:02.
Created attachment 362591 [details]
strace of pengine on boot2 from 17:48:45 to process death at 17:50:01.
Created attachment 362592 [details]
strace of stonithd on boot2 from 17:48:45 to process death at 17:50:08.
Corosync shuts down properly on this signal. May be an integration problem with Pacemaker. Andrew will take a look when he starts.
Regards
-steve

Remi,
After further investigation by Andrew, we found that SIGQUIT is not handled by corosync (only SIGINT is). SIGQUIT is in fact the wrong signal for what you want.
http://en.wikipedia.org/wiki/SIGQUIT
According to POSIX shutdown semantics, a daemon process should begin shutting down on SIGTERM; if it has not exited after a time interval determined by the system configuration, the system sends it a SIGKILL. Unfortunately SIGTERM is not handled by corosync either. We will address the SIGTERM issue with this bugzilla.
Regards
-steve

Also, the init script doesn't wait around to verify that corosync actually shuts down. We probably need to replace:
    killproc $prog
with
    killproc $prog -TERM
    echo "Waiting for resource activity to complete"
    while killproc $prog -0
    do
        sleep 1
        echo -n "."
    done

r2140 also removed the worker thread for shutdown. This puts the stack into a deadlock where corosync is waiting for pacemaker to exit and pacemaker is waiting for corosync to send/receive messages in order to be able to exit.

Created attachment 363880 [details]
patch against corosync 1.1.0 winding up shutdown thread from quit-signal
I observed the same behaviour of waiting for crmd to go away till eternity.
At least in my scenario the attached patch does the trick. It just creates
a thread to do the shutdown, as in previous corosync versions.
Of course this doesn't solve anything regarding the use of signals according
to their definition, and it may open up races, which might
have been the reason the thread was removed in the first place.
Because of that, this is not going to be the final solution, but I hope my
observations are of some help nevertheless.
Created attachment 363882 [details]
windup shutdown thread
Once again in plain text - sorry
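For reference, the init script change proposed earlier in this report (send TERM, then wait until the process is really gone) could be sketched roughly as below. This is only an illustration, not the patch that was posted: the names stop_and_wait, is_running and send_signal are hypothetical, and the timeout with a KILL fallback is an assumption added so the loop cannot spin forever.

```shell
# Hypothetical helpers, not part of the corosync init script.
is_running() { kill -0 "$1" 2>/dev/null; }
send_signal() { kill "-$1" "$2" 2>/dev/null; }

# stop_and_wait PID [TIMEOUT_SECONDS]
# Sends TERM, polls once per second, and escalates to KILL if the
# process is still alive after the timeout.
# Returns 0 if the process went away after TERM, 1 if KILL was needed.
stop_and_wait() {
    pid=$1
    timeout=${2:-60}   # assumed default, tune for your cluster
    waited=0
    send_signal TERM "$pid"
    while is_running "$pid"; do
        if [ "$waited" -ge "$timeout" ]; then
            send_signal KILL "$pid"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

Usage would be something like `stop_and_wait "$(cat "$PIDFILE")" 120`. Note that a wait loop like this does not help with the deadlock described above; it only keeps the init script from returning before corosync has exited.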
Patch for SIGTERM support and init script updates sent to mailing list for ACK.
https://lists.linux-foundation.org/pipermail/openais/2009-October/013111.html
Reassigning back to Steve to fix the threading part and bumping the priority since there is no work-around.

FWIW, I've observed that sending a TERM/QUIT signal to corosync starts the corosync shutdown, but for some reason it fails to shut down pacemaker. Pacemaker starts its shutdown procedure but gets stuck somewhere in the process. Sending another TERM/QUIT signal starts the procedure again, and this time pacemaker eventually shuts down.
For Debian/Ubuntu, I've come up with this stopping mechanism that (sort of) works:
    start-stop-daemon --stop --quiet --retry=QUIT/10/QUIT/10 --pidfile $PIDFILE
It sends a QUIT signal, waits 10 seconds, then sends QUIT again. I then have to wait another 10 seconds for all of pacemaker's processes to terminate so that the return value can be 0.
Both QUIT signals produce the same output from corosync:
    corosync[2991]: [SERV ] Unloading all corosync components
    corosync[2991]: [SERV ] Unloading corosync component: pacemaker v0
    corosync[2991]: [pcmk ] notice: pcmk_shutdown: Begining shutdown
    corosync[2991]: [pcmk ] notice: stop_child: Sent -15 to crmd: [3001]
pacemaker after first signal:
    crmd: [3001]: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
    crmd: [3001]: info: crm_shutdown: Requesting shutdown
    crmd: [3001]: info: do_shutdown_req: Sending shutdown request to DC: node-2
pacemaker after second signal:
    crmd: [3001]: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
    crmd: [3001]: ERROR: crm_shutdown: Escalating the shutdown
    crmd: [3001]: ERROR: do_log: FSA: Input I_ERROR from crm_shutdown() received in state S_NOT_DC
In the first case it requests shutdown, but after the second signal it escalates it.
The problem with this approach is that services running on that node will keep running, and if such a service is an IP address, the user will end up with 2 machines with the same IP. So, my best guess is that lrmd is the culprit here. As a workaround for this situation I'd suggest adding 'crm node standby' before sending TERM/QUIT signals.

Should hit the updates repo in a few days.

This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle. Changing version to '12'.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

The VERIFIED, FAILS_QA and RELEASE_PENDING bug states are not used by Fedora (they are used in the RHEL process). I'm closing this bug ahead of time. It is possibly fixed, but Reporter, if you can reproduce it using a current version of Fedora (version 12), please reopen it.
---
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

It was fixed during the F-12 alpha phase.
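For anyone still on an affected version, the 'crm node standby' workaround suggested earlier in this thread, combined with the Debian/Ubuntu stop command quoted there, might look roughly like this. It is a sketch only: stop_with_standby, run and DRY_RUN are hypothetical names introduced here, and it assumes that resources have finished moving off the node by the time the signals are sent, which 'crm node standby' by itself may not guarantee.

```shell
# Hypothetical wrapper: with DRY_RUN=1 the command is printed, not executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# stop_with_standby PIDFILE
stop_with_standby() {
    pidfile=$1
    # Put the local node in standby first so resources (e.g. a shared IP)
    # are not left running here when pacemaker goes away.
    run crm node standby || return 1
    # Stop command quoted in this report: QUIT, wait 10s, QUIT, wait 10s.
    run start-stop-daemon --stop --quiet --retry=QUIT/10/QUIT/10 --pidfile "$pidfile"
}
```

Running it with `DRY_RUN=1` first shows the two commands without touching the cluster.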