Created attachment 362582 [details]
corosync.conf

Description of problem:

Summary: With the user logged in as root on machine B of a two-machine cluster (hosts A and B, where A is the DC), sending a QUIT signal to corosync does not make corosync on host B exit. Instead it appears to get "stuck": it considers itself alive (its crm_mon shows both hosts A and B online), while the rest of the cluster loses track of it (on host A, crm_mon shows host B as lost). This leads to a split-brain situation.

Details:

`crm configure show` output:

node boot1
node boot2
property $id="cib-bootstrap-options" \
        dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"

Original state of the cluster (output of crm_mon):

============
Last updated: Wed Sep 23 15:56:24 2009
Stack: openais
Current DC: boot1 - partition with quorum
Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, 2 expected votes
0 Resources configured.
============
Online: [ boot1 boot2 ]

State of the cluster after the QUIT signal (output of crm_mon on boot1):

============
Last updated: Wed Sep 23 15:58:27 2009
Stack: openais
Current DC: boot1 - partition WITHOUT quorum
Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, 2 expected votes
0 Resources configured.
============
Online: [ boot1 ]
OFFLINE: [ boot2 ]

State of the cluster after the QUIT signal (output of crm_mon on boot2):

============
Last updated: Wed Sep 23 15:58:35 2009
Stack: openais
Current DC: boot1 - partition with quorum
Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, 2 expected votes
0 Resources configured.
============
Online: [ boot1 boot2 ]

Version-Release number of selected component (if applicable):

This issue has been replicated with two sets of *.deb packages on Ubuntu Hardy Heron LTS.
http://people.debian.org/~madkiss/ha-corosync (packages recompiled for Ubuntu LTS through use of the *.diff.gz, *.dsc, and *.orig.tar.gz files):

cluster-glue_1.0+hg20090915-1~bpo50+1_i386.deb
corosync_1.0.0-5~bpo50+1_i386.deb
libcorosync4_1.0.0-5~bpo50+1_i386.deb
libopenais3_1.0.0-4~bpo50+1_i386.deb
openais_1.0.0-4~bpo50+1_i386.deb
pacemaker-openais_1.0.5+hg20090915-1~bpo50+1_i386.deb

http://ppa.launchpad.net/ubuntu-ha-maintainers/ppa/ubuntu/:

pacemaker-openais package version 1.0.5+hg20090813-0ubuntu2~hardy1
openais package version 1.0.0-3ubuntu1~hardy1
corosync package version 1.0.0-4ubuntu1~hardy2
heartbeat-common package version 2.99.2+sles11r9-5ubuntu1~hardy1

How reproducible:

Very; it happens 100% of the time for me.

Steps to Reproduce:
1. Clean install of Ubuntu LTS Hardy Heron.
2. Network configuration as follows:
   boot1: eth0 is 192.168.10.192, eth1 is 172.16.1.1
   boot2: eth0 is 192.168.10.193, eth1 is 172.16.1.2
3. Network wiring as follows: boot1:eth0 and boot2:eth0 both connect to the same switch; boot1:eth1 and boot2:eth1 are connected directly to each other via a cross-over cable.
4. No firewall/iptables installed.
5. Install the package prerequisites: gawk libesmtp5 libglib2.0-0 libltdl3 libnet1 libnspr4-0d libnss3-1d libopenhpi2 libopenipmi0 libxml2 libxml2-utils libxslt1.1
6. Install the required packages:
   corosync_1.0.0-4ubuntu1~hardy2_i386.deb
   libcorosync4_1.0.0-4ubuntu1~hardy2_i386.deb
   openais_1.0.0-3ubuntu1~hardy1_i386.deb
   libopenais3_1.0.0-3ubuntu1~hardy1_i386.deb
   heartbeat-common_2.99.2+sles11r9-5ubuntu1~hardy1_i386.deb
   libheartbeat2_2.99.2+sles11r9-5ubuntu1~hardy1_i386.deb
   pacemaker-openais_1.0.5+hg20090813-0ubuntu2~hardy1_i386.deb
7. Enable corosync startup in /etc/default/corosync.
8. Install the attached corosync.conf file as /etc/corosync/corosync.conf and generate a corosync auth key using corosync-keygen.
9. Start corosync.
10. crm configure property no-quorum-policy=ignore
11. crm configure property stonith-enabled=false
12.
Attempt to shut down corosync on the non-DC host by sending it a QUIT signal (i.e. killall -QUIT corosync).

Actual results:

After receiving a QUIT signal, corosync does not terminate its children, and the node on which the QUIT signal was delivered is labeled as "lost" by the cluster DC.

Expected results:

After receiving a QUIT signal, corosync should terminate its children and then exit, removing the node from the cluster.

Additional info:

Complete log files, with debugging set to 'on', can be found at the following pastebin locations.

After the first QUIT signal issued on boot2:
boot1:/var/log/syslog: http://pastebin.com/m7f9a61fd
boot2:/var/log/syslog: http://pastebin.com/d26fdfee

After the second QUIT signal issued on boot2:
boot1:/var/log/syslog: http://pastebin.com/m755fb989
boot2:/var/log/syslog: http://pastebin.com/m22dcef45
Created attachment 362583 [details]
Debug syslog from boot1 covering 17:48:46 to 17:50:08.

Attaching the following files to this ticket. They are a snapshot of the logging/debugging I captured of what happens when the corosync process on boot2 receives a QUIT signal (and then another one, one minute later).

The first QUIT signal was delivered at 17:49:00:

root@boot2:/tmp# date ; kill -QUIT 17620
Thu Sep 24 17:49:00 MDT 2009

The second QUIT signal was delivered at 17:50:00:

root@boot2:/tmp# date ; kill -QUIT 17620
Thu Sep 24 17:50:00 MDT 2009
Created attachment 362584 [details] Debug syslog from boot2 covering 17:48:45 to 17:50:01.
Created attachment 362586 [details] strace of attrd on boot2 from 17:48:45 to process death at 17:50:01.
Created attachment 362587 [details] strace of cib on boot2 from 17:48:45 to process death at 17:50:07.
Created attachment 362588 [details] strace of corosync on boot2 from 17:48:45 to process death at 17:50:08.
Created attachment 362589 [details] strace of crmd on boot2 from 17:48:45 to process death at 17:50:00.
Created attachment 362590 [details] strace of lrmd on boot2 from 17:48:45 to process death at 17:50:02.
Created attachment 362591 [details] strace of pengine on boot2 from 17:48:45 to process death at 17:50:01.
Created attachment 362592 [details] strace of stonithd on boot2 from 17:48:45 to process death at 17:50:08.
Corosync shuts down properly on this signal. This may be an integration problem with Pacemaker. Andrew will take a look when he starts.

Regards
-steve
Remi,

After further investigation by Andrew, we found that SIGQUIT is not handled by corosync (only SIGINT is). SIGQUIT is in fact the wrong signal for what you want:

http://en.wikipedia.org/wiki/SIGQUIT

According to POSIX shutdown semantics, a daemon process should start its shutdown on SIGTERM, and after some time interval determined by the system configuration, SIGKILL is sent to the process. Unfortunately, SIGTERM is not handled by corosync either. We will address the SIGTERM issue with this bugzilla.

Regards
-steve
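For reference, the TERM-then-KILL sequence described above can be sketched as a small POSIX sh helper. The graceful_stop name and the 10-second default grace period are made up for illustration; this is not corosync's actual init logic.

```shell
#!/bin/sh
# Sketch of POSIX shutdown semantics: ask the daemon to exit with
# SIGTERM, give it a grace period, and only then escalate to SIGKILL.
# graceful_stop and the 10s default are illustrative assumptions.
graceful_stop() {
    pid=$1
    grace=${2:-10}    # seconds to wait before escalating

    kill -TERM "$pid" 2>/dev/null || return 0   # not running: nothing to do

    waited=0
    while kill -0 "$pid" 2>/dev/null; do        # still alive?
        if [ "$waited" -ge "$grace" ]; then
            kill -KILL "$pid" 2>/dev/null       # grace expired: force it
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A return value of 1 signals that escalation to SIGKILL was needed, which an init script could report as a failed clean stop.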
Also, the init script doesn't wait around to verify that corosync actually shuts down. We probably need to replace:

   killproc $prog

with:

   killproc $prog -TERM
   echo "Waiting for resource activity to complete"
   while killproc $prog -0
   do
       sleep 1
       echo -n "."
   done
r2140 also removed the worker thread for shutdown. This puts the stack into a deadlock where corosync is waiting for pacemaker to exit and pacemaker is waiting for corosync to send/receive messages in order to be able to exit.
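For illustration only, the circular wait can be modelled with two shell processes that each block reading a FIFO the other will only write after its own read completes. The FIFO names and the mapping to corosync/pacemaker are a loose analogy, not the real IPC.

```shell
#!/bin/sh
# Toy model of the deadlock described above: the "corosync" side
# blocks until the "pacemaker" side announces it has exited, while
# the "pacemaker" side blocks until the "corosync" side services a
# message. Neither can ever proceed.
dir=$(mktemp -d)
mkfifo "$dir/to_corosync" "$dir/to_pacemaker"

# "corosync": will not service messages until pacemaker has exited
sh -c "read msg < '$dir/to_corosync'; echo served > '$dir/to_pacemaker'" &
cs=$!

# "pacemaker": cannot exit until corosync services its message
sh -c "read msg < '$dir/to_pacemaker'; echo exited > '$dir/to_corosync'" &
pm=$!

sleep 3   # ample time; neither side can make progress
if kill -0 "$cs" 2>/dev/null && kill -0 "$pm" 2>/dev/null; then
    echo "deadlocked: both sides still blocked"
fi
kill -KILL "$cs" "$pm" 2>/dev/null || true
rm -rf "$dir"
```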
Created attachment 363880 [details]
Patch against corosync 1.1.0 that winds up a shutdown thread from the quit signal

I observed the same behaviour of corosync waiting forever for crmd to go away. At least in my scenario, the attached patch does the trick: it simply creates a thread that performs the shutdown, as previous corosync versions did. Of course this doesn't solve anything regarding the use of signals according to their definitions, and it may open up races that could be the reason this approach was abandoned in the past. Because of that, this is probably not the final solution, but I hope my observations are of some help nevertheless.
Created attachment 363882 [details]
Wind up shutdown thread

Once again in plain text - sorry.
Patch for SIGTERM support and init script updates sent to the mailing list for ACK:

https://lists.linux-foundation.org/pipermail/openais/2009-October/013111.html

Reassigning back to Steve to fix the threading part, and bumping the priority since there is no workaround.
FWIW, I've observed that sending a TERM/QUIT signal to corosync starts the corosync shutdown, but for some reason it fails to shut down pacemaker. Pacemaker starts its shutdown procedure but gets stuck somewhere in the process. Sending another TERM/QUIT signal starts the procedure again, and then pacemaker does eventually shut down.

For Debian/Ubuntu, I've come up with this stopping mechanism that (sort of) works:

start-stop-daemon --stop --quiet --retry=QUIT/10/QUIT/10 --pidfile $PIDFILE

It sends a QUIT signal, waits 10 seconds, then sends a QUIT signal again. I then have to wait another 10 seconds for all of pacemaker's processes to terminate so that the return value can be 0.

Both QUIT signals produce the same output from corosync:

corosync[2991]: [SERV ] Unloading all corosync components
corosync[2991]: [SERV ] Unloading corosync component: pacemaker v0
corosync[2991]: [pcmk ] notice: pcmk_shutdown: Begining shutdown
corosync[2991]: [pcmk ] notice: stop_child: Sent -15 to crmd: [3001]

pacemaker after the first signal:

crmd: [3001]: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
crmd: [3001]: info: crm_shutdown: Requesting shutdown
crmd: [3001]: info: do_shutdown_req: Sending shutdown request to DC: node-2

pacemaker after the second signal:

crmd: [3001]: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
crmd: [3001]: ERROR: crm_shutdown: Escalating the shutdown
crmd: [3001]: ERROR: do_log: FSA: Input I_ERROR from crm_shutdown() received in state S_NOT_DC

In the first case it requests a shutdown; after the second signal, it escalates it. The problem with this approach is that services running on that node keep running, and if one of those services is an IP address, the user ends up with two machines holding the same IP. So my best guess is that lrmd is the culprit here.

As a workaround for this situation, I'd suggest running 'crm node standby' before sending the TERM/QUIT signals.
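The workaround above could be wrapped into an init-script helper roughly like this. The function name standby_and_stop, the PIDFILE default, and the 5-second settle delay are assumptions for illustration; the --retry schedule is the one given in this comment.

```shell
#!/bin/sh
# Sketch of the suggested workaround: put the node in standby so the
# DC migrates resources (e.g. cluster IPs) away, then stop corosync
# with the double-QUIT retry schedule. standby_and_stop, the PIDFILE
# default, and the 5s settle delay are illustrative assumptions.
PIDFILE=${PIDFILE:-/var/run/corosync.pid}

standby_and_stop() {
    node=${1:-$(uname -n)}

    # Move resources off this node first, so that e.g. a cluster IP
    # is not left configured on a node the DC has declared lost.
    crm node standby "$node"
    sleep 5   # crude settle time; a real script would poll crm_mon

    # QUIT, wait up to 10s, QUIT again, wait up to 10s.
    start-stop-daemon --stop --quiet \
        --retry=QUIT/10/QUIT/10 --pidfile "$PIDFILE"
}
```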
Should hit the updates repo in a few days.
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle. Changing version to '12'. More information and the reason for this action are here:

http://fedoraproject.org/wiki/BugZappers/HouseKeeping
The VERIFIED, FAILS_QA and RELEASE_PENDING bug states are not used by Fedora (they are used in the RHEL process), so I'm closing this bug ahead of time. It is possibly fixed, but Reporter, if you can reproduce it using a current version of Fedora (version 12), please reopen it.

---
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers
It was fixed during the F-12 alpha phase.