Bug 525589

Summary: Corosync does not shutdown on receipt of SIGTERM with pacemaker service engine
Product: [Fedora] Fedora
Reporter: Remi Broemeling <remi>
Component: corosync
Assignee: Steven Dake <sdake>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: high
CC: abeekhof, agk, andrew, fdinitto, ivoks, sdake
Hardware: i386
OS: Linux
Fixed In Version: corosync-1.1.1.fc11/fc12/rawhide
Doc Type: Bug Fix
Clones: 526968
Last Closed: 2009-12-04 18:27:00 UTC

Attachments:

Description	Flags
corosync.conf	none
Debug syslog from boot1 covering 17:48:46 to 17:50:08.	none
Debug syslog from boot2 covering 17:48:45 to 17:50:01.	none
strace of attrd on boot2 from 17:48:45 to process death at 17:50:01.	none
strace of cib on boot2 from 17:48:45 to process death at 17:50:07.	none
strace of corosync on boot2 from 17:48:45 to process death at 17:50:08.	none
strace of crmd on boot2 from 17:48:45 to process death at 17:50:00.	none
strace of lrmd on boot2 from 17:48:45 to process death at 17:50:02.	none
strace of pengine on boot2 from 17:48:45 to process death at 17:50:01.	none
strace of stonithd on boot2 from 17:48:45 to process death at 17:50:08.	none
patch against corosync 1.1.0 winding up shutdown thread from quit-signal	none
windup shutdown thread	none

Description Remi Broemeling 2009-09-24 23:26:57 UTC

Created attachment 362582
corosync.conf

Description of problem:

Summary:

As root on host B of a two-machine cluster (hosts A and B, with A as the DC), send a QUIT signal to corosync. Corosync on host B does not exit. Instead it appears to get "stuck": host B still considers itself alive (crm_mon on B shows both hosts A and B online), while the rest of the cluster loses track of it (crm_mon on A shows that host B has become lost). This leads to a split-brain situation.

Details:

`crm configure show` output:
    node boot1
    node boot2
    property $id="cib-bootstrap-options" \
            dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
            cluster-infrastructure="openais" \
            expected-quorum-votes="2" \
            stonith-enabled="false" \
            no-quorum-policy="ignore"

Original State of Cluster (output of crm_mon):
  ============
  Last updated: Wed Sep 23 15:56:24 2009
  Stack: openais
  Current DC: boot1 - partition with quorum
  Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
  2 Nodes configured, 2 expected votes
  0 Resources configured.
  ============

  Online: [ boot1 boot2 ]

State of Cluster after QUIT signal (output of crm_mon, boot1):
  ============
  Last updated: Wed Sep 23 15:58:27 2009
  Stack: openais
  Current DC: boot1 - partition WITHOUT quorum
  Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
  2 Nodes configured, 2 expected votes
  0 Resources configured.
  ============

  Online: [ boot1 ]
  OFFLINE: [ boot2 ]

State of Cluster after QUIT signal (output of crm_mon, boot2):
  ============
  Last updated: Wed Sep 23 15:58:35 2009
  Stack: openais
  Current DC: boot1 - partition with quorum
  Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
  2 Nodes configured, 2 expected votes
  0 Resources configured.
  ============

  Online: [ boot1 boot2 ]

Version-Release number of selected component (if applicable):

This issue has been replicated with two sets of *.deb packages on Ubuntu Hardy Heron LTS.

http://people.debian.org/~madkiss/ha-corosync (packages recompiled for Ubuntu LTS through use of *.diff.gz, *.dsc, and *.orig.tar.gz files)
  cluster-glue_1.0+hg20090915-1~bpo50+1_i386.deb
  corosync_1.0.0-5~bpo50+1_i386.deb
  libcorosync4_1.0.0-5~bpo50+1_i386.deb
  libopenais3_1.0.0-4~bpo50+1_i386.deb
  openais_1.0.0-4~bpo50+1_i386.deb
  pacemaker-openais_1.0.5+hg20090915-1~bpo50+1_i386.deb

http://ppa.launchpad.net/ubuntu-ha-maintainers/ppa/ubuntu/:
  pacemaker-openais package version 1.0.5+hg20090813-0ubuntu2~hardy1
  openais package version 1.0.0-3ubuntu1~hardy1
  corosync package version 1.0.0-4ubuntu1~hardy2
  heartbeat-common package version heartbeat-common_2.99.2+sles11r9-5ubuntu1~hardy1

How reproducible:

Always; it happens 100% of the time for me.

Steps to Reproduce:
1. Clean install of Ubuntu LTS Hardy Heron
2. Network configuration as follows:
   boot1:
     eth0 is 192.168.10.192
     eth1 is 172.16.1.1
   boot2:
     eth0 is 192.168.10.193
     eth1 is 172.16.1.2
3. Network wire configuration as follows:
   boot1:eth0 and boot2:eth0 both connect to the same switch.
   boot1:eth1 and boot2:eth1 are connected directly to each other via cross-over cable.
4. No firewall/iptables installed.
5. Install package pre-requisites:
   gawk 
   libesmtp5 
   libglib2.0-0 
   libltdl3 
   libnet1 
   libnspr4-0d 
   libnss3-1d 
   libopenhpi2 
   libopenipmi0 
   libxml2 
   libxml2-utils 
   libxslt1.1
6. Install required packages:
   corosync_1.0.0-4ubuntu1~hardy2_i386.deb
   libcorosync4_1.0.0-4ubuntu1~hardy2_i386.deb
   openais_1.0.0-3ubuntu1~hardy1_i386.deb
   libopenais3_1.0.0-3ubuntu1~hardy1_i386.deb
   heartbeat-common_2.99.2+sles11r9-5ubuntu1~hardy1_i386.deb
   libheartbeat2_2.99.2+sles11r9-5ubuntu1~hardy1_i386.deb
   pacemaker-openais_1.0.5+hg20090813-0ubuntu2~hardy1_i386.deb
7. Enable corosync startup in /etc/default/corosync.
8. Install the attached corosync.conf file in /etc/corosync/corosync.conf and generate a corosync auth key using corosync-keygen.
9. Start corosync.
10. crm configure property no-quorum-policy=ignore
11. crm configure property stonith-enabled=false
12. Attempt to shut down corosync on the non-DC host by sending it a QUIT signal (i.e. killall -QUIT corosync).

Actual results:

After receiving a QUIT signal, corosync does not terminate its children, and the node that the QUIT signal is sent on is labeled as "lost" by the cluster DC.

Expected results:

After receiving a QUIT signal, corosync should terminate its children and then exit, removing the node from the cluster.

Additional info:

Complete log files, with debugging set to 'on', can be found at the following pastebin locations:

  After first QUIT signal issued on boot2:
    boot1:/var/log/syslog: http://pastebin.com/m7f9a61fd
    boot2:/var/log/syslog: http://pastebin.com/d26fdfee
  After second QUIT signal issued on boot2:
    boot1:/var/log/syslog: http://pastebin.com/m755fb989
    boot2:/var/log/syslog: http://pastebin.com/m22dcef45

Comment 1 Remi Broemeling 2009-09-25 00:03:16 UTC

Created attachment 362583
Debug syslog from boot1 covering 17:48:46 to 17:50:08.

Attaching the following files to this ticket.  They represent a snapshot of logging/debugging that I've taken of what occurs when the corosync process on boot2 receives a QUIT signal (and then another one, one minute later).

The first QUIT signal was delivered at 17:49:00:
  root@boot2:/tmp# date ; kill -QUIT 17620
  Thu Sep 24 17:49:00 MDT 2009

The second QUIT signal was delivered at 17:50:00:
  root@boot2:/tmp# date ; kill -QUIT 17620
  Thu Sep 24 17:50:00 MDT 2009

Comment 2 Remi Broemeling 2009-09-25 00:04:00 UTC

Created attachment 362584
Debug syslog from boot2 covering 17:48:45 to 17:50:01.

Comment 3 Remi Broemeling 2009-09-25 00:05:01 UTC

Created attachment 362586
strace of attrd on boot2 from 17:48:45 to process death at 17:50:01.

Comment 4 Remi Broemeling 2009-09-25 00:05:40 UTC

Created attachment 362587
strace of cib on boot2 from 17:48:45 to process death at 17:50:07.

Comment 5 Remi Broemeling 2009-09-25 00:06:23 UTC

Created attachment 362588
strace of corosync on boot2 from 17:48:45 to process death at 17:50:08.

Comment 6 Remi Broemeling 2009-09-25 00:06:55 UTC

Created attachment 362589
strace of crmd on boot2 from 17:48:45 to process death at 17:50:00.

Comment 7 Remi Broemeling 2009-09-25 00:07:38 UTC

Created attachment 362590
strace of lrmd on boot2 from 17:48:45 to process death at 17:50:02.

Comment 8 Remi Broemeling 2009-09-25 00:08:03 UTC

Created attachment 362591
strace of pengine on boot2 from 17:48:45 to process death at 17:50:01.

Comment 9 Remi Broemeling 2009-09-25 00:08:33 UTC

Created attachment 362592
strace of stonithd on boot2 from 17:48:45 to process death at 17:50:08.

Comment 10 Steven Dake 2009-09-28 16:29:31 UTC

Corosync shuts down properly on this signal.  May be an integration problem with Pacemaker.  Andrew will take a look when he starts.

Regards
-steve

Comment 11 Steven Dake 2009-10-02 18:44:26 UTC

Remi,

After further investigation by Andrew, we found that corosync does not handle SIGQUIT (it handles only SIGINT).  SIGQUIT is in fact the wrong signal for what you want.

http://en.wikipedia.org/wiki/SIGQUIT

According to POSIX shutdown semantics, a daemon process should begin shutting down on SIGTERM; after a time interval determined by the system configuration, the system sends SIGKILL to the process.

Unfortunately SIGTERM is not handled by corosync either.  We will address the SIGTERM issue with this bugzilla.

Regards
-steve
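
As an illustration only, the daemon side of the SIGTERM semantics described above can be sketched in shell with a toy loop standing in for a daemon. This is not corosync code; the function name run_toy_daemon and the one-second polling interval are assumptions made for the example.

```shell
#!/bin/sh
# Toy stand-in for a daemon main loop (hypothetical, not corosync code).
# SIGTERM only sets a flag; the loop notices the flag, leaves the loop,
# and reports that shutdown finished.
run_toy_daemon() {
    stop_requested=0
    trap 'stop_requested=1' TERM

    while [ "$stop_requested" -eq 0 ]; do
        # Sleep in the background and wait on it, so that a trapped
        # signal interrupts the wait instead of being delayed.
        sleep 1 &
        wait "$!"
    done

    echo "shutdown complete"
}
```

Running the function in the background and sending it TERM should make it print "shutdown complete" and exit with status 0, rather than being killed by the signal's default action.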

Comment 12 Andrew Beekhof 2009-10-05 10:36:47 UTC

Also, the init script doesn't wait around to verify that corosync actually shuts down.

We probably need to replace:

    killproc $prog

with

    killproc $prog -TERM
    echo "Waiting for resource activity to complete"
    while
        killproc $prog -0
    do
        sleep 1
        echo -n "."
    done
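
The wait loop above polls until the process is gone, with no upper bound. As a sketch only, the same idea can be bounded by a grace period, after which SIGKILL is sent. The helper name stop_with_grace and its default of 10 seconds are assumptions for illustration, and plain kill is used here instead of the init script's killproc.

```shell
#!/bin/sh
# Hypothetical helper: send SIGTERM to a PID, poll once per second, and
# fall back to SIGKILL if the process is still alive after the grace period.
# Usage: stop_with_grace PID [GRACE_SECONDS]
# Returns 0 if the process exited after SIGTERM (or was already gone),
# 1 if SIGKILL had to be sent.
stop_with_grace() {
    pid=$1
    grace=${2:-10}

    # If the signal cannot be delivered, treat the process as already gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$grace" ]; then
            kill -KILL "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A caller could log or alert on a return value of 1, since it means the process did not finish its own shutdown within the grace period.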

Comment 13 Andrew Beekhof 2009-10-06 13:43:50 UTC

r2140 also removed the worker thread for shutdown.

This puts the stack into a deadlock where corosync is waiting for pacemaker to exit and pacemaker is waiting for corosync to send/receive messages in order to be able to exit.

Comment 14 klaus wenninger 2009-10-06 19:07:06 UTC

Created attachment 363880
patch against corosync 1.1.0 winding up shutdown thread from quit-signal

I observed the same behaviour of waiting forever for crmd to go away.
At least in my scenario, the attached patch does the trick. It just creates
a thread to do the shutdown, as was done in previous corosync versions.
Of course, this doesn't solve anything regarding the use of signals according
to their definition, and it may open up races, which might
have been the reason for no longer doing it that way.
Because of that, this is not going to be the solution, but I hope my
observations are of some help nevertheless.

Comment 15 klaus wenninger 2009-10-06 19:16:59 UTC

Created attachment 363882
windup shutdown thread

Once again in plain text - sorry

Comment 16 Andrew Beekhof 2009-10-07 07:59:31 UTC

Patch for SIGTERM support and init script updates sent to mailing list for ACK.
  
https://lists.linux-foundation.org/pipermail/openais/2009-October/013111.html

Reassigning back to Steve to fix the threading part and bumping the priority
since there is no work-around.

Comment 17 Ante Karamatic 2009-10-08 10:31:34 UTC

FWIW, I've observed that sending a TERM/QUIT signal to corosync starts the corosync shutdown, but for some reason it fails to shut down pacemaker. Pacemaker starts its shutdown procedure but gets stuck somewhere in the process. Sending another TERM/QUIT signal starts the procedure again, and this time pacemaker eventually shuts down. For Debian/Ubuntu, I've come up with this stopping mechanism that (sort of) works:

start-stop-daemon --stop --quiet --retry=QUIT/10/QUIT/10 --pidfile $PIDFILE

It sends a QUIT signal, waits 10 seconds, then sends a QUIT signal again. I then have to wait another 10 seconds for all of pacemaker's processes to terminate so that the return value can be 0.

Both QUIT signals produce the same output from corosync:

corosync[2991]:   [SERV  ] Unloading all corosync components
corosync[2991]:   [SERV  ] Unloading corosync component: pacemaker v0
corosync[2991]:   [pcmk  ] notice: pcmk_shutdown: Begining shutdown
corosync[2991]:   [pcmk  ] notice: stop_child: Sent -15 to crmd: [3001]

pacemaker after first signal:

crmd: [3001]: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
crmd: [3001]: info: crm_shutdown: Requesting shutdown
crmd: [3001]: info: do_shutdown_req: Sending shutdown request to DC: node-2

pacemaker after second signal:

crmd: [3001]: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
crmd: [3001]: ERROR: crm_shutdown: Escalating the shutdown
crmd: [3001]: ERROR: do_log: FSA: Input I_ERROR from crm_shutdown() received in state S_NOT_DC

After the first signal, crmd requests a shutdown; after the second signal, it escalates the shutdown.

The problem with this approach is that services running on that node keep running, and if one of those services is an IP address, the user ends up with two machines with the same IP. So my best guess is that lrmd is the culprit here.

As a workaround for this situation, I'd suggest running 'crm node standby' before sending TERM/QUIT signals.

Comment 18 Steven Dake 2009-10-21 16:14:03 UTC

The fix should hit the updates repo in a few days.

Comment 19 Bug Zapper 2009-11-16 12:52:52 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 20 Vedran Miletić 2009-12-04 18:27:00 UTC

The VERIFIED, FAILS_QA and RELEASE_PENDING bug states are not used by Fedora (they are used in the RHEL process).

I'm closing this bug ahead of time. It is possibly fixed, but Reporter, if you can reproduce it using a current version of Fedora (version 12), please reopen it.

---

Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 21 Andrew Beekhof 2009-12-05 06:49:38 UTC

It was fixed during the F-12 alpha phase.