Bug 592103

Summary: it can take up to 30 seconds after `service cman stop` for corosync to exit
Product: Red Hat Enterprise Linux 6 Reporter: Nate Straz <nstraz>
Component: clusterAssignee: Fabio Massimo Di Nitto <fdinitto>
Status: CLOSED CURRENTRELEASE QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: low    
Version: 6.0CC: ccaulfie, cluster-maint, lhh, rpeterso, teigland
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: cluster-3.0.12-2.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-07-02 18:55:09 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
proposed patch none

Description Nate Straz 2010-05-13 20:49:43 UTC
Description of problem:

While running the test case whiplash, which starts and stops the cluster using init scripts, verifying the cluster is down fails.

The logs from a good stop:

May 13 15:23:51 morph-03 qarshd[2695]: Running cmdline: service cman stop
May 13 15:23:51 morph-03 fenced[2529]: cpg_mcast_joined error 12 handle 440badfc00000001 complete
May 13 15:23:52 morph-03 kernel: dlm: closing connection to node 4
May 13 15:23:52 morph-03 kernel: dlm: closing connection to node 3
May 13 15:23:52 morph-03 kernel: dlm: closing connection to node 1
May 13 15:23:53 morph-03 corosync[2475]:   [SERV  ] Unloading all Corosync service engines.
May 13 15:23:53 morph-03 corosync[2475]:   [SERV  ] Service engine unloaded: corosync extended virtual synchrony service
May 13 15:23:53 morph-03 corosync[2475]:   [SERV  ] Service engine unloaded: corosync configuration service
May 13 15:23:53 morph-03 corosync[2475]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
May 13 15:23:53 morph-03 corosync[2475]:   [SERV  ] Service engine unloaded: corosync cluster config database access v1.01
May 13 15:23:53 morph-03 corosync[2475]:   [SERV  ] Service engine unloaded: corosync profile loading service
May 13 15:23:53 morph-03 corosync[2475]:   [SERV  ] Service engine unloaded: openais checkpoint service B.01.01
May 13 15:23:53 morph-03 corosync[2475]:   [SERV  ] Service engine unloaded: corosync CMAN membership service 2.90
May 13 15:23:53 morph-03 corosync[2475]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
May 13 15:23:53 morph-03 corosync[2475]:   [MAIN  ] Corosync Cluster Engine exiting with status -1 at main.c:160.
May 13 15:23:54 morph-03 xinetd[4473]: EXIT: qarsh status=0 pid=2695 duration=3(sec)
May 13 15:23:54 morph-03 xinetd[4473]: START: qarsh pid=2764 from=::ffff:10.15.89.98
May 13 15:23:54 morph-03 qarshd[2764]: Talking to peer 10.15.89.98:52352
May 13 15:23:54 morph-03 qarshd[2764]: Running cmdline: if service cman status; then /bin/false; else /bin/true; fi


The logs from a failed stop:

May 13 15:23:51 morph-01 qarshd[5159]: Running cmdline: service cman stop
May 13 15:23:52 morph-01 kernel: dlm: closing connection to node 4
May 13 15:23:52 morph-01 kernel: dlm: closing connection to node 3
May 13 15:23:52 morph-01 kernel: dlm: closing connection to node 1
May 13 15:23:53 morph-01 corosync[4932]:   [SERV  ] Unloading all Corosync service engines.
May 13 15:23:54 morph-01 xinetd[1462]: EXIT: qarsh status=0 pid=5159 duration=3(sec)
May 13 15:23:54 morph-01 xinetd[1462]: START: qarsh pid=5233 from=::ffff:10.15.89.98
May 13 15:23:54 morph-01 qarshd[5233]: Talking to peer 10.15.89.98:52354
May 13 15:23:54 morph-01 qarshd[5233]: Running cmdline: if service cman status; then /bin/false; else /bin/true; fi
May 13 15:23:54 morph-01 xinetd[1462]: EXIT: qarsh status=0 pid=5233 duration=0(sec)
May 13 15:24:23 morph-01 corosync[4932]:   [SERV  ] Service engine unloaded: corosync extended virtual synchrony service
May 13 15:24:23 morph-01 corosync[4932]:   [SERV  ] Service engine unloaded: corosync configuration service
May 13 15:24:23 morph-01 corosync[4932]:   [CMAN  ] quorum lost, blocking activity
May 13 15:24:23 morph-01 corosync[4932]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 13 15:24:23 morph-01 corosync[4932]:   [QUORUM] Members[0]:
May 13 15:24:23 morph-01 corosync[4932]:   [QUORUM] Members[0]:
May 13 15:24:23 morph-01 corosync[4932]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 13 15:24:23 morph-01 corosync[4932]:   [QUORUM] Members[1]: 1
May 13 15:24:23 morph-01 corosync[4932]:   [QUORUM] Members[1]: 1
May 13 15:24:23 morph-01 corosync[4932]:   [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
May 13 15:24:23 morph-01 corosync[4932]:   [SERV  ] Service engine unloaded: corosync cluster config database access v1.01
May 13 15:24:23 morph-01 corosync[4932]:   [SERV  ] Service engine unloaded: corosync profile loading service
May 13 15:24:23 morph-01 corosync[4932]:   [SERV  ] Service engine unloaded: openais checkpoint service B.01.01
May 13 15:24:23 morph-01 corosync[4932]:   [SERV  ] Service engine unloaded: corosync CMAN membership service 2.90
May 13 15:24:23 morph-01 corosync[4932]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
May 13 15:24:23 morph-01 corosync[4932]:   [MAIN  ] Corosync Cluster Engine exiting with status -1 at main.c:160.


Version-Release number of selected component (if applicable):
corosync-1.2.1-2.el6.i686
cman-3.0.12-1.el6.i686

How reproducible:
Easily

Steps to Reproduce:
Run whiplash test case
  
Actual results:
[cman] [whiplash] [verify-down] [morph-01] cman is not running
[cman] [whiplash] [verify-down] [morph-03] corosync is stopped
[cman] [whiplash] [verify-down] [morph-04] corosync is stopped

Expected results:
[cman] [whiplash] [verify-down] [morph-01] corosync is stopped
[cman] [whiplash] [verify-down] [morph-03] corosync is stopped
[cman] [whiplash] [verify-down] [morph-04] corosync is stopped


Additional info:

Comment 1 Fabio Massimo Di Nitto 2010-05-13 20:55:56 UTC
Sorry I don´t understand what this test case is supposed to check.

Services can take time to stop. If you check corosync after 30 seconds, is it still running? is it hanging?

Also please note the error you have in the "good log". It´s possible that fenced didn´t have time to join the group and that you did issue a shutdown command (that went in faster), while in the second one, fenced did take more time to allow corosync/cman to shutdown.

Is there a specific restriction that they need to stop within a certain amount of time?

Comment 2 Nate Straz 2010-05-13 21:07:52 UTC
Generally speaking, an init script will wait for the service it governs to completely shutdown before exiting.  This test case verifies that.  It also verifies that it can be repeated and no components fail when it is repeated.

Comment 3 Fabio Massimo Di Nitto 2010-05-13 21:32:42 UTC
Please also provide the other info I requested in respect of corosync exiting later on so on. Those are relevant as I have never seen this behavior in my start/stop loop tests.

Comment 4 Nate Straz 2010-05-13 21:47:21 UTC
corosync is not hanging, it just takes extra time to shut down.  After 30 seconds, `service cman status` will report "corosync is stopped."

Whiplash runs in parallel the following pseudo code

while iterating:
  foreach node:
    service cman start
  foreach node:
    service cman status
  foreach node:
    service cman stop
  foreach node:
    `service cman status` == "corosync is stopped"

Comment 6 RHEL Program Management 2010-05-13 22:54:33 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 7 Fabio Massimo Di Nitto 2010-05-14 04:25:59 UTC
Created attachment 413941 [details]
proposed patch

Comment 8 Fabio Massimo Di Nitto 2010-05-14 04:27:00 UTC
Dean, can you please give it a spin with this patch to /etc/init.d/cman ?

Thanks
fabio

Comment 9 Nate Straz 2010-05-14 16:36:27 UTC
That patch allows the test to complete, but I'm still concerned.  Why does it sometimes take corosync 30 seconds to unload on one node?

Comment 10 Fabio Massimo Di Nitto 2010-05-14 18:38:41 UTC
thanks for testing. I suggest you file a separate bug for corosync and request info to their team.

Comment 11 Fabio Massimo Di Nitto 2010-05-15 06:01:47 UTC
git commit 67aa7e900364b651aaf0a8b89acbd8157aa551cf on RHEL6 branch.

Comment 13 Nate Straz 2010-06-03 21:11:07 UTC
Made it through whiplash with cman-3.0.12-6.el6.

Comment 14 releng-rhel@redhat.com 2010-07-02 18:55:09 UTC
Red Hat Enterprise Linux Beta 2 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.