Description of problem:
If a node fails while a service is stopping, the service is not restarted elsewhere in the cluster, because it gets stuck in the "stopping" state.

Version-Release number of selected component (if applicable):
2.6.9-34.0.1

How reproducible:
1. Start the service on node1.
2. Type "reboot" on node1, and immediately pull power.
3. Service apache should be stuck in the "stopping" state.

In our case, this occurred because we simulated a failure by pressing (and holding) the power button. On the first button press, an ACPI event is generated and the system begins a halt. Six seconds into the halt, system power is removed, so the node reaches a "failed" status before the cluster services have stopped cleanly.

Actual results:
Service apache gets stuck in state "stopping".

Expected results:
Service apache is restarted on a new node when the former node is marked as failed.

Additional info:
Relevant log messages from the non-failing node:

<snip>
May 26 08:52:33 vhaisalnx1 kernel: CMAN: removing node vhaisalnx2 from the cluster : Missed too many heartbeats
May 26 08:52:32 vhaisalnx1 last message repeated 9 times
May 26 08:52:33 vhaisalnx1 clurgmgrd[17020]: <debug> Suspend Event
May 26 08:52:33 vhaisalnx1 fenced[17005]: vhaisalnx2 not a cluster member after 0 sec post_fail_delay
May 26 08:52:33 vhaisalnx1 fenced[17005]: vhaisalnx2 not a cluster member after 0 sec post_fail_delay
May 26 08:52:33 vhaisalnx1 fenced[17005]: fencing node "vhaisalnx2"
May 26 08:52:33 vhaisalnx1 fenced[17005]: fencing node "vhaisalnx2"
May 26 08:52:34 vhaisalnx1 clurgmgrd[17020]: <debug> Sending service states to fd11
May 26 08:52:45 vhaisalnx1 fenced[17005]: fence "vhaisalnx2" success
May 26 08:52:45 vhaisalnx1 fenced[17005]: fence "vhaisalnx2" success
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> Sending service states to fd11
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> NULL cluster event
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> Membership Change Event
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> Magma Event: Membership Change
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> Magma Event: Membership Change
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> I am node #1
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> State change: vhaisalnx2 DOWN
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> State change: vhaisalnx2 DOWN
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> Evaluating RG apache, state stopping, owner (null)
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> RG apache is stopping
</snip>

Note that the service does not automatically recover from this state. The only way to recover from this situation is to restart cluster services.
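
To make the expected behavior concrete, here is a minimal sketch of the decision a surviving node could make when it evaluates a resource group after a membership change. This is not the rgmanager code or the attached patch. The `Service` record, the `last_owner` field, and `needs_recovery` are hypothetical names invented for illustration, and the sketch assumes some record of the node that was performing the stop is still available (the log above shows only `owner (null)`).

```python
from dataclasses import dataclass
from typing import Optional, Set

# Hypothetical state names, loosely modeled on the states quoted in this report.
STARTED = "started"
STOPPING = "stopping"
STOPPED = "stopped"


@dataclass
class Service:
    name: str
    state: str
    owner: Optional[str]              # current owner, None if not recorded
    last_owner: Optional[str] = None  # hypothetical: node that last ran the service


def needs_recovery(svc: Service, live_nodes: Set[str]) -> bool:
    """Return True if the service should be restarted on a surviving node."""
    if svc.state not in (STARTED, STOPPING):
        # Cleanly stopped services are left alone.
        return False
    # Fall back to the last known owner when the owner field is empty,
    # as in "state stopping, owner (null)".
    node = svc.owner if svc.owner is not None else svc.last_owner
    # A stop that was in progress on a dead node can never complete,
    # so treat the service as failed instead of waiting indefinitely.
    return node is not None and node not in live_nodes


apache = Service("apache", STOPPING, owner=None, last_owner="vhaisalnx2")
print(needs_recovery(apache, live_nodes={"vhaisalnx1"}))
```

The point of the sketch is only that "stopping" is a transitional state: once the node performing the stop has been fenced, nothing else will move the service out of that state unless the evaluation logic treats it like a failed service.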
Created attachment 130996
Fixes behavior

This is in CVS and will go out with the next update.
Created attachment 131553
Corrected patch; original was backwards
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0557.html