Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 193255

Summary: rgmanager fails to properly start service after node failure
Summary: rgmanager fails to properly start service after node failure
Product: [Retired] Red Hat Cluster Suite
Component: rgmanager
Version: 4
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Justin Nemmers <jnemmers>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, tao
Fixed In Version: RHBA-2006-0557
Doc Type: Bug Fix
Last Closed: 2006-08-15 19:52:41 UTC
Attachments:
- Fixes behavior
- Corrected patch; original was backwards

Description Justin Nemmers 2006-05-26 16:42:08 UTC
Description of problem:

If a node fails while a service is stopping, the service will not be restarted elsewhere in the cluster; it gets stuck in the "stopping" state.

Version-Release number of selected component (if applicable):
2.6.9-34.0.1

How reproducible:
1. Start the apache service on node1.
2. Type "reboot" on node1, then immediately pull power.
3. The apache service becomes stuck in the "stopping" state.

In our case, this occurred because we simulated a failure by pressing (and holding) the power button.
On the first button press, an ACPI event is generated and the system begins to halt.  About 6 seconds
into the halt, system power is removed, leaving the node in a "failed" state before the cluster
services have stopped cleanly.



Actual results:
The apache service gets stuck in the "stopping" state.

Expected results:
The apache service is restarted on another node when the failed node is marked as down.

Additional info:
Relevant log messages from non-failing node:

<snip>
May 26 08:52:33 vhaisalnx1 kernel: CMAN: removing node vhaisalnx2 from the cluster : Missed too 
many heartbeats
May 26 08:52:32 vhaisalnx1 last message repeated 9 times
May 26 08:52:33 vhaisalnx1 clurgmgrd[17020]: <debug> Suspend Event 
May 26 08:52:33 vhaisalnx1 fenced[17005]: vhaisalnx2 not a cluster member after 0 sec post_fail_delay
May 26 08:52:33 vhaisalnx1 fenced[17005]: vhaisalnx2 not a cluster member after 0 sec post_fail_delay
May 26 08:52:33 vhaisalnx1 fenced[17005]: fencing node "vhaisalnx2"
May 26 08:52:33 vhaisalnx1 fenced[17005]: fencing node "vhaisalnx2"
May 26 08:52:34 vhaisalnx1 clurgmgrd[17020]: <debug> Sending service states to fd11 
May 26 08:52:45 vhaisalnx1 fenced[17005]: fence "vhaisalnx2" success
May 26 08:52:45 vhaisalnx1 fenced[17005]: fence "vhaisalnx2" success
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> Sending service states to fd11 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> NULL cluster event 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> Membership Change Event 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> Magma Event: Membership Change 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> Magma Event: Membership Change 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> I am node #1 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> State change: vhaisalnx2 DOWN 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> State change: vhaisalnx2 DOWN 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> Evaluating RG apache, state stopping, owner 
(null) 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> RG apache is stopping 
</snip>

Note that the service does not automatically recover from this state.  The only way to recover from
this situation is to restart the cluster services.

Comment 1 Lon Hohberger 2006-06-15 18:06:18 UTC
Created attachment 130996 [details]
Fixes behavior

This is in CVS and will go out with the next update.

Comment 2 Lon Hohberger 2006-06-26 16:31:37 UTC
Created attachment 131553 [details]
Corrected patch; original was backwards

Comment 6 Lon Hohberger 2006-08-15 19:52:41 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0557.html