Bug 193255 - rgmanager fails to properly start service after node failure
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: rgmanager
Version: 4
Hardware: All Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Depends On:
Blocks:
Reported: 2006-05-26 12:42 EDT by Justin Nemmers
Modified: 2009-04-16 16:20 EDT
CC: 2 users

See Also:
Fixed In Version: RHBA-2006-0557
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2006-08-15 15:52:41 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments
Fixes behavior (555 bytes, text/x-patch)
2006-06-15 14:06 EDT, Lon Hohberger
no flags
Corrected patch; original was backwards (600 bytes, patch)
2006-06-26 12:31 EDT, Lon Hohberger
no flags

Description Justin Nemmers 2006-05-26 12:42:08 EDT
Description of problem:

If a node fails while a service is stopping, the service will not be restarted elsewhere in the cluster, because it gets stuck in the "stopping" state.

Version-Release number of selected component (if applicable):
2.6.9-34.0.1

How reproducible:
1. Start the apache service on node1.
2. Type "reboot" on node1, and immediately pull the power.
3. The apache service gets stuck in the "stopping" state.

In our case, this occurred because we simulated a failure by pressing (and holding) the power button.
On the first button press, an ACPI event is generated and the system begins to halt.  About 6 seconds
into the halt, system power is removed, resulting in a "failed" status before the cluster services have
stopped cleanly.



Actual results:
Service apache gets stuck in the "stopping" state.

Expected results:
Service apache is restarted on a surviving node once the former owner is marked as failed.

Additional info:
Relevant log messages from non-failing node:

<snip>
May 26 08:52:33 vhaisalnx1 kernel: CMAN: removing node vhaisalnx2 from the cluster : Missed too many heartbeats
May 26 08:52:32 vhaisalnx1 last message repeated 9 times
May 26 08:52:33 vhaisalnx1 clurgmgrd[17020]: <debug> Suspend Event 
May 26 08:52:33 vhaisalnx1 fenced[17005]: vhaisalnx2 not a cluster member after 0 sec post_fail_delay
May 26 08:52:33 vhaisalnx1 fenced[17005]: vhaisalnx2 not a cluster member after 0 sec post_fail_delay
May 26 08:52:33 vhaisalnx1 fenced[17005]: fencing node "vhaisalnx2"
May 26 08:52:33 vhaisalnx1 fenced[17005]: fencing node "vhaisalnx2"
May 26 08:52:34 vhaisalnx1 clurgmgrd[17020]: <debug> Sending service states to fd11 
May 26 08:52:45 vhaisalnx1 fenced[17005]: fence "vhaisalnx2" success
May 26 08:52:45 vhaisalnx1 fenced[17005]: fence "vhaisalnx2" success
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> Sending service states to fd11 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> NULL cluster event 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> Membership Change Event 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> Magma Event: Membership Change 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> Magma Event: Membership Change 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> I am node #1 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> State change: vhaisalnx2 DOWN 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> State change: vhaisalnx2 DOWN 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> Evaluating RG apache, state stopping, owner (null) 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> RG apache is stopping 
</snip>

Note that the service does not automatically recover from this state.  The only way to recover from this
situation is to restart the cluster services.
Comment 1 Lon Hohberger 2006-06-15 14:06:18 EDT
Created attachment 130996 [details]
Fixes behavior

This is in CVS and will go out with the next update.
Comment 2 Lon Hohberger 2006-06-26 12:31:37 EDT
Created attachment 131553 [details]
Corrected patch; original was backwards
Comment 6 Lon Hohberger 2006-08-15 15:52:41 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0557.html
