Bug 193255 - rgmanager fails to properly start service after node failure
Summary: rgmanager fails to properly start service after node failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: rgmanager
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2006-05-26 16:42 UTC by Justin Nemmers
Modified: 2009-04-16 20:20 UTC (History)
2 users

Fixed In Version: RHBA-2006-0557
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-08-15 19:52:41 UTC
Embargoed:


Attachments
Fixes behavior (555 bytes, text/x-patch)
2006-06-15 18:06 UTC, Lon Hohberger
Corrected patch; original was backwards (600 bytes, patch)
2006-06-26 16:31 UTC, Lon Hohberger

Description Justin Nemmers 2006-05-26 16:42:08 UTC
Description of problem:

If a node fails while a service is stopping, the service will not be restarted elsewhere in the
cluster; it gets stuck in the "stopping" state.

Version-Release number of selected component (if applicable):
2.6.9-34.0.1

How reproducible:
1. Start a service (apache) on node1.
2. Type "reboot" on node1, then immediately pull power.
3. Service apache is left stuck in the "stopping" state.

In our case, this occurred because we simulated a failure by pressing (and holding) the power button.
On the first button press, an ACPI event is generated and the system begins a halt. About six seconds
into the halt, system power is removed, leaving the node in a "failed" state before the cluster
services have stopped cleanly.



Actual results:
Service apache gets stuck in the "stopping" state.

Expected results:
Service apache is restarted on another node when the former node is marked as failed.

Additional info:
Relevant log messages from the non-failing node:

<snip>
May 26 08:52:33 vhaisalnx1 kernel: CMAN: removing node vhaisalnx2 from the cluster : Missed too many heartbeats
May 26 08:52:32 vhaisalnx1 last message repeated 9 times
May 26 08:52:33 vhaisalnx1 clurgmgrd[17020]: <debug> Suspend Event 
May 26 08:52:33 vhaisalnx1 fenced[17005]: vhaisalnx2 not a cluster member after 0 sec post_fail_delay
May 26 08:52:33 vhaisalnx1 fenced[17005]: vhaisalnx2 not a cluster member after 0 sec post_fail_delay
May 26 08:52:33 vhaisalnx1 fenced[17005]: fencing node "vhaisalnx2"
May 26 08:52:33 vhaisalnx1 fenced[17005]: fencing node "vhaisalnx2"
May 26 08:52:34 vhaisalnx1 clurgmgrd[17020]: <debug> Sending service states to fd11 
May 26 08:52:45 vhaisalnx1 fenced[17005]: fence "vhaisalnx2" success
May 26 08:52:45 vhaisalnx1 fenced[17005]: fence "vhaisalnx2" success
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> Sending service states to fd11 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> NULL cluster event 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> Membership Change Event 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> Magma Event: Membership Change 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> Magma Event: Membership Change 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> I am node #1 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> State change: vhaisalnx2 DOWN 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> State change: vhaisalnx2 DOWN 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> Evaluating RG apache, state stopping, owner (null) 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> RG apache is stopping 
</snip>

Note that the service does not automatically recover from this state.  The only way to recover is to
restart the cluster services.
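For illustration only, here is a minimal C sketch (not the actual rgmanager/clurgmgrd source; the
types and function names are hypothetical) of the kind of membership-change re-evaluation involved:
a resource group left in the transitional "stopping" state whose owner is no longer a cluster member
must be forced back to "stopped" so the normal failover path can start it on a surviving node.

/*
 * Hypothetical illustration only -- not the actual rgmanager code.
 * Sketches the check needed when a resource group is stranded in a
 * transitional "stopping" state after its owner node dies.
 */
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical resource-group states, loosely mirroring the log output. */
typedef enum { RG_STARTED, RG_STOPPING, RG_STOPPED, RG_FAILED } rg_state_t;

struct rg {
    const char *name;
    rg_state_t  state;
    int         owner;      /* node ID of the current owner, -1 if none */
};

/* Hypothetical membership test: is the node still a cluster member? */
static bool node_is_member(int node_id, const int *members, int n_members)
{
    for (int i = 0; i < n_members; i++)
        if (members[i] == node_id)
            return true;
    return false;
}

/*
 * Re-evaluate one resource group after a membership change.  The key point:
 * a group stuck in RG_STOPPING whose owner has left the cluster must be
 * moved to RG_STOPPED so the normal failover path can restart it elsewhere.
 */
static void evaluate_rg(struct rg *rg, const int *members, int n_members)
{
    bool owner_alive = rg->owner >= 0 &&
                       node_is_member(rg->owner, members, n_members);

    if (rg->state == RG_STOPPING && !owner_alive) {
        printf("RG %s: owner gone while stopping, marking stopped\n",
               rg->name);
        rg->state = RG_STOPPED;   /* now eligible for recovery/restart */
    }
}

int main(void)
{
    /* Node 2 (the former owner) has just been fenced and removed. */
    int members[] = { 1 };
    struct rg apache = { "apache", RG_STOPPING, 2 };

    evaluate_rg(&apache, members, 1);
    printf("apache state is now %s\n",
           apache.state == RG_STOPPED ? "stopped" : "still stopping");
    return 0;
}

The real fix is the small patch attached in the comments that follow; the sketch above only
illustrates the state-transition idea.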

Comment 1 Lon Hohberger 2006-06-15 18:06:18 UTC
Created attachment 130996 [details]
Fixes behavior

This is in CVS and will go out with the next update.

Comment 2 Lon Hohberger 2006-06-26 16:31:37 UTC
Created attachment 131553 [details]
Corrected patch; original was backwards

Comment 6 Lon Hohberger 2006-08-15 19:52:41 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0557.html

