193255 – rgmanager fails to properly start service after node failure

Summary: rgmanager fails to properly start service after node failure

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	rgmanager
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-05-26 16:42 UTC by Justin Nemmers
Modified:	2009-04-16 20:20 UTC
CC List:	2 users
Fixed In Version:	RHBA-2006-0557
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-08-15 19:52:41 UTC
Embargoed:

Attachments
Fixes behavior (555 bytes, text/x-patch) 2006-06-15 18:06 UTC, Lon Hohberger	no flags
Corrected patch; original was backwards (600 bytes, patch) 2006-06-26 16:31 UTC, Lon Hohberger	no flags

Description Justin Nemmers 2006-05-26 16:42:08 UTC

Description of problem:

If a node fails while a service is stopping, the service is not restarted elsewhere in the cluster,
because it gets stuck in the "stopping" state.

Version-Release number of selected component (if applicable):
2.6.9-34.0.1

How reproducible:
Start the service on node1.
Type "reboot" on node1, and immediately pull power.
Service apache should be stuck in the "stopping" state.

In our case, this occurred because we simulated a failure by pressing and holding the power button.
The first button press generates an ACPI event, and the system begins a halt. About 6 seconds into
the halt, system power is removed, so the node is marked "failed" before the cluster services have
finished stopping.



Actual results:
Service apache gets stuck in state "stopping"

Expected results:
Service apache is restarted on new node when former node is marked as failed.
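
The expected behavior can be sketched as a small decision function. This is only an illustrative model, not rgmanager's code or the attached patch: the `Service` type, the `last_owner` field, the `RECOVERABLE_STATES` set, and `should_recover` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Assumed for illustration: states in which a service that was on a failed
# node should be recovered on a surviving node. "stopping" is the state
# this report is about.
RECOVERABLE_STATES = {"started", "starting", "stopping"}


@dataclass
class Service:
    name: str
    state: str
    owner: Optional[str]               # current owner, may be None
    last_owner: Optional[str] = None   # hypothetical: node that last ran it


def should_recover(svc: Service, live_nodes: Set[str]) -> bool:
    """Return True if the service should be restarted on a live node."""
    if svc.state not in RECOVERABLE_STATES:
        return False
    # Fall back to the last known owner when no owner is recorded.
    node = svc.owner or svc.last_owner
    return node is not None and node not in live_nodes
```

With a service in state "stopping" whose last known node is no longer a member, `should_recover` returns True, which is the outcome described under "Expected results".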

Additional info:
Relevant log messages from non-failing node:

<snip>
May 26 08:52:33 vhaisalnx1 kernel: CMAN: removing node vhaisalnx2 from the cluster : Missed too many heartbeats
May 26 08:52:32 vhaisalnx1 last message repeated 9 times
May 26 08:52:33 vhaisalnx1 clurgmgrd[17020]: <debug> Suspend Event 
May 26 08:52:33 vhaisalnx1 fenced[17005]: vhaisalnx2 not a cluster member after 0 sec post_fail_delay
May 26 08:52:33 vhaisalnx1 fenced[17005]: vhaisalnx2 not a cluster member after 0 sec post_fail_delay
May 26 08:52:33 vhaisalnx1 fenced[17005]: fencing node "vhaisalnx2"
May 26 08:52:33 vhaisalnx1 fenced[17005]: fencing node "vhaisalnx2"
May 26 08:52:34 vhaisalnx1 clurgmgrd[17020]: <debug> Sending service states to fd11 
May 26 08:52:45 vhaisalnx1 fenced[17005]: fence "vhaisalnx2" success
May 26 08:52:45 vhaisalnx1 fenced[17005]: fence "vhaisalnx2" success
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> Sending service states to fd11 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> NULL cluster event 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> Membership Change Event 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> Magma Event: Membership Change 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> Magma Event: Membership Change 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> I am node #1 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> State change: vhaisalnx2 DOWN 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <info> State change: vhaisalnx2 DOWN 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> Evaluating RG apache, state stopping, owner (null) 
May 26 08:52:52 vhaisalnx1 clurgmgrd[17020]: <debug> RG apache is stopping 
</snip>

Note that the service does not automatically recover from this state.  The only way to recover from this 
situation is to restart cluster services.
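
To check whether a surviving node has hit this condition, the log pattern above can be searched for mechanically. The following is a rough triage sketch that assumes the message formats match the excerpt in this report ("State change: <node> DOWN" followed by "Evaluating RG <name>, state stopping"); the function name `find_stuck_services` is made up for this example.

```python
import re

# Patterns based solely on the log lines quoted in this report.
DOWN_RE = re.compile(r"State change: (\S+) DOWN")
EVAL_RE = re.compile(r"Evaluating RG (\S+), state (\w+)")


def find_stuck_services(lines):
    """Return services evaluated in state 'stopping' after a node went DOWN."""
    node_down = False
    stuck = []
    for line in lines:
        if DOWN_RE.search(line):
            node_down = True
            continue
        match = EVAL_RE.search(line)
        if match and node_down and match.group(2) == "stopping":
            name = match.group(1)
            if name not in stuck:
                stuck.append(name)
    return stuck
```

It could be run over the syslog lines of the surviving node, for example `find_stuck_services(open("/var/log/messages"))`, where the log path is an assumption about the local syslog configuration.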

Comment 1 Lon Hohberger 2006-06-15 18:06:18 UTC

Created attachment 130996
Fixes behavior

This is in CVS and will go out with the next update.

Comment 2 Lon Hohberger 2006-06-26 16:31:37 UTC

Created attachment 131553
Corrected patch; original was backwards

Comment 6 Lon Hohberger 2006-08-15 19:52:41 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0557.html
