Bug 149735

Summary:	Allow resource level failure to trigger failover
Product:	[Retired] Red Hat Cluster Suite	Reporter:	Jiho Hahm <jhahm>
Component:	rgmanager	Assignee:	Lon Hohberger <lhh>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Cluster QE <mspqa-list>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4	CC:	cluster-maint
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2005-07-12 15:40:44 UTC	Type:	---

Description Jiho Hahm 2005-02-25 21:07:49 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5)
Gecko/20041107 Firefox/1.0

Description of problem:
When a custom <script> resource encounters an unrecoverable failure at
the application/resource level, it needs to be able to trigger a
failover to another node.  Currently failover is initiated only for
heartbeat problems at the CMAN level.  When the heartbeat is okay,
rgmanager keeps calling the "recover" command on the script, which
keeps returning an error.

A possible solution is to define a specific exit code for the script
that indicates to rgmanager the application should be failed over.
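
The proposal can be pictured with the sketch below.  It is only an
illustration: the exit code value 99 and the function name
decide_recovery are hypothetical and are not defined by rgmanager.

```shell
#!/bin/sh
# Hypothetical exit code meaning "fail me over"; not an rgmanager constant.
FAILOVER_RC=99

# Map the exit code of a script's "status" call to the recovery action a
# resource manager could take under this proposal.
decide_recovery() {
    case "$1" in
        0)              echo "none" ;;      # resource is healthy
        "$FAILOVER_RC") echo "relocate" ;;  # unrecoverable here, try another node
        *)              echo "restart" ;;   # ordinary failure, recover locally
    esac
}
```

For example, "decide_recovery 99" prints "relocate", while any other
non-zero code prints "restart".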

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:

1. Define a <script> resource that returns a non-zero exit code for
the "status" and "recover" commands.  The "start" and "stop" commands
should work normally.
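
A minimal script matching this step might look like the sketch below.
The function name script_action and the exit values 1 and 2 are
assumptions made for illustration; any non-zero value for "status" and
"recover" should reproduce the behavior.

```shell
#!/bin/sh
# Minimal reproducer: "start" and "stop" succeed, while "status" and
# "recover" always fail.
script_action() {
    case "$1" in
        start|stop)     return 0 ;;  # work normally
        status|recover) return 1 ;;  # always report failure
        *)              echo "usage: {start|stop|status|recover}" >&2
                        return 2 ;;
    esac
}

# A standalone script would end with:
#   script_action "$1"; exit $?
```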


Additional info:

Comment 1 Lon Hohberger 2005-02-25 21:20:57 UTC

Would it work to simply add a per-group policy:

"Relocate-on-failure" vs. "restart-on-failure" ?

Also, it looks like the script resource shouldn't actually _have_ a
"recover" option, not that this would have any effect on the
behavior.  Most system-level init scripts (which is what the script
wrapper is supposed to handle) have never heard of "recover" actions,
and would just return an error code (or worse: success...).

Comment 2 Jiho Hahm 2005-02-25 23:12:26 UTC

I agree the script resource doesn't need a recover action.  (But
rgmanager/src/resources/script.sh currently defines it.)

A "Relocate-on-failure" vs. "restart-on-failure" policy option would
work, but with the relocate option there is a possibility of an
infinite relocation loop.  If a resource fails to start regardless of
node (perhaps due to misconfiguration or a problem with a common SAN),
the resource will keep relocating.  There needs to be a mechanism to
detect this case and disable the resource after a while.

Comment 3 Lon Hohberger 2005-02-28 23:16:40 UTC

There was a bug where a node kept trying to restart services locally
instead of relocating them in the event of a "failed start".  This is
related to, but not the same as, what's in this bugzilla.

Comment 4 Lon Hohberger 2005-03-02 17:32:45 UTC

Recovery policy (restart/relocate/disable) is in CVS.

Recovery was typically 3 steps in clumanager/rgmanager:
(1) Restart locally.  If the "start" phase fails, we proceed to step 2.
(2) Relocate to another node.  If the "start" phase fails on the other
node, we try the next legal target until all targets are exhausted.
(3) If all nodes are exhausted, stop the service and wait for the next
node transition to recover it.

restart: Default.  Reflects the above recovery mechanism for resource
groups.

relocate: Skip step (1).

disable: Don't bother trying to recover the service.  Disable it (=
prevent it from running).
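
The three steps and the three policies above can be sketched as a
small simulation.  This is not rgmanager code: try_start, GOOD_NODES
and recover_service are hypothetical names, and the node list stands
in for the "legal targets" of a service.

```shell
#!/bin/sh
# try_start stands in for starting the service on a node.  Here it
# succeeds only on nodes listed in GOOD_NODES (a hypothetical variable).
GOOD_NODES=""
try_start() {
    case " $GOOD_NODES " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# recover_service POLICY LOCAL_NODE [OTHER_NODE...]
# Prints where the service ends up: a node name, "stopped" or "disabled".
recover_service() {
    policy=$1; local_node=$2; shift 2
    case "$policy" in
        disable)
            echo "disabled"; return 0 ;;
        restart)
            # Step 1: restart locally.
            if try_start "$local_node"; then echo "$local_node"; return 0; fi ;;
        relocate)
            ;;  # skip step 1
        *)
            echo "unknown policy: $policy" >&2; return 2 ;;
    esac
    # Step 2: try each remaining legal target in order.
    for node in "$@"; do
        if try_start "$node"; then echo "$node"; return 0; fi
    done
    # Step 3: all targets exhausted; stop and wait for a node transition.
    echo "stopped"
    return 1
}
```

With GOOD_NODES="node3", "recover_service restart node1 node2 node3"
prints "node3".  With GOOD_NODES="node1", "recover_service relocate
node1 node2" prints "stopped", because the relocate policy never
retries the local node.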

There is some expectation that the resource agents/scripts have some
intelligence and fail to start if there's a serious configuration
problem.

The changes in CVS don't address detection of the "ping-pong service"
effect, which is caused by applications crashing soon after startup
(but reporting successful startup).  This is a different issue.