Bug 149735 - Allow resource level failure to trigger failover
Summary: Allow resource level failure to trigger failover
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: rgmanager
Version: 4
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2005-02-25 21:07 UTC by Jiho Hahm
Modified: 2009-04-16 20:16 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-07-12 15:40:44 UTC
Embargoed:



Description Jiho Hahm 2005-02-25 21:07:49 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5)
Gecko/20041107 Firefox/1.0

Description of problem:
When a custom <script> resource encounters an unrecoverable failure at
the application/resource level, it needs to be able to trigger a failover
to another node.  Currently, failover is initiated only for heartbeat
problems at the CMAN level.  When the heartbeat is okay, rgmanager keeps
calling the "recover" command on the script, which keeps returning an error.

A possible solution is to define a specific exit code for the script
that indicates to rgmanager the application should be failed over.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:

1. Define a <script> resource that returns a non-zero exit code for the
"status" and "recover" commands.  The "start" and "stop" commands should
work normally.
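
A minimal sketch of such a script, for illustration only.  The function
name faulty_service and the messages are hypothetical; the exit codes
simply follow the description above (zero for "start" and "stop",
non-zero for "status" and "recover").

```shell
#!/bin/sh
# Hypothetical reproduction script: "start" and "stop" succeed,
# while "status" and "recover" always report failure.
faulty_service() {
    case "$1" in
        start|stop)
            echo "$1: ok"
            return 0
            ;;
        status|recover)
            echo "$1: simulated unrecoverable failure" >&2
            return 1
            ;;
        *)
            echo "usage: $0 {start|stop|status|recover}" >&2
            return 2
            ;;
    esac
}

# Dispatch only when an action was passed on the command line.
if [ "$#" -gt 0 ]; then
    faulty_service "$1"
    exit $?
fi
```

With this in place as the <script> resource, the service should start
normally and then fail its first status check.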


Additional info:

Comment 1 Lon Hohberger 2005-02-25 21:20:57 UTC
Would it work to simply add a per-group policy:

"Relocate-on-failure" vs. "restart-on-failure" ?

Also, it looks like the script resource shouldn't actually _have_ a
"recover" option, not that this would have any effect on the
behavior.  Most system-level init scripts (which is what the script
wrapper is supposed to handle) have never heard of "recover" actions,
and would just return an error code (or worse: success...).


Comment 2 Jiho Hahm 2005-02-25 23:12:26 UTC
I agree the script resource doesn't need a recover action.  (But
rgmanager/src/resources/script.sh currently defines it.)

A "Relocate-on-failure" vs. "restart-on-failure" policy option would
work, but with the relocate option there is a possibility of an infinite
relocation loop.  If a resource fails to start regardless of node
(perhaps due to misconfiguration or a problem with a common SAN), the
resource will keep relocating.  There needs to be a mechanism to detect
this case and disable the resource after a while.

Comment 3 Lon Hohberger 2005-02-28 23:16:40 UTC
There was a bug where a node kept trying to restart services locally
instead of relocating them in the event of a "failed start".  That bug is
related to, but not the same as, the issue in this bugzilla.


Comment 4 Lon Hohberger 2005-03-02 17:32:45 UTC
Recovery policy (restart/relocate/disable) is in CVS.

Recovery was typically 3 steps in clumanager/rgmanager:
(1) Restart locally.  If the "start" phase fails, we proceed to step 2.
(2) Relocate to another node.  If the "start" phase fails on the other
node, we try the next legal target until all targets are exhausted.
(3) If all nodes are exhausted, stop the service and wait for the next
node transition to recover it.

restart: Default.  Reflects the above recovery mechanism for resource
groups.

relocate: Skip step (1).

disable: Don't bother trying to recover the service.  Disable it (=
prevent it from running).
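
The three policies can be sketched as a small shell function.  This is
an illustration of the steps described above, not rgmanager's actual
code: recover_service, try_start and the GOOD_NODES variable are all
hypothetical names, and try_start merely pretends that a start succeeds
on the nodes listed in GOOD_NODES.

```shell
#!/bin/sh
# Illustrative only; none of these names exist in rgmanager.

# Hypothetical stand-in for starting the service on a node:
# succeeds only for nodes listed in $GOOD_NODES.
try_start() {
    case " $GOOD_NODES " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# recover_service POLICY LOCAL_NODE [OTHER_NODE...]
# Prints "started on <node>", "stopped", or "disabled".
recover_service() {
    policy=$1
    local_node=$2
    shift 2

    # disable: do not attempt recovery at all.
    if [ "$policy" = "disable" ]; then
        echo "disabled"
        return 0
    fi

    # Step 1: restart locally (only for the default "restart" policy).
    if [ "$policy" = "restart" ] && try_start "$local_node"; then
        echo "started on $local_node"
        return 0
    fi

    # Step 2: try each remaining legal target in order.
    for node in "$@"; do
        if try_start "$node"; then
            echo "started on $node"
            return 0
        fi
    done

    # Step 3: all targets exhausted; leave the service stopped.
    echo "stopped"
    return 1
}
```

For example, with GOOD_NODES="node2", calling
recover_service restart node1 node2 node3 fails the local restart on
node1 and then starts the service on node2.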

There is some expectation that the resource agents/scripts have some
intelligence and fail to start if there's a serious configuration problem.

The changes in CVS don't address detection of the "ping-pong service"
effect, which is caused by applications crashing soon after startup
(but reporting successful startup).  This is a different issue.

