Red Hat Bugzilla – Bug 149735
Allow resource level failure to trigger failover
Last modified: 2009-04-16 16:16:29 EDT
Description of problem:
When a custom <script> resource encounters an unrecoverable failure at
the application/resource level, it needs to be able to trigger a failover
to another node. Currently failover is initiated only for heartbeat
problems at the CMAN level. When the heartbeat is okay, rgmanager keeps
calling the "recover" command on the script, which keeps returning an error.
A possible solution is to define a specific exit code for the script
that indicates to rgmanager the application should be failed over.
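The proposal above could be sketched roughly as follows. This is only an illustration in Python, not rgmanager code: the exit code value 99 and the names EXIT_FAILOVER and action_for_exit_code are hypothetical, chosen to show how a dedicated exit code might be told apart from an ordinary failure.

```python
# Hypothetical sketch: neither the exit code value nor the names below
# come from rgmanager.
EXIT_OK = 0
EXIT_FAILOVER = 99  # assumed value reserved for "fail over to another node"


def action_for_exit_code(code: int) -> str:
    """Map a script's exit code to the action a manager might take."""
    if code == EXIT_OK:
        return "none"
    if code == EXIT_FAILOVER:
        return "relocate"
    # Any other non-zero code is treated as an ordinary failure that
    # may still be recoverable on the local node.
    return "recover"
```

With this mapping, a script that knows its failure is unrecoverable locally would exit with the reserved code, and every other non-zero code would keep the existing behavior.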
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Define a <script> resource that returns a non-zero exit code for
"status" and "recover" commands. "start" and "stop" commands should
Would it work to simply add a per-group policy:
"Relocate-on-failure" vs. "restart-on-failure" ?
Also, it looks like the script resource shouldn't actually _have_ a
"recover" option, not that this would have any effect on the
behavior. Most system-level init scripts (which is what the script
wrapper is supposed to handle) have never heard of "recover" actions,
and would just return an error code (or worse: success...).
I agree the script resource doesn't need a recover action. (But
rgmanager/src/resources/script.sh currently defines it.)
"Relocate-on-failure" vs. "restart-on-failure" policy option would
work, but with the relocate option there is a risk of an infinite
relocation loop. If you have a situation where a resource fails to
start regardless of node (perhaps due to misconfiguration or problem
with common SAN), the resource will keep relocating. There needs to
be a mechanism to detect this case and disable the resource after a while.
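One way such a mechanism might look is a sliding-window counter: if the resource relocates more than a set number of times within a given period, disable it instead of relocating again. The sketch below is an assumption for discussion, not something rgmanager implements; the class name RelocationGuard and its parameters are hypothetical.

```python
from collections import deque


class RelocationGuard:
    """Hypothetical guard: flag a resource for disabling after too many
    relocations within a sliding time window."""

    def __init__(self, max_relocations: int, window_seconds: float):
        self.max_relocations = max_relocations
        self.window_seconds = window_seconds
        self._times = deque()

    def record_relocation(self, now: float) -> bool:
        """Record a relocation at time `now` (in seconds).

        Returns True if the resource should be disabled rather than
        relocated again.
        """
        self._times.append(now)
        # Drop relocations that fall outside the sliding window.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) > self.max_relocations
```

For example, with max_relocations=2 and window_seconds=60, a third relocation within one minute would trip the guard, while relocations spread far apart would not.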
There was a bug where a node kept trying to restart services locally
instead of relocating them in the event of a "failed start". This is
related, but not the same as what's in this bugzilla.
Recovery policy (restart/relocate/disable) is in CVS.
Recovery was typically 3 steps in clumanager/rgmanager:
(1) Restart locally. If the "start" phase fails, we proceed to step 2.
(2) Relocate to another node. If the "start" phase fails on the other
node, we try the next legal target until all targets are exhausted.
(3) If all nodes are exhausted, stop the service and wait for the next
node transition to recover it.
restart: Default. Reflects the above recovery mechanism for resource
relocate: Skip step (1).
disable: Don't bother trying to recover the service. Disable it (=
prevent it from running).
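The three steps and the three policies above can be modeled in a few lines. This is a simplified Python sketch for illustration only, not the rgmanager implementation; the function recover and the try_start callback are hypothetical names, and the model does not distinguish a stopped service from a disabled one.

```python
from typing import Callable, List, Optional


def recover(policy: str, local: str, others: List[str],
            try_start: Callable[[str], bool]) -> Optional[str]:
    """Return the node the service ends up on, or None if it is left
    stopped or disabled.

    policy:    "restart", "relocate", or "disable".
    local:     node on which the service failed.
    others:    remaining legal targets, in order of preference.
    try_start: hypothetical callback; True if "start" succeeds on a node.
    """
    if policy == "disable":
        return None  # no recovery is attempted
    if policy not in ("restart", "relocate"):
        raise ValueError(f"unknown policy: {policy}")
    # Step (1): restart locally; skipped under the "relocate" policy.
    if policy == "restart" and try_start(local):
        return local
    # Step (2): try each remaining legal target in turn.
    for node in others:
        if try_start(node):
            return node
    # Step (3): all targets exhausted; the service stays stopped.
    return None
```

For instance, recover("relocate", "node1", ["node2", "node3"], try_start) never calls try_start on node1, which is the only difference from the default "restart" policy in this model.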
There is some expectation that the resource agents/scripts have some
intelligence and fail to start if there's a serious configuration problem.
The changes in CVS don't address detection of the "ping-pong service"
effect, which is caused by applications crashing soon after startup
(but reporting successful startup). This is a different issue.