From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0
Description of problem:
When a custom <script> resource encounters an unrecoverable failure at the application/resource level, it needs to be able to trigger a failover to another node. Currently, failover is initiated only for heartbeat problems at the CMAN level. When the heartbeat is okay, rgmanager keeps calling the "recover" command on the script, which keeps returning an error. A possible solution is to define a specific exit code for the script that tells rgmanager the application should be failed over.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Define a <script> resource that returns a non-zero exit code for the "status" and "recover" commands. The "start" and "stop" commands should work normally.
Additional info:
Would it work to simply add a per-group policy: "Relocate-on-failure" vs. "restart-on-failure"? Also, it looks like the script resource shouldn't actually _have_ a "recover" option, not that this would have any effect on the behavior. Most system-level init scripts (which is what the script wrapper is supposed to handle) have never heard of "recover" actions, and would just return an error code (or worse: success...).
I agree that the script resource doesn't need a recover action. (But rgmanager/src/resources/script.sh currently defines it.) A "Relocate-on-failure" vs. "restart-on-failure" policy option would work, but with the relocate option there is a possibility of an infinite relocation loop. If the resource fails to start regardless of node (perhaps due to misconfiguration or a problem with a common SAN), it will keep relocating. There needs to be a mechanism to detect this case and disable the resource after a while.
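To make the loop-detection idea concrete, here is a minimal sketch in Python. It is not rgmanager code; the class name, `max_failures`, and `window_seconds` are all hypothetical. It assumes the manager records a timestamp each time a start attempt fails and disables the resource once too many failures land inside a sliding time window.

```python
from collections import deque


class RelocationGuard:
    """Hypothetical helper: decide when repeated start failures mean 'disable'."""

    def __init__(self, max_failures: int, window_seconds: float):
        self.max_failures = max_failures      # assumed threshold, not an rgmanager option
        self.window_seconds = window_seconds  # assumed sliding-window length
        self._failures = deque()              # timestamps of recent failed starts

    def record_failure(self, now: float) -> bool:
        """Record a failed start at time `now`.

        Returns True when the resource should be disabled instead of
        being relocated again.
        """
        self._failures.append(now)
        # Forget failures that are older than the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures
```

With `RelocationGuard(3, 60)`, three failed starts within a minute would trip the guard, while failures spread far apart would not.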
There was a bug where a node kept trying to restart services locally instead of relocating them in the event of a "failed start". This is related, but not the same as what's in this bugzilla.
Recovery policy (restart/relocate/disable) is in CVS. Recovery was typically 3 steps in clumanager/rgmanager:
(1) Restart locally. If the "start" phase fails, we proceed to step 2.
(2) Relocate to another node. If the "start" phase fails on the other node, we try the next legal target until all targets are exhausted.
(3) If all nodes are exhausted, stop the service and wait for the next node transition to recover it.
Policy values:
restart: Default. Reflects the above recovery mechanism for resource groups.
relocate: Skip step (1).
disable: Don't bother trying to recover the service. Disable it (= prevent it from running).
There is some expectation that the resource agents/scripts have some intelligence and fail to start if there's a serious configuration problem. The changes in CVS don't address detection of the "ping-pong service" effect, which is caused by applications crashing soon after startup (but reporting successful startup). This is a different issue.
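The three steps and three policies above can be sketched as follows. This is only an illustration in Python, not the rgmanager implementation; the function name, the `start` callback, and the return values are assumptions made for the example.

```python
from typing import Callable, List


def recover(policy: str, local: str, targets: List[str],
            start: Callable[[str], bool]) -> str:
    """Illustrative recovery flow (hypothetical names throughout).

    `start(node)` is assumed to return True if the service started on
    that node. Returns the node now running the service, "stopped" if
    every target failed, or "disabled" for the disable policy.
    """
    if policy not in ("restart", "relocate", "disable"):
        raise ValueError(f"unknown recovery policy: {policy}")

    if policy == "disable":
        return "disabled"          # no recovery attempt at all

    # Step 1: restart locally (skipped by the "relocate" policy).
    if policy == "restart" and start(local):
        return local

    # Step 2: try each other legal target in order.
    for node in targets:
        if node != local and start(node):
            return node

    # Step 3: all targets exhausted; stay stopped until the next
    # node transition.
    return "stopped"
```

Note that, as described above, this flow relies on `start` actually failing when something is wrong; a service that starts successfully and crashes shortly afterwards is not caught by it.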