Bug 149735
Summary: | Allow resource level failure to trigger failover | ||
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Jiho Hahm <jhahm> |
Component: | rgmanager | Assignee: | Lon Hohberger <lhh> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Cluster QE <mspqa-list> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4 | CC: | cluster-maint |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i386 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2005-07-12 15:40:44 UTC | Type: | --- |
Description
Jiho Hahm
2005-02-25 21:07:49 UTC
Would it work to simply add a per-group policy: "Relocate-on-failure" vs. "restart-on-failure"?

Also, it looks like the script resource shouldn't actually _have_ a "recover" option, not that this would have any effect on the behavior. Most system-level init scripts (which is what the script wrapper is supposed to handle) have never heard of "recover" actions, and would just return an error code (or worse: success...).

I agree the script resource doesn't need a recover action. (But rgmanager/src/resources/script.sh currently defines it.)

A "Relocate-on-failure" vs. "restart-on-failure" policy option would work, but with the relocate option there is a possibility of an infinite relocation loop. If a resource fails to start regardless of node (perhaps due to misconfiguration or a problem with a common SAN), it will keep relocating. There needs to be a mechanism to detect this case and disable the resource after a while.

There was a bug where a node kept trying to restart services locally instead of relocating them in the event of a "failed start". This is related to, but not the same as, what's in this bugzilla.

Recovery policy (restart/relocate/disable) is in CVS.

Recovery was typically 3 steps in clumanager/rgmanager:
(1) Restart locally. If the "start" phase fails, we proceed to step 2.
(2) Relocate to another node. If the "start" phase fails on the other node, we try the next legal target until all targets are exhausted.
(3) If all nodes are exhausted, stop the service and wait for the next node transition to recover it.

restart: Default. Reflects the above recovery mechanism for resource groups.
relocate: Skip step (1).
disable: Don't bother trying to recover the service. Disable it (= prevent it from running).

There is some expectation that the resource agents/scripts have some intelligence and fail to start if there's a serious configuration problem.
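To make the three recovery steps and the restart/relocate/disable policies concrete, here is a minimal Python sketch of the decision flow. It is an illustration only, not rgmanager's code (rgmanager is written in C and shell): the function `recover`, its parameters, and the `try_start` callback are all hypothetical names invented for this example.

```python
from typing import Callable, List, Optional, Tuple


def recover(
    policy: str,
    local_node: str,
    legal_targets: List[str],
    try_start: Callable[[str], bool],
) -> Tuple[str, Optional[str]]:
    """Model the recovery flow described above (hypothetical sketch).

    Returns (state, node), where state is "started", "stopped" or
    "disabled" and node is where the service runs, or None.
    """
    if policy == "disable":
        # disable: don't try to recover, just prevent the service from running.
        return ("disabled", None)
    if policy not in ("restart", "relocate"):
        raise ValueError(f"unknown recovery policy: {policy!r}")

    # Step (1): restart locally. The "relocate" policy skips this step.
    if policy == "restart" and try_start(local_node):
        return ("started", local_node)

    # Step (2): try each remaining legal target in order.
    for node in legal_targets:
        if node == local_node:
            continue
        if try_start(node):
            return ("started", node)

    # Step (3): all targets exhausted; stop the service and wait for the
    # next node transition.
    return ("stopped", None)


nodes = ["node1", "node2", "node3"]
print(recover("restart", "node1", nodes, lambda n: n == "node1"))
print(recover("relocate", "node1", nodes, lambda n: n == "node1"))
print(recover("restart", "node1", nodes, lambda n: False))
```

In this model a start failure on every node ends in the "stopped" state because the target list is finite, which matches step (3): the service is only retried at the next node transition.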
The changes in CVS don't address detection of the "ping-pong service" effect, which is caused by applications crashing soon after startup (but reporting successful startup). This is a different issue.
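For illustration only, one common way to spot a "ping-pong" service is to count recoveries inside a sliding time window and flag the service once the count exceeds a threshold. The sketch below is not part of the CVS changes discussed here; the class `RestartTracker` and its parameters are hypothetical names, and the thresholds are arbitrary.

```python
from collections import deque


class RestartTracker:
    """Flag a service that is recovered too often within a time window."""

    def __init__(self, max_restarts: int, window_seconds: float) -> None:
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._times = deque()  # timestamps of recent recoveries

    def record_restart(self, now: float) -> bool:
        """Record a recovery at time `now` (seconds).

        Returns True if more than `max_restarts` recoveries fall inside
        the window ending at `now`, i.e. the service looks like it is
        ping-ponging and might be disabled instead of recovered again.
        """
        self._times.append(now)
        # Drop recoveries that are older than the window.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) > self.max_restarts


tracker = RestartTracker(max_restarts=3, window_seconds=60.0)
for t in (0.0, 10.0, 20.0, 30.0):
    print(t, tracker.record_restart(t))
```

A service that reports a successful start and then crashes every few seconds trips the check on the fourth recovery here, while one that fails only occasionally never accumulates enough entries in the window.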