Bug 1797579
| Summary: | Pacemaker fencer should retry failed meta-data commands | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Takashi Kajinami <tkajinam> | |
| Component: | pacemaker | Assignee: | Oyvind Albrigtsen <oalbrigt> | |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | low | |||
| Version: | 8.0 | CC: | cfeist, cluster-maint, kgaillot, lmiccini, mburns, michele, mjuricek, mnovacek, msmazova, pkomarov, pzimek | |
| Target Milestone: | rc | Keywords: | Triaged, ZStream | |
| Target Release: | 8.5 | Flags: | pm-rhel: mirror+ | |
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | pacemaker-2.1.0-1.el8 | Doc Type: | Bug Fix | |
| Doc Text: | Cause: If a fence agent meta-data request fails, Pacemaker previously would consider that result final. While this would be the best response for a permanent issue such as a nonconforming agent, the failure could be transient, for example when Pacemaker is enabled at boot and heavy CPU and/or I/O load at boot causes the meta-data request to time out. Consequence: Pacemaker would not know if the fence agent supports unfencing or other optional features, and would act as if it did not, which could cause unfencing failures and other issues. Fix: If Pacemaker is unable to obtain fence agent meta-data on the first try, it will now periodically retry. Result: Although problems could still occur during a short time window, the meta-data will be obtained when feasible if the failure was transient, and the issues should be recoverable. | Story Points: | --- | |
| Clone Of: | ||||
| : | 1992014 | Environment: | ||
| Last Closed: | 2021-11-09 18:44:49 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | 2.1.0 | |
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1992014 | |||
|
Description
Takashi Kajinami
2020-02-03 12:47:45 UTC
When a fence device is registered with Pacemaker's fencer, which happens when the cluster starts for devices already in the configuration, the fencer attempts to execute the fence agent's meta-data command. If this fails, the fencer is unable to know what the agent supports (such as the list or status commands for dynamic host lists), and so assumes no capabilities. This can cause problems (such as assuming the device can fence all hosts, rather than attempting to get a dynamic host list).

In this case, it appears the meta-data command timed out when pacemaker was started at host boot, and other processes were causing disk I/O slowness.

To be more resilient during such transient issues, a possible enhancement would be for the fencer to remember when a meta-data command failed, and retry it periodically.

RHEL 7 is no longer receiving such enhancements, so I am reassigning this bz to RHEL 8.

*** Bug 1844500 has been marked as a duplicate of this bug. ***

Fixed upstream as of commit 1d33712.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:4267
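
The "remember the failure and retry periodically" idea from this report can be illustrated with a small sketch. This is not Pacemaker's code (the fencer is written in C, and the actual fix is the upstream commit referenced in this report). Every name here (`DeviceRegistry`, `Device`, `MetadataError`, `make_flaky_fetch`, `fence_example`) is hypothetical, and the fetch function is a stand-in for running an agent's meta-data command.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set


class MetadataError(Exception):
    """Raised by the (hypothetical) fetch function when meta-data cannot be obtained."""


@dataclass
class Device:
    name: str
    agent: str
    # None means meta-data has not been obtained yet, so capabilities are unknown.
    capabilities: Optional[Set[str]] = None


class DeviceRegistry:
    """Hypothetical registry that remembers failed meta-data requests."""

    def __init__(self, fetch: Callable[[str], Set[str]]):
        self._fetch = fetch                  # stand-in for the agent's meta-data command
        self.devices: Dict[str, Device] = {}
        self.pending: Set[str] = set()       # devices whose meta-data request failed

    def register(self, name: str, agent: str) -> Device:
        device = Device(name, agent)
        self.devices[name] = device
        self._try_fetch(device)              # first attempt happens at registration
        return device

    def _try_fetch(self, device: Device) -> bool:
        try:
            device.capabilities = self._fetch(device.agent)
        except MetadataError:
            self.pending.add(device.name)    # remember the failure instead of treating it as final
            return False
        self.pending.discard(device.name)
        return True

    def retry_pending(self) -> int:
        """Retry every device still lacking meta-data; return how many succeeded."""
        # sorted() copies the set, so _try_fetch may modify self.pending safely.
        return sum(self._try_fetch(self.devices[n]) for n in sorted(self.pending))


def make_flaky_fetch(failures: int) -> Callable[[str], Set[str]]:
    """Build a fetch function that fails `failures` times, simulating a boot-time timeout."""
    remaining = {"n": failures}

    def fetch(agent: str) -> Set[str]:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise MetadataError(f"{agent}: meta-data request timed out")
        return {"list", "status", "on"}

    return fetch


registry = DeviceRegistry(make_flaky_fetch(failures=1))
device = registry.register("fence1", "fence_example")
first_attempt = device.capabilities      # still unknown after the failed first try
recovered = registry.retry_pending()     # in a daemon, a periodic timer would call this
```

In a long-running daemon, `retry_pending()` would be driven by a timer rather than called by hand. The point of the sketch is only that a failed request leaves the device marked as pending, so a transient failure at boot does not permanently hide the agent's capabilities.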