Bug 1905820
| Field | Value |
|---|---|
| Summary | LVM-activate: Unexpected behavior when data disks get DID_BAD_TARGET on Passive Node |
| Product | Red Hat Enterprise Linux 8 |
| Reporter | sreekantha <sreekantha> |
| Component | resource-agents |
| Assignee | Oyvind Albrigtsen <oalbrigt> |
| Status | CLOSED ERRATA |
| QA Contact | cluster-qe <cluster-qe> |
| Severity | high |
| Priority | unspecified |
| Version | 8.2 |
| CC | agk, cluster-maint, fdinitto, jobaker, mjuricek, nwahl, phagara, rraghotham, sdivya, singhrobin, sramasamy, tutikas |
| Target Milestone | rc |
| Keywords | Triaged |
| Target Release | 8.0 |
| Hardware | Unspecified |
| OS | Linux |
| Fixed In Version | resource-agents-4.9.0-30.el8 |
| Clones | 2102126 (view as bug list) |
| Bug Blocks | 2102126 |
| Type | Bug |
| Last Closed | 2023-05-16 08:04:04 UTC |
| Attachments | attachment 1737852: var/log/message and pacemaker logs of all 3 nodes |
Description sreekantha 2020-12-09 07:19:35 UTC

Thanks for reporting this. It happens because the LVM-activate resource agent returns OCF_ERR_CONFIGURED if it determines that the volume group doesn't exist. OCF_ERR_CONFIGURED is a fatal error, which prevents the resource from starting **anywhere** until the failure is cleared.

I found the same issue while working on some improvements for BZ1902433. I'm planning to submit a PR, which will likely change the failure code to OCF_ERR_GENERIC, to address this issue.

I think I understand what the resource agent author was trying to do here -- if the volume group doesn't exist, then the resource can't run -- although even then, OCF_ERR_INSTALLED would be more appropriate, since the VG might be present on another node. However, this neglects a couple of edge cases, such as:

- PVs temporarily missing due to an iSCSI connection issue
- a PV with a DID_BAD_TARGET error

In cases like these, the error can be transient or can occur on only one node, yet the resource agent currently responds by preventing the resource from running anywhere after a single failure.

More information about the return codes and how Pacemaker handles them is available here:

- https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Administration/#_how_are_ocf_return_codes_interpreted

Reassigning to resource-agents. Not a Pacemaker issue.

Created attachment 1737852 [details]
var/log/message and pacemaker logs of all the 3 nodes
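To make the return-code discussion above concrete, here is a minimal sketch of a volume-group existence check in an OCF resource agent. This is illustrative only -- it is not the actual LVM-activate code, nor necessarily the exact change in the eventual patch. The `OCF_ERR_*` constants and `ocf_exit_reason` come from the resource-agents shell library (ocf-shellfuncs); the `vg_exists_check` function name is made up for this example.

```sh
#!/bin/sh
# Illustrative sketch only -- not the real LVM-activate agent.
: ${OCF_ROOT=/usr/lib/ocf}
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

VG="$OCF_RESKEY_vgname"

vg_exists_check() {
    # vgs exits nonzero if the volume group cannot be found.
    if ! vgs --noheadings -o vg_name "$VG" >/dev/null 2>&1; then
        ocf_exit_reason "Volume group $VG not found"
        # Old behavior: OCF_ERR_CONFIGURED is fatal, so a single transient
        # failure (e.g. a DID_BAD_TARGET disk error on one node) bans the
        # resource from starting on ALL nodes.
        #exit $OCF_ERR_CONFIGURED
        # Proposed behavior: OCF_ERR_GENERIC is a soft error, letting
        # Pacemaker retry locally or fail the resource over to another node.
        exit $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}

vg_exists_check
```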
Any updates on this bug?

(In reply to sreekantha from comment #5)
> Any updates on this bug?

I have a draft patch at the following URL: https://github.com/ClusterLabs/resource-agents/pull/1600

You're welcome to try it on a test system. At the time of this writing (as I explain in the current latest comment [1]), there is still more work to be done so that the resource operations work well when the partial_activation attribute is set to true. However, if you leave partial_activation unset (as most users do), then that shouldn't matter to you.

[1] https://github.com/ClusterLabs/resource-agents/pull/1600#issuecomment-753458579

Reid Wahl, since the fix is in progress, can we have this documented in a KB article until the fix is in place?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (resource-agents bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2735
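For reference, until a fixed resource-agents package is installed, a resource that has already hit the fatal OCF_ERR_CONFIGURED failure can be allowed to start again by clearing its failure state manually, once the underlying disk problem is resolved. A minimal sketch using standard pcs/Pacemaker commands, assuming a hypothetical resource named `my_lvm`:

```sh
# Show recorded failures for the resource (hypothetical name "my_lvm").
pcs resource failcount show my_lvm

# Clear the failure state so the cluster will attempt to start it again.
pcs resource cleanup my_lvm

# Equivalent lower-level Pacemaker command:
crm_resource --cleanup --resource my_lvm
```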