Bug 1869728
Summary: | sbd always triggers a reboot, while with no-quorum-policy=stop ensuring that all resources are down within watchdog-timeout might be safe enough | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Klaus Wenninger <kwenning> |
Component: | sbd | Assignee: | Klaus Wenninger <kwenning> |
Status: | CLOSED WONTFIX | QA Contact: | cluster-qe <cluster-qe> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 8.3 | CC: | cfeist, cluster-maint, kgaillot |
Target Milestone: | rc | Keywords: | Triaged |
Target Release: | 8.4 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-02-18 07:27:18 UTC | Type: | Bug |
Description
Klaus Wenninger
2020-08-18 14:08:02 UTC
Discussion revealed that rejoining such a node that has successfully stopped resources within the given timeout won't mimic a fence-reboot well enough for the rest of the cluster. Examples include leftover transient attributes, and in general this would probably impose some burden on testing, as we'd somehow have to ensure the behavior is comparable to that after a reboot.

More generally, we might think over the use cases where it really is desirable to prevent a reboot. A list for brainstorming could start with:

- certain server hardware is quite slow to reboot, while a quorum loss might go away quickly, so we could recover the cluster more quickly
- it is always a pain for an admin to find that a shell he was using to observe the node's behavior has been starved/closed because of a reboot
- the node might run services outside of pacemaker's control that would be unnecessarily affected
- ...

These arguments are valid for most cluster scenarios, but they might matter more with watchdog fencing, as we might expect such issues to happen more frequently. Should a cluster node run anything but services under pacemaker control? Maybe not; maybe there are reasons why it makes sense ...

Another possibility that came to my mind was the introduction of a new no-quorum-policy=shutdown (or whatever imposes less risk of misunderstanding) that would make the node attempt a graceful pacemaker shutdown. SBD would again allow watchdog-timeout for this to happen, and if it detects a graceful shutdown of pacemaker (with no resources running, i.e. not in maintenance mode) it would be content and not trigger an actual reboot. That way, from a testing perspective, we would have the same case as a manual service stop/start.
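For context, a minimal sketch of the kind of setup this discussion assumes: diskless (watchdog-only) sbd combined with no-quorum-policy=stop. The timeout values below are only illustrative, and the shutdown policy value at the end is the proposal from this comment, not an existing option.

```
# /etc/sysconfig/sbd -- diskless watchdog fencing (no shared block device)
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5                 # illustrative value

# Pacemaker cluster properties (illustrative values):
#   pcs property set stonith-enabled=true
#   pcs property set stonith-watchdog-timeout=10
#   pcs property set no-quorum-policy=stop
#
# Today, a node that loses quorum in this setup is still self-fenced
# (rebooted) by sbd, even though no-quorum-policy=stop only asks for
# resources to be stopped -- which is what this report questions.
#
# The proposal above would add something like:
#   pcs property set no-quorum-policy=shutdown   # hypothetical, not implemented
```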
(In reply to Klaus Wenninger from comment #2)
> Discussion revealed that rejoining such a node that has successfully
> stopped resources within the given timeout won't mimic a fence-reboot well
> enough for the rest of the cluster. Examples include leftover transient
> attributes, and in general this would probably impose some burden on
> testing, as we'd somehow have to ensure the behavior is comparable to that
> after a reboot.

It's a tough question. There's no way to mimic what happens with nodes not using sbd:

- If any nodes retain quorum, they will fence the nodes without quorum.
- Any node that loses quorum will stop resources if it's able, but leave pacemaker running so it can rejoin the cluster if quorum is regained before fencing is scheduled against it.

The problem, of course, is that sbd can't know whether any other nodes retain quorum, so it has to fence to be safe.

As you suggest, if we can absolutely guarantee that all resources are stopped, and pacemaker and corosync are restarted, then perhaps fencing should be considered unnecessary. On the other hand, sbd can't guarantee that pacemaker and corosync behave correctly once restarted, which may violate assumptions held by any surviving partition. We could stop pacemaker and corosync instead of restarting them, but then the node can't rejoin if quorum is regained, so the only practical benefit is less chance of losing logs.

> More generally, we might think over the use cases where it really is
> desirable to prevent a reboot. A list for brainstorming could start with:
> - certain server hardware is quite slow to reboot, while a quorum loss
>   might go away quickly, so we could recover the cluster more quickly

Just brainstorming, what about a separate quorum loss timeout? If pacemaker detects sbd running and sees a quorum loss timeout in the sbd sysconfig, it would wait that long before declaring fencing of the node successful. The timeout would have to be identical on all nodes. That would slow down quorum recovery in exchange for the chance of the node rejoining more quickly; users would have to balance the two concerns.

> - it is always a pain for an admin to find that a shell he was using to
>   observe the node's behavior has been starved/closed because of a reboot
> - the node might run services outside of pacemaker's control that would be
>   unnecessarily affected

I don't think that's an issue, since fencing is always a possibility, so the admin must already incorporate that into any policy regarding non-clustered services.

> - ...
>
> These arguments are valid for most cluster scenarios, but they might matter
> more with watchdog fencing, as we might expect such issues to happen more
> frequently. Should a cluster node run anything but services under pacemaker
> control? Maybe not; maybe there are reasons why it makes sense ...
>
> Another possibility that came to my mind was the introduction of a new
> no-quorum-policy=shutdown (or whatever imposes less risk of misunderstanding)
> that would make the node attempt a graceful pacemaker shutdown. SBD would
> again allow watchdog-timeout for this to happen, and if it detects a graceful
> shutdown of pacemaker (with no resources running, i.e. not in maintenance
> mode) it would be content and not trigger an actual reboot. That way, from a
> testing perspective, we would have the same case as a manual service
> stop/start.

Per the above, I think the problem is that either the node can't rejoin if quorum is regained, or we risk corosync/pacemaker operating without any observation or check from the quorate partition.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.