Bug 1627948
| Summary: | Bundle component actions can be scheduled on wrong node |
|---|---|
| Product: | Red Hat Enterprise Linux 7 |
| Component: | pacemaker |
| Version: | 7.5 |
| Status: | CLOSED ERRATA |
| Severity: | low |
| Priority: | high |
| Reporter: | Marian Krcmarik <mkrcmari> |
| Assignee: | Ken Gaillot <kgaillot> |
| QA Contact: | cluster-qe <cluster-qe> |
| CC: | abeekhof, aherr, cluster-maint, michele, mkrcmari, phagara |
| Target Milestone: | rc |
| Target Release: | 7.7 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | pacemaker-1.1.20-1.el7 |
| Doc Type: | No Doc Update |
| Doc Text: | The issue is mostly invisible to end users. |
| Last Closed: | 2019-08-06 12:53:44 UTC |
| Type: | Bug |
Created attachment 1482498 [details]
sosreports from pcmk remote node
Created attachment 1482512 [details]
The rest of the sosreports for DC
(In reply to Marian Krcmarik from comment #0)
> Created attachment 1482497 [details]
> part of sosreports (sos_commands/pacemaker)
>
> Description of problem:
> pacemaker remote node does not recover from failover correctly. The observed
> flow is as following:
> - hard reset is triggered on pacemaker remote node which is part of cluster
> - the boot of the remote node is delayed
> - pacemaker detects that the remote node is down and performs fencing
> - the remote node has delayed booting
> - for some reason pacemaker tries to start pacemaker:remote resource even
> though the remote node is down (and even It displays the remote node as
> online)

we have no idea when the remote will come back, so we try anyway - knowing that the failures will time out and we'll retry later.

> - it fails to start pacemaker:remote resource on all full pacemaker nodes
> and fence the node again

it should be blocking, not fencing. i suspect something might have changed :-(

> - now it marks the pacemaker:remote resource as stopped and remote node as
> offline even though it eventually finally comes back online and
> pacemaker_remote daemon starts to work fully.
>
> This was observed in Openstack based cluster with 3 full pacemaker nodes
> called controller-* and 6 pacemaker remote nodes
>
> I am attaching some sosreports, full sosreport from database-0 pacemaker
> remote node which was reset at 21:23:04 Sept 11 of log time and then part of
> sosreport from DC (controller-0):
> pacemaker.tar.xz contains part of sosreport from sos_commands/pacemaker
> sosreport-controller-1-20180911214317.tar.xz contains the rest of the
> sosreports
> (I am not able to place the logs anywhere I used to, it does not work)
>
> Version-Release number of selected component (if applicable):
> pacemaker-cluster-libs-1.1.18-11.el7_5.3.x86_64
> pacemaker-libs-1.1.18-11.el7_5.3.x86_64
> puppet-pacemaker-0.7.2-0.20180423212251.el7ost.noarch
> pacemaker-remote-1.1.18-11.el7_5.3.x86_64
> pacemaker-cli-1.1.18-11.el7_5.3.x86_64
> ansible-pacemaker-1.0.4-0.20180220234310.0e4d7c0.el7ost.noarch
> pacemaker-1.1.18-11.el7_5.3.x86_64
>
> How reproducible:
> Always
>
> Steps to Reproduce:
> 1. I deploy Openstack based cluster with 3 full pacemaker nodes and 6 remote
> pacemaker nodes (see attached reports for specific cluster configuration)
> 2. Reset one of the remote node and delay the boot.
>
> Actual results:
> pacemaker:remote resource for that remote node will end up in stopped state
> and no resource on that node will be started
>
> Expected results:
>
>
> Additional info:

There are multiple issues. All of the issues arise from the bundle's remote connection being hosted on a different node than the bundle's container. That's required here because the container is hosted on a remote node, but it triggered some bugs in failure handling.

One issue is that an inappropriate clearing of the container's fail count is scheduled, and on the wrong node to boot. Because it's on the wrong node, it doesn't really hurt anything, it just causes a new transition when it times out. I believe I have a fix for this one.

Another issue is that a stop of the container is scheduled on the cluster node hosting the container's remote connection (which is not hosting the container). Since a stop of an already stopped resource is a success, this doesn't cause any serious harm either. I'm still investigating a fix for this.

The most significant issue is that the remote node is fenced a second time unnecessarily. This appears to have been introduced by a fix in upstream version 1.1.17 (RHEL 7.5) that ensured that unrecoverable remote nodes are fenced even if no resources can run on them. I am still investigating this one as well.

After further investigation, most of the concerns here are expected behavior, and can be modified with appropriate configuration.

If a remote node is fenced, by default Pacemaker immediately and repeatedly tries reconnecting to it. If the remote node does not come back in time, the reconnect attempts will fail on all nodes, causing a second fencing and preventing the connection from being attempted again until the failure is cleaned.

The second fencing is necessary because Pacemaker does not know whether a failed start leaves the resource cleanly stopped or partially started. (This is questionable in the case of ocf:pacemaker:remote, which the cluster does have more direct knowledge of, but it makes sense as a general behavior.)

Two configuration settings affect this behavior:

* If the ocf:pacemaker:remote agent's reconnect_interval parameter is set, Pacemaker will try to connect to the node at this interval rather than immediately. This gives the node more time to come back up, potentially avoiding the failures and second fencing.

* A failure-timeout (as usual) can expire the start failures, allowing the cluster to retry connecting again.

The only bugs here are the wrongly scheduled clear_failcount and stop actions, which are rather low impact.
I have fixes for these; however, they would require an ABI compatibility break, so I am cautious about putting them into RHEL 7. If you think it's important enough, we can do it, or if you think it's important to avoid the second fencing in this case I can investigate that more, but otherwise I'll merge those fixes upstream only for now and close this.

After further consideration, I do want to backport the fixes for the wrongly scheduled actions to RHEL 7.7. We can use this BZ to track those. I do not think we will need z-streams.

The libpe_status API has up to this point been undocumented, so technically it is not a public API yet. Along with this change, we can add API documentation, so we are technically introducing the API rather than breaking it. The only practical effect should be that sbd will need to be rebuilt.

Fixed in upstream master branch (for RHEL 8) by commits 64852e3a through 556796e0, and in upstream 1.1 branch (for RHEL 7) by commits 163742c4 through 58e4eb80.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2129
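
For anyone wanting to try the two configuration settings named in the comments above (reconnect_interval on the ocf:pacemaker:remote resource, and a failure-timeout meta attribute), the following is a minimal sketch of how they might be set with pcs. The resource name database-0 is taken from the reporter's description, and the 60s and 120s values are purely illustrative assumptions, not recommendations from this bug. The script only prints the commands unless PCS=pcs is set in the environment.

```shell
#!/bin/sh
# Sketch only: by default this PRINTS the pcs commands instead of running them.
# Run with PCS=pcs in the environment to apply them to a real cluster.
# "database-0" and both time values are illustrative assumptions.
PCS="${PCS:-echo pcs}"
REMOTE="${REMOTE:-database-0}"

# Wait between reconnect attempts after the remote node is fenced,
# rather than retrying immediately.
set_reconnect_interval() {
    $PCS resource update "$REMOTE" reconnect_interval="$1"
}

# Let start failures expire so the cluster retries the connection
# without a manual cleanup.
set_failure_timeout() {
    $PCS resource meta "$REMOTE" failure-timeout="$1"
}

set_reconnect_interval 60s
set_failure_timeout 120s
```

The reconnect interval should presumably be chosen with the node's expected boot delay in mind; a value shorter than the boot time can still exhaust start attempts on every cluster node and lead to the second fencing described above.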
Created attachment 1482497 [details]
part of sosreports (sos_commands/pacemaker)

Description of problem:
A Pacemaker remote node does not recover from failover correctly. The observed flow is as follows:
- A hard reset is triggered on a Pacemaker remote node that is part of the cluster.
- The boot of the remote node is delayed.
- Pacemaker detects that the remote node is down and performs fencing.
- The remote node is still delayed in booting.
- For some reason Pacemaker tries to start the pacemaker:remote resource even though the remote node is down (and it even displays the remote node as online).
- It fails to start the pacemaker:remote resource on all full Pacemaker nodes and fences the node again.
- It then marks the pacemaker:remote resource as stopped and the remote node as offline, even though the node eventually comes back online and the pacemaker_remote daemon becomes fully functional.

This was observed in an OpenStack-based cluster with 3 full Pacemaker nodes called controller-* and 6 Pacemaker remote nodes.

I am attaching some sosreports: a full sosreport from the database-0 Pacemaker remote node, which was reset at 21:23:04 Sept 11 (log time), and part of the sosreport from the DC (controller-0):
pacemaker.tar.xz contains the part of the sosreport from sos_commands/pacemaker
sosreport-controller-1-20180911214317.tar.xz contains the rest of the sosreports
(I am not able to place the logs where I used to; it does not work.)

Version-Release number of selected component (if applicable):
pacemaker-cluster-libs-1.1.18-11.el7_5.3.x86_64
pacemaker-libs-1.1.18-11.el7_5.3.x86_64
puppet-pacemaker-0.7.2-0.20180423212251.el7ost.noarch
pacemaker-remote-1.1.18-11.el7_5.3.x86_64
pacemaker-cli-1.1.18-11.el7_5.3.x86_64
ansible-pacemaker-1.0.4-0.20180220234310.0e4d7c0.el7ost.noarch
pacemaker-1.1.18-11.el7_5.3.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Deploy an OpenStack-based cluster with 3 full Pacemaker nodes and 6 Pacemaker remote nodes (see the attached reports for the specific cluster configuration).
2. Reset one of the remote nodes and delay its boot.

Actual results:
The pacemaker:remote resource for that remote node ends up in the stopped state, and no resource on that node is started.

Expected results:

Additional info: