Bug 1688149
| Field | Value |
|---|---|
| Summary: | pacemaker cluster will never settle |
| Product: | Red Hat Enterprise Linux 8 |
| Component: | pacemaker |
| Version: | 8.0 |
| Status: | ON_QA |
| Severity: | medium |
| Priority: | high |
| Reporter: | michal novacek <mnovacek> |
| Assignee: | Reid Wahl <nwahl> |
| QA Contact: | cluster-qe <cluster-qe> |
| CC: | cluster-maint, jrehova, kgaillot, lmiksik, nwahl |
| Keywords: | Reopened, Triaged |
| Target Milestone: | pre-dev-freeze |
| Target Release: | 8.9 |
| Hardware: | All |
| OS: | All |
| Type: | Bug |
| Doc Type: | Bug Fix |
| Fixed In Version: | pacemaker-2.1.6-4.el8 |
| Target Upstream Version: | 2.1.7 |
| Last Closed: | 2021-02-01 07:39:27 UTC |
| Bug Depends On: | 1682116 |

Doc Text:

Cause: Pacemaker previously assigned clone instances to equally scored nodes without considering the instances' current nodes.

Consequence: If a clone had equally scored location constraints on a subset of nodes, clone instances could be assigned to a different node each time and continuously stopped and restarted by the cluster.

Fix: Instances are now assigned to their current node whenever possible.

Result: Clone instances do not get restarted unnecessarily.
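For illustration of the pattern described in the Doc Text, a minimal sketch follows; the resource and node names (`my_fs`, `node1`, `node2`) are hypothetical and are not taken from this bug. Equally scored INFINITY location constraints on the nodes of a symmetric cluster are the shape of configuration that triggered the unnecessary restarts.

```sh
# Hypothetical resource and node names, for illustration only.
# Two location constraints with the same score (INFINITY) on both nodes of a
# symmetric cluster: before the fix, the scheduler could assign the clone
# instances to a different node on each run, causing continuous stop/start.
pcs constraint location my_fs prefers node1=INFINITY
pcs constraint location my_fs prefers node2=INFINITY
```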
Description michal novacek 2019-03-13 09:50:45 UTC
It's actually not running happily; starting at Mar 12 16:03:37 in the logs (when the LVM-activate/Filesystem/VirtualDomain resources and their constraints are added), the resources are continuously restarting. :(

Removing the location constraints for the Filesystem resources seems to work around the problem. (They are equally scored location constraints for both nodes in a symmetric cluster, so they have no effect.)

I also see location constraints keeping various resources off the VirtualDomain resources. Those are not Pacemaker Remote nodes, so the constraints do not mean anything. However, those constraints aren't causing any problems.

There is a separate issue with the simulation (but not the cluster) thinking the fence devices need to be restarted. That might interfere with the --wait as well. This is a known issue that has not been investigated.

Can you try the workaround and see if it helps? We need to fix the underlying issues, but given how difficult it is to get anything into GA at this point, a workaround would be good to have.

I can confirm that removing the positive constraint for filesystem works around the problem.

An update:

(In reply to Ken Gaillot from comment #1)
> It's actually not running happily; starting at Mar 12 16:03:37 in the logs
> (when the LVM-activate/Filesystem/VirtualDomain resources and their
> constraints are added), the resources are continuously restarting. :(

Looking at the logs more closely, I was off a bit: the configuration was being repeatedly changed during this time, so resources were starting and stopping appropriately. Problems actually start at Mar 12 16:32:27.

> Removing the location constraints for the Filesystem resources seems to work
> around the problem. (They are equally scored location constraints for both
> nodes in a symmetric cluster, so they have no effect.)

Changing the location constraints to have a score less than INFINITY also works around the problem.

Pacemaker assigns an instance number to clone instances on each node. What is going wrong here is that every time Pacemaker runs its scheduler, it assigns different instance numbers to the existing active instances compared to what it wants the final result to be, so it thinks the instances need to be moved. The cause for that still needs to be found and fixed.

> There is a separate issue with the simulation (but not the cluster) thinking
> the fence devices need to be restarted. That might interfere with the --wait
> as well. This is a known issue that has not been investigated.

As an aside, the simulation issue has been fixed, though the fix will not make it into RHEL 8.4. However, that issue does not affect --wait when used with a live cluster.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

(In reply to RHEL Program Management from comment #13)
> After evaluating this issue, there are no plans to address it further or fix
> it in an upcoming release. Therefore, it is being closed. If plans change
> such that this issue will be fixed in an upcoming release, then the bug can
> be reopened.

This is still a high priority and I am hopeful the fix will be in RHEL 8.5. Once we are further along in 8.5 release planning, we will likely reopen this.

This is fixed by upstream commit 018ad6d5.
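For clarity, the workaround discussed in the comments might look like the following pcs commands. This is only a sketch: the resource name, node names, constraint IDs, and score value are hypothetical, so list the real constraint IDs with `pcs constraint` before removing anything.

```sh
# List current constraints to find the auto-generated location constraint IDs.
pcs constraint

# Workaround A: remove the equally scored location constraints entirely
# (IDs below are hypothetical; substitute the ones shown by `pcs constraint`).
pcs constraint remove location-my_fs-node1-INFINITY
pcs constraint remove location-my_fs-node2-INFINITY

# Workaround B: keep the node preferences but use a score below INFINITY,
# which was also reported to avoid the continuous restarts.
pcs constraint location my_fs prefers node1=100
pcs constraint location my_fs prefers node2=100

# Sanity check: run the scheduler against the live CIB and show allocation
# scores; a settled cluster should show no further stop/start actions.
crm_simulate -sL
```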