Bug 1961857

| Summary: | pacemaker seems to end up in an unfence loop | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Michele Baldessari <michele> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | high | Priority: | high |
| Version: | 8.4 | Target Release: | 8.5 |
| Target Milestone: | rc | Keywords: | Triaged, ZStream |
| Hardware: | Unspecified | OS: | Unspecified |
| Flags: | pm-rhel: mirror+ | CC: | cfeist, cluster-maint, dabarzil, dciabrin, jeckersb, jmarcian, kwenning, lmiccini, ltamagno, msmazova, pzimek |
| Fixed In Version: | pacemaker-2.1.0-3.el8 | Doc Type: | Bug Fix |
| Last Closed: | 2021-11-09 18:44:54 UTC | Type: | Bug |
| Clone Of: | | Cloned to: | 1972273 (view as bug list) |
| Bug Blocks: | 1972273 | | |

Doc Text:

> Cause: Pacemaker's controller on a node might be elected the Designated Controller (DC) before its attribute manager learned an already-active remote node is remote.
> Consequence: The node's scheduler would not see any of the remote node's node attributes. If the cluster used unfencing, this could result in an unfencing loop.
> Fix: The attribute manager can now learn a remote node is remote via additional events, including the initial attribute sync at start-up.
> Result: No unfencing loop occurs, regardless of which node is elected DC.
Description (Michele Baldessari, 2021-05-18 20:22:03 UTC)
> I suspect that the root cause is around these two messages:
> May 18 20:13:21 messaging-0.redhat.local pacemaker-attrd[141593]: notice: Cannot update #node-unfenced[compute-0]=1621368731 because peer UUID not known (will retry if learned)
> May 18 20:13:21 messaging-0.redhat.local pacemaker-attrd[141593]: notice: Cannot update #node-unfenced[compute-1]=1621368801 because peer UUID not known (will retry if learned)

You are correct. The attribute manager cannot record the node attribute indicating that the node was unfenced into the CIB until it knows the node UUID, so the scheduler never realizes that the unfencing has already been done. I'll have to investigate the sosreports to see why the UUID is unknown in this particular case, but I'm thinking we can close this as a duplicate of Bug 1905965 -- it's not the same problem, but that fix should take care of this, too. (We would track node-unfenced in the CIB node_state entry rather than as a node attribute.)

The fix for Bug 1905965 would help somewhat but would not be a complete fix in this case. The problem is that the node attribute manager doesn't know that the compute nodes are remote nodes. Normally, all cluster nodes learn that a node is a remote node when the remote connection comes up. In this case, messaging-0 was down when the compute nodes came up, so it never learned that. When a node comes up, the other nodes send it all known attributes, so messaging-0 did get a copy of the compute nodes' unfencing-related attributes. However, this sync does not include whether the relevant nodes are cluster nodes or remote nodes, so messaging-0 assumed the compute nodes were unseen cluster nodes, and couldn't record their node attributes. When messaging-0 later became DC, nothing had yet happened that would make its attribute manager learn the remote nodes, so it still hadn't recorded those attributes, and its scheduler thought unfencing was still needed.

The fix will be to pass along the cluster vs remote info in the sync with newly joining nodes. In the meantime, the only workaround I see in this situation is restarting the cluster on the affected node, so another node becomes DC. How urgent is this, and what RHEL versions would you need the fix in?
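For a cluster showing these symptoms, one way to confirm that the DC's attribute manager simply never recorded the unfencing attribute is to query it directly. A minimal check, run on the DC and using the node name from this report as a placeholder; if the attribute was never recorded, the query fails or returns nothing:

# Ask the local attribute manager for the unfencing attribute of a remote node
attrd_updater --query --name "#node-unfenced" --node compute-0
# Compare with what the scheduler actually sees in the CIB status section
cibadmin --query --scope status | grep node-unfenced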
Heya Ken, so I don't think we've managed to reproduce this in the last month, so for now I'll cautiously say that we can aim for RHEL-9 for this for the time being. I'll update here if we spot this in the wild some more. Thanks, Michele

@michele I'm narrowing down the circumstances under which this occurs. I see this after controller-2 left and rejoined:

May 18 18:36:05 controller-2.redhat.local pacemaker-based [312491] (cib_process_request) info: Forwarding cib_delete operation for section //cluster_property_set/nvpair[@name='tripleo-shutdown-lock' and @value='controller-2:1621363508']/.. to all (origin=local/cibadmin/2)

Are you calling cibadmin immediately after starting an upgraded node? If so, when/how/why?

Hi Ken, I'll let Damien comment more in detail, but yes, there is a script we use to coordinate certain operations across the control plane that uses some attributes to implement a distributed lock [1].

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/container_config_scripts/pacemaker_resource_lock.sh

(In reply to Ken Gaillot from comment #6)
> Are you calling cibadmin immediately after starting an upgraded node? If so,
> when/how/why?

This is an idiomatic way for us to control the stop/restart of pacemaker nodes while upgrades are being executed concurrently.

The context is that during an upgrade of our OpenStack control plane, we want to avoid shutting down several pacemaker nodes at once, because that can trigger spurious node shutdowns [1,2]. More specifically, while some pacemaker nodes are upgraded sequentially to avoid service disruption (e.g. database nodes), different types of nodes can end up being shut down at the same time (a database node and a networker node), and we want to avoid that, otherwise we would hit [1] or [2]. So the idea is to add a sort of coordination between nodes:

. If a node wants to stop the pacemaker cluster locally, it first needs to set the attribute 'tripleo-shutdown-lock' in the CIB and then stop the cluster.
. If the attribute is set in the CIB, no other node is allowed to stop a pacemaker cluster node. A node requiring a shutdown needs to wait until the attribute is removed from the CIB, and then try to set it itself to signal its own shutdown.
. Once a node has stopped pacemaker and finished its upgrade, it restarts pacemaker locally and clears the attribute from the CIB.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1791841
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1872404
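For reference, a stripped-down sketch of that acquire/release flow, using crm_attribute against the cluster properties. This is not the actual pacemaker_resource_lock.sh logic (which drives cibadmin directly and has to handle races and stale locks); the lock name matches the one above, everything else is illustrative:

# Illustrative sketch only: a 'tripleo-shutdown-lock'-style lock stored as a cluster property
LOCK=tripleo-shutdown-lock
ME="$(hostname -s):$(date +%s)"
# wait until no other node holds the lock, then claim it (no race handling here)
until [ -z "$(crm_attribute --type crm_config --name "$LOCK" --query --quiet 2>/dev/null)" ]; do sleep 10; done
crm_attribute --type crm_config --name "$LOCK" --update "$ME"
# ... stop the cluster locally, upgrade, restart the cluster ...
crm_attribute --type crm_config --name "$LOCK" --delete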
OK, I understand what happened. The combination of restarting all the cluster nodes one by one, and running cibadmin immediately after they are restarted, triggers a CIB conflict.

Pacemaker has an algorithm to keep the CIB synchronized across all cluster nodes. For example, if a cluster is partitioned, the CIB is modified differently in each partition, and then the partition heals, Pacemaker has to pick one of the new CIBs to keep and trash the other. The algorithm gives a preference to changes from cibadmin since they are made by the user.

You're using cibadmin to remove the tripleo lock immediately after the node starts the cluster, which creates a timing issue. The cibadmin change can be made effective locally before the node has rejoined the cluster's pacemaker-controld layer. When the node does rejoin, its CIB now looks "better" (because it came from cibadmin) than the cluster's current CIB and overwrites it, losing some information in the process, including transient node attributes.

Normally that's not too much of a problem, because the current attribute writer (a role similar to DC, and usually the same node) will write all transient attributes back out to the CIB. However, because every cluster node has been restarted at this point, and the remote nodes have not been restarted, the attribute writer doesn't know that the remote nodes are remote nodes, and can't write out their attributes. This includes unfencing information, so the DC's scheduler now thinks unfencing needs to be redone, and the attributes can never be written, so it keeps looping.

I have a fix ready for testing, but because this is a timing issue, a reliable reproducer might be difficult. I'm thinking it should be possible to bring the last cluster node back up with its corosync ports blocked, run cibadmin manually on it, then unblock corosync, but I haven't tried that yet.

A workaround (and better solution, really) would be to wait until the node has fully rejoined the cluster before running cibadmin.

(In reply to Ken Gaillot from comment #9)
> A workaround (and better solution, really) would be to wait until the node
> has fully rejoined the cluster before running cibadmin.

Thanks for the insight and the great analysis!

In parallel, I think we can work out a fix on our side to wait until the node has rejoined the cluster before trying to touch the CIB, so we'll test that out and report shortly.
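As a rough illustration of that kind of wait (the "Online:" line comes from crm_mon output, whose format varies a bit between versions, so the grep pattern is only an assumption):

# Illustrative sketch only: block until the local node is listed as online
node="$(hostname -s)"
until crm_mon --one-shot 2>/dev/null | grep -q "Online:.*$node"; do sleep 5; done
# now run cibadmin, e.g. to clear the shutdown lock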
Fixed upstream as of commit 540d7413

FYI, a workaround just occurred to me, though I haven't tested it:

When a remote connection starts or migrates, the node hosting the connection learns the node is remote. Remote connections are live-migratable (no resources on the remote node should be affected by a migration). So, if we live-migrate the remote connection to each cluster node in turn, they all should learn its remoteness.
That can be accomplished by setting a location preference, e.g.
pcs constraint location <remote-resource> prefers <node1>
# wait until connection is finished migrating
pcs constraint location show --full
# get the constraint ID from the output of above
pcs constraint remove <constraint-id>
# repeat for each cluster node
That could be done for just the current DC to fix the immediate problem, but it should be done on any cluster node that has restarted since the remote nodes were up, in case they later become DC.
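A rough way to script that repetition; the resource and node names below are placeholders, and the constraint id is just a temporary label chosen for this sketch:

REMOTE=compute-0   # placeholder: remote connection resource
for node in controller-0 controller-1 controller-2; do   # placeholder cluster nodes
    pcs constraint location add learn-remote-tmp "$REMOTE" "$node" INFINITY
    sleep 60   # crude wait for the connection to finish migrating
    pcs constraint remove learn-remote-tmp
done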
(In reply to Ken Gaillot from comment #21)
> pcs constraint location <remote-resource> prefers <node1>
> # wait until connection is finished migrating
> pcs constraint location show --full
> # get the constraint ID from the output of above
> pcs constraint remove <constraint-id>
> # repeat for each cluster node

A more convenient way of doing the same thing would be:

pcs resource move <remote-resource> <node1> --wait
pcs resource clear <remote-resource> --wait
# repeat for each cluster node

Verified. Based on verification procedure of pacemaker-2.0.5-9.el8_4.2 on OSP16.2/RHEL8.4. Please see https://bugzilla.redhat.com/show_bug.cgi?id=1972273#c3

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:4267