Bug 1872404 - restarting nodes in parallel while maintaining quorum creates an unexpected node shutdown.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 8.6
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-25 16:14 UTC by Sofer Athlan-Guyot
Modified: 2022-10-24 13:52 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Pacemaker's controller processed corosync notifications of nodes leaving individually.
Consequence: If the DC and another node left at the same time, but the other node's corosync notification was processed first, then the other node's transient node attributes would not be cleared. This could cause problems such as the node being shut down again when it rejoins the cluster, due to the shutdown attribute remaining.
Fix: The controller now looks through all corosync notifications to see if the DC is leaving, and if so, all nodes remove the leaving nodes' attributes.
Result: Transient attributes do not get "stuck" when a node leaves at the same time as the DC, and the node is not wrongly shut down when it rejoins the cluster.
Clone Of:
Environment:
Last Closed: 2022-02-07 15:32:12 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Description Sofer Athlan-Guyot 2020-08-25 16:14:46 UTC
Description of problem: This bug appears to be a direct copy of https://bugzilla.redhat.com/show_bug.cgi?id=1791841, where a node shutdown is unexpectedly scheduled by the DC during a parallel update of a cluster.

I have copied and adjusted the relevant part of Michele's comment here for convenience.


--- Michele ---
We have a cluster with 9 full pcmk-nodes (no remotes), hosting the usual OSP resources (bunch of bundles and a few VIPs). The nodes are:
controller-0, controller-1, controller-2
messaging-0, messaging-1, messaging-2
database-0, database-1, database-2

Now when we trigger a minor update we basically do the following pseudo-operations:
update() {
  pcs cluster stop
  # bunch of container updates
  pcs cluster start
}

for i in 0 1 2; do
  A) update() on controller-$i
  B) update() on messaging-$i
  C) update() on database-$i
done

Note that A, B, and C happen in parallel. We do this because we do not lose quorum and
because it speeds things up considerably (about 2 hours less in the update process than if we did it serially on each node; NB: when we do this serially, no problem is observed).

--- End Michele ---
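For reference, here is a minimal runnable sketch of the parallel update pattern described above. This is not the actual OSP update tooling; the ssh-based remote execution and the placeholder container-update step are assumptions for illustration only.

update() {                                    # run the update on node $1
  ssh "$1" pcs cluster stop                   # leave the cluster cleanly
  ssh "$1" 'echo container updates go here'   # placeholder for the real update step
  ssh "$1" pcs cluster start                  # rejoin the cluster
}

for i in 0 1 2; do
  update "controller-$i" &                    # A
  update "messaging-$i" &                     # B
  update "database-$i" &                      # C
  wait                                        # only 3 of 9 nodes are down at once, so quorum is kept
done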

In the log we have: 

controller-1 is the current DC and database-2 is just joining the cluster. Then controller-1 tells database-2 to shut down:


Aug 24 19:50:55 controller-1 pacemaker-schedulerd[36556] (sched_shutdown_op)    notice: Scheduling shutdown of node database-2
Aug 24 19:50:55 controller-1 pacemaker-schedulerd[36556] (LogNodeActions)       notice:  * Shutdown database-2


database-2 sees this as an unexpected action:

Aug 24 19:50:35 database-2 pacemaker-controld  [358348] (update_dc)     info: Set DC to controller-1 (3.2.0)

Aug 24 19:50:55 database-2 pacemaker-controld  [358348] (handle_request)        error: We didn't ask to be shut down, yet our DC is telling us to.


Version-Release number of selected component (if applicable):

The versions used are:
                    
pacemaker.x86_64                              2.0.3-5.el8_2.1                                 @rhosp-rhel-8.2-ha             
pacemaker-cli.x86_64                          2.0.3-5.el8_2.1                                 @rhosp-rhel-8.2-ha             
pacemaker-cluster-libs.x86_64                 2.0.3-5.el8_2.1                                 @rhosp-rhel-8.2-appstream      
pacemaker-libs.x86_64                         2.0.3-5.el8_2.1                                 @rhosp-rhel-8.2-appstream      
pacemaker-remote.x86_64                       2.0.3-5.el8_2.1                                 @rhosp-rhel-8.2-ha             
pacemaker-schemas.noarch                      2.0.3-5.el8_2.1                                 @rhosp-rhel-8.2-appstream 

and inside the container:

pacemaker-schemas-2.0.3-5.el8_2.1.noarch
pacemaker-cli-2.0.3-5.el8_2.1.x86_64
pacemaker-remote-2.0.3-5.el8_2.1.x86_64
pacemaker-libs-2.0.3-5.el8_2.1.x86_64
pacemaker-cluster-libs-2.0.3-5.el8_2.1.x86_64
pacemaker-2.0.3-5.el8_2.1.x86_64

Steps to Reproduce:
1. Run a parallel role update of OSP 16.0 / RHEL 8.1 -> OSP 16.1 / RHEL 8.2

Comment 2 Ken Gaillot 2020-09-09 16:32:57 UTC
This is indeed very similar to Bug 1791841 -- unfortunately there is still a timing issue that can result in the same effect. The problematic sequence is:

1. The elected DC at the time and some other node both initiate shutdown (by setting the shutdown node attribute) at the same time. (In this case, messaging-2 is the DC and database-2 is the other node, at 19:37:54.)

2. The DC and other node complete stopping resources and leave the cluster at nearly the same time, but corosync delivers the notification for the other node's departure first.

3. Since the DC has not yet left the cluster when the other node's departure notification is received by the surviving nodes, the surviving nodes assume the DC has erased the leaving node's attributes. However the DC has actually just left and does not erase them, leaving the shutdown attribute in place when the other node rejoins. (The surviving nodes will correctly erase the DC's own attributes when they get notified of its departure shortly after.)

Our current approach when a node leaves the cluster is to have the DC erase the node's attributes if there is an elected DC; otherwise (i.e. if there is no DC, or the leaving node is the DC), all surviving nodes erase the node's attributes. I think this is fundamentally subject to timing issues such as this one and will have to be completely rethought.
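As a hedged illustration of what "stuck" means in practice: the leaving node's transient shutdown attribute survives in the CIB status section. On an affected cluster the lingering attribute can be inspected, and cleared as a manual workaround, with the standard Pacemaker CLI (a sketch only; database-2 is the node from the example above):

crm_attribute --node database-2 --name shutdown --lifetime reboot --query    # non-empty epoch => a shutdown request is still recorded
crm_attribute --node database-2 --name shutdown --lifetime reboot --delete   # clear it so the node is not shut down again on rejoin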

Comment 9 Ken Gaillot 2022-03-25 16:55:48 UTC
This was actually fixed in the upstream 2.1.1 release by commit ee7eba6a, which lands in RHEL 8.6. There is no time to get this BZ formally added to 8.6 this late in the release process, but the fix will be there.
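As a quick way to check whether a given system already carries the fix, a minimal sketch (assuming the standard packaging and CLI tools; the fix is in upstream 2.1.1 and later, which the RHEL 8.6 pacemaker rebase includes):

rpm -q pacemaker        # installed package build; the 8.6 rebase contains commit ee7eba6a
pacemakerd --version    # daemon version; upstream 2.1.1 or later includes the fix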

