Bug 1872404
Summary: | restarting nodes in parallel while maintaining quorum creates an unexpected node shutdown. | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Sofer Athlan-Guyot <sathlang> |
Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | cluster-qe <cluster-qe> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 8.2 | CC: | bdobreli, cfeist, cluster-maint, dhill, kgaillot, michele, rhayakaw |
Target Milestone: | rc | Keywords: | Triaged |
Target Release: | 8.6 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text:
Cause: Pacemaker's controller processed corosync notifications of leaving nodes one at a time.
Consequence: If the DC and another node left at the same time, but the other node's corosync notification was processed first, the other node's transient node attributes were not cleared. Because the shutdown attribute remained set, that node could be shut down again as soon as it rejoined the cluster.
Fix: The controller now looks through all corosync notifications to see whether the DC is leaving; if so, every node removes the leaving nodes' attributes.
Result: Transient attributes no longer get "stuck" when a node leaves at the same time as the DC, and the node is not wrongly shut down when it rejoins the cluster.
Story Points: | --- |
Clone Of: | | Environment: |
Last Closed: | 2022-02-07 15:32:12 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Sofer Athlan-Guyot 2020-08-25 16:14:46 UTC
This is indeed very similar to Bug 1791841 -- unfortunately there is still a timing issue that can result in the same effect. The problematic sequence is:

1. The elected DC at the time and some other node both initiate shutdown (by setting the shutdown node attribute) at the same time. (In this case, messaging-2 is the DC and database-2 is the other node, at 19:37:54.)

2. The DC and the other node complete stopping resources and leave the cluster at nearly the same time, but corosync delivers the notification for the other node's departure first.

3. Since the DC has not yet left the cluster when the other node's departure notification is received by the surviving nodes, the surviving nodes assume the DC has erased the leaving node's attributes. However, the DC has actually just left and does not erase them, leaving the shutdown attribute in place when the other node rejoins. (The surviving nodes will correctly erase the DC's own attributes when they get notified of its departure shortly after.)

Our current approach when a node leaves the cluster is to have the DC erase the node's attributes if there is an elected DC; otherwise (i.e. if there is no DC, or the leaving node is itself the DC), all surviving nodes erase the node's attributes. I think this is fundamentally subject to timing issues such as this one and will have to be completely rethought.

This was actually fixed in the upstream 2.1.1 release by commit ee7eba6a, which lands in RHEL 8.6. There is not time to get this bz formally added to 8.6 this late in the release process, but the fix will be there.
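To make the race and the fix concrete, the following is a minimal, self-contained C sketch. It is not pacemaker's controller code, and all of the names in it (membership_event, node_info, handle_departure_old, handle_membership_change) are hypothetical; it only models the decision described above: who clears a departing node's transient attributes when a corosync notification reports the DC and another node leaving at the same time.

```c
/*
 * Illustrative sketch only -- NOT pacemaker's real controller code. All
 * type and function names here are hypothetical. It models the decision
 * described in this bug: who clears a departing node's transient
 * attributes when the DC and another node leave at nearly the same time.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_NODES 16

typedef struct {
    const char *name;
    bool leaving;               /* reported as lost in this notification */
} node_info;

typedef struct {
    node_info nodes[MAX_NODES];
    int n_nodes;
} membership_event;

/* Old behavior: each departure is handled in isolation. If an elected DC is
 * still known when a departure is processed, every survivor assumes the DC
 * will clear that node's attributes -- even if the DC itself is about to
 * leave (or has just left) and never does. */
static void handle_departure_old(const node_info *node, const char *dc_name)
{
    if (dc_name != NULL && strcmp(node->name, dc_name) != 0) {
        printf("%s left: assuming DC %s clears its transient attributes\n",
               node->name, dc_name);
    } else {
        printf("%s left: no DC (or the DC itself left), clearing attributes locally\n",
               node->name);
    }
}

/* Fixed behavior (as described for upstream 2.1.1): scan the whole batch of
 * notifications first. If the DC is among the leaving nodes, every surviving
 * node clears the attributes of *all* leaving nodes itself. */
static void handle_membership_change(const membership_event *ev, const char *dc_name)
{
    bool dc_is_leaving = false;

    for (int i = 0; i < ev->n_nodes; i++) {
        if (ev->nodes[i].leaving && dc_name != NULL
            && strcmp(ev->nodes[i].name, dc_name) == 0) {
            dc_is_leaving = true;
            break;
        }
    }

    for (int i = 0; i < ev->n_nodes; i++) {
        if (!ev->nodes[i].leaving) {
            continue;
        }
        if (dc_name == NULL || dc_is_leaving) {
            printf("%s left: clearing its transient attributes locally\n",
                   ev->nodes[i].name);
        } else {
            printf("%s left: DC %s is still up and clears its attributes\n",
                   ev->nodes[i].name, dc_name);
        }
    }
}

int main(void)
{
    /* The DC (messaging-2) and another node (database-2) leave together,
     * with the non-DC node's departure delivered first. */
    membership_event ev = {
        .nodes = { { "database-2", true }, { "messaging-2", true } },
        .n_nodes = 2,
    };

    printf("-- old per-departure handling --\n");
    for (int i = 0; i < ev.n_nodes; i++) {
        handle_departure_old(&ev.nodes[i], "messaging-2");
    }

    printf("\n-- fixed whole-notification handling --\n");
    handle_membership_change(&ev, "messaging-2");
    return 0;
}
```

Run against the scenario from the description (database-2's departure delivered before the departing DC messaging-2's), the old per-departure handling leaves database-2's shutdown attribute in place for a DC that no longer exists, while the whole-notification handling has every survivor clear both nodes' attributes.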