We have a cluster with 9 full pcmk nodes (no remotes), hosting the usual OSP resources (a bunch of bundles and a few VIPs). The nodes are:

controller-0, controller-1, controller-2
messaging-0, messaging-1, messaging-2
database-0, database-1, database-2

When we trigger a minor update we basically perform the following pseudo-operations (a rough shell sketch of this loop is included at the end of this comment):

update() {
    pcs cluster stop
    <bunch of container updates>
    pcs cluster start
}

for i in 0 1 2; do
    A) update() on controller-$i
    B) update() on messaging-$i
    C) update() on database-$i
done

Note that A, B and C happen in parallel. We do this because we never lose quorum and because it speeds things up considerably (about 2 hours less than updating each node serially). NB: when we do this serially, no problem is observed.

What we observe when we run update() in parallel is that the cluster ends up in an unexpected state. Here is one timeline we reconstructed on one environment [1] (we obtain the same result when fencing is configured, see [2]):

1. database-1, which had just been started and joined the cluster correctly, observed node messaging-1 coming back up:

Jan 15 04:22:20 database-1 pacemaker-based[249477]: notice: Node messaging-1 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node messaging-1 state is now member
Jan 15 04:22:20 database-1 pacemaker-attrd[249480]: notice: Node messaging-1 state is now member
Jan 15 04:22:20 database-1 corosync[249448]: [QUORUM] Members[9]: 1 2 3 4 5 6 7 8 9
Jan 15 04:22:20 database-1 corosync[249448]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 15 04:22:20 database-1 pacemakerd[249473]: notice: Node messaging-1 state is now member
Jan 15 04:22:20 database-1 pacemaker-controld[249482]: notice: Node messaging-1 state is now member

2. The DC, which is messaging-2, tells database-1 to shut itself down.
At this point database-1 stops itself and needs manual intervention:

Jan 15 04:22:20 database-1 pacemaker-attrd[249480]: notice: Detected another attribute writer (messaging-2), starting new election
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node rabbitmq-bundle-1 state is now lost
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node rabbitmq-bundle-2 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node galera-bundle-0 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node redis-bundle-1 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node redis-bundle-2 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node galera-bundle-2 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node galera-bundle-1 state is now lost
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node ovn-dbs-bundle-1 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node redis-bundle-0 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node ovn-dbs-bundle-2 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node ovn-dbs-bundle-0 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node rabbitmq-bundle-0 state is now member
Jan 15 04:22:20 database-1 pacemaker-attrd[249480]: notice: Recorded new attribute writer: messaging-1 (was database-1)
Jan 15 04:22:20 database-1 pacemaker-controld[249482]: notice: State transition S_PENDING -> S_NOT_DC
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: notice: Result of probe operation for galera-bundle-0 on database-1: 7 (not running)
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: notice: Result of probe operation for galera-bundle-2 on database-1: 7 (not running)
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: notice: Result of probe operation for rabbitmq-bundle-0 on database-1: 7 (not running)
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: notice: Result of probe operation for rabbitmq-bundle-2 on database-1: 7 (not running)
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: notice: Result of probe operation for redis-bundle-0 on database-1: 7 (not running)
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: error: We didn't ask to be shut down, yet our DC is telling us to.
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: notice: State transition S_NOT_DC -> S_STOPPING
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: notice: Stopped 0 recurring operations at shutdown... waiting (3 remaining)
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: notice: Disconnected from the executor
3. On the DC (messaging-2) we see the following:

Jan 15 04:22:20 messaging-2 pacemaker-controld[14317]: notice: Node messaging-1 state is now member
Jan 15 04:22:20 messaging-2 pacemakerd[14311]: notice: Node messaging-1 state is now member
Jan 15 04:22:20 messaging-2 pacemaker-attrd[14315]: notice: Detected another attribute writer (database-1), starting new election
Jan 15 04:22:20 messaging-2 pacemaker-attrd[14315]: notice: Recorded new attribute writer: messaging-1 (was messaging-2)
Jan 15 04:22:20 messaging-2 pacemaker-attrd[14315]: notice: Recorded local node as attribute writer (was unset)
Jan 15 04:22:22 messaging-2 pacemaker-schedulerd[14316]: notice: Scheduling shutdown of node database-1
Jan 15 04:22:22 messaging-2 pacemaker-schedulerd[14316]: notice: * Shutdown database-1
Jan 15 04:22:22 messaging-2 pacemaker-schedulerd[14316]: notice: * Start galera-bundle-1 ( controller-0 ) due to unrunnable galera-bundle-podman-1 start (blocked)
Jan 15 04:22:22 messaging-2 pacemaker-schedulerd[14316]: notice: * Start galera:1 ( galera-bundle-1 ) due to unrunnable galera-bundle-podman-1 start (blocked)

It is not entirely obvious to us why messaging-2 (the DC) decides to simply shut down database-1 right after messaging-1 has joined the cluster, so this seems unexpected.

Versions:
pacemaker-2.0.2-3.el8_1.2.x86_64
corosync-3.0.2-3.el8_1.1.x86_64

Additional info: we can reproduce this situation fairly reliably, although it takes ~14-15 hours for the job to reproduce, so not exactly super fast ;)
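For reference, here is a rough, hypothetical shell sketch of the parallel loop described above. The ssh wrapper and the update_containers.sh placeholder are only illustrative; the real update() is driven by the OSP update workflow:

  #!/bin/bash
  # Simplified sketch of the per-role parallel update described in this report.
  update() {
      local node=$1
      ssh "$node" sudo pcs cluster stop            # take the node out of the cluster
      ssh "$node" sudo ./update_containers.sh      # placeholder for the container updates
      ssh "$node" sudo pcs cluster start           # rejoin the cluster
  }

  for i in 0 1 2; do
      # One node per role at a time, so quorum (5 of 9) is never lost.
      update "controller-$i" &
      update "messaging-$i"  &
      update "database-$i"   &
      wait    # all three roles finish before moving on to the next index
  done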
You always find the fun cases. :)

Pacemaker coordinates node shutdown via a special transient node attribute named (appropriately enough) "shutdown". If it's nonzero, the DC schedules the node for shutdown. When the node leaves the cluster, it gets cleared. Usually. The DC is the one that clears a shutting-down node's transient node attributes from the CIB.

What happened here is that database-1 left the cluster immediately after the DC (messaging-1 at the time) did, so the rest of the cluster was still in the process of electing a new DC when database-1 left -- meaning there was no one to clear its attributes from the CIB. When it later rejoined, its shutdown attribute was still nonzero, so it got scheduled for shutdown again.

My first idea for a fix would be for every node to trigger the clearing if there is no DC when a node leaves. It's inefficient, but we already do the same thing when the DC itself leaves.

Workarounds in the meantime would be to either serialize updates, or query which node is DC beforehand (via crmadmin -D) and save its updates for last.
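A minimal sketch of that workaround, and of how one might confirm the stale attribute, assuming shell access to a cluster node. The node name database-1 is just an example from this report, and the exact output parsing is an assumption:

  # Find the current DC so that its host's role can be updated last.
  # crmadmin -D prints the DC's node name (e.g. "Designated Controller is: messaging-2");
  # taking the last field is an assumption about the exact output format.
  dc_node=$(sudo crmadmin -D | awk '{print $NF}')
  echo "Current DC: ${dc_node}"

  # Check whether a rejoining node still carries a nonzero transient
  # "shutdown" attribute (the condition described above). Transient
  # attributes are the "reboot" lifetime / status section.
  sudo crm_attribute --node database-1 --name shutdown --lifetime reboot --query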
Fixed upstream by commit 01b463bd
Hi, so I've run a parallel (//) role update of OSP16 using the same version that was usually failing with the previous pacemaker version [1]. The run was successful with respect to the current issue [2][3]. Note that I cannot move this bz to VERIFIED, but it is. I'll expand a bit more and show that the right version of pacemaker was used:

# After the update the cluster is fine
[heat-admin@controller-1 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: messaging-0 (version 2.0.3-5.el8-4b1f869f0f) - partition with quorum
  * Last updated: Thu Feb 27 11:48:35 2020
  * Last change: Thu Feb 27 00:58:41 2020 by redis-bundle-2 via crm_attribute on controller-2
  * 21 nodes configured
  * 57 resource instances configured

Node List:
  * Online: [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
  * GuestOnline: [ galera-bundle-0@database-0 galera-bundle-1@database-1 galera-bundle-2@database-2 ovn-dbs-bundle-0@controller-1 ovn-dbs-bundle-1@controller-0 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@messaging-0 rabbitmq-bundle-1@messaging-1 rabbitmq-bundle-2@messaging-2 redis-bundle-0@controller-1 redis-bundle-1@controller-0 redis-bundle-2@controller-2 ]

Full List of Resources:
  * Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest]:
    * galera-bundle-0 (ocf::heartbeat:galera): Master database-0
    * galera-bundle-1 (ocf::heartbeat:galera): Master database-1
    * galera-bundle-2 (ocf::heartbeat:galera): Master database-2
  * Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started messaging-0
    * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started messaging-1
    * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started messaging-2
  * Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]:
    * redis-bundle-0 (ocf::heartbeat:redis): Slave controller-1
    * redis-bundle-1 (ocf::heartbeat:redis): Slave controller-0
    * redis-bundle-2 (ocf::heartbeat:redis): Master controller-2
  * ip-192.168.24.33 (ocf::heartbeat:IPaddr2): Started controller-2
  * ip-2620.52.0.13b8.5054.ff.fe3e.1 (ocf::heartbeat:IPaddr2): Started controller-2
  * ip-fd00.fd00.fd00.2000..33c (ocf::heartbeat:IPaddr2): Started controller-2
  * ip-fd00.fd00.fd00.2000..10d (ocf::heartbeat:IPaddr2): Started controller-2
  * ip-fd00.fd00.fd00.3000..da (ocf::heartbeat:IPaddr2): Started controller-2
  * ip-fd00.fd00.fd00.4000..1f9 (ocf::heartbeat:IPaddr2): Started controller-2
  * Container bundle set: haproxy-bundle [cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-1
    * haproxy-bundle-podman-1 (ocf::heartbeat:podman): Started controller-0
    * haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started controller-2
  * Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
    * ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-1
    * ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-0
    * ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-2
  * ip-fd00.fd00.fd00.2000..d4 (ocf::heartbeat:IPaddr2): Started controller-1
  * stonith-fence_ipmilan-5254000e7624 (stonith:fence_ipmilan): Started database-0
  * stonith-fence_ipmilan-5254008ddd24 (stonith:fence_ipmilan): Started database-2
  * stonith-fence_ipmilan-5254008d88d6 (stonith:fence_ipmilan): Started messaging-0
  * stonith-fence_ipmilan-5254006c8596 (stonith:fence_ipmilan): Started messaging-0
  * stonith-fence_ipmilan-525400d32f45 (stonith:fence_ipmilan): Started messaging-1
  * stonith-fence_ipmilan-525400979c13 (stonith:fence_ipmilan): Started database-2
  * stonith-fence_ipmilan-52540017c322 (stonith:fence_ipmilan): Started messaging-2
  * stonith-fence_ipmilan-525400a7bb1c (stonith:fence_ipmilan): Started messaging-1
  * stonith-fence_ipmilan-525400619c69 (stonith:fence_ipmilan): Started database-2
  * Container bundle: openstack-cinder-backup [cluster.common.tag/rhosp16-openstack-cinder-backup:pcmklatest]:
    * openstack-cinder-backup-podman-0 (ocf::heartbeat:podman): Started controller-1
  * Container bundle: openstack-cinder-volume [cluster.common.tag/rhosp16-openstack-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

# pacemaker is updated in the container:
[heat-admin@controller-1 ~]$ sudo podman ps | grep hotfix
1b8a25822eb1 undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-redis:20200130.1-hotfix /bin/bash /usr/lo... 11 hours ago Up 11 hours ago redis-bundle-podman-0
ba22ff49e131 undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-northd:20200130.1-hotfix /bin/bash /usr/lo... 11 hours ago Up 11 hours ago ovn-dbs-bundle-podman-0
aa8f631b5910 undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-backup:20200130.1-hotfix /bin/bash /usr/lo... 11 hours ago Up 11 hours ago openstack-cinder-backup-podman-0
a7665faeb2f8 undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:20200130.1-hotfix /bin/bash /usr/lo... 11 hours ago Up 11 hours ago haproxy-bundle-podman-0

[heat-admin@controller-1 ~]$ echo and has the right version
and has the right version
[heat-admin@controller-1 ~]$ sudo podman exec -ti haproxy-bundle-podman-0 rpm -qa | grep ^pacemaker
pacemaker-cluster-libs-2.0.3-5.el8.x86_64
pacemaker-schemas-2.0.3-5.el8.noarch
pacemaker-cli-2.0.3-5.el8.x86_64
pacemaker-2.0.3-5.el8.x86_64
pacemaker-libs-2.0.3-5.el8.x86_64
pacemaker-remote-2.0.3-5.el8.x86_64

# on the host too
[heat-admin@controller-1 ~]$ rpm -qa | grep ^pacemaker
pacemaker-2.0.3-5.el8.x86_64
pacemaker-cli-2.0.3-5.el8.x86_64
pacemaker-libs-2.0.3-5.el8.x86_64
pacemaker-remote-2.0.3-5.el8.x86_64
pacemaker-schemas-2.0.3-5.el8.noarch
pacemaker-cluster-libs-2.0.3-5.el8.x86_64

Thanks.

[1] The current version of OSP16 is failing on some other things.
[2] Got a transient error during the last step of VM migration, but it's completely unrelated, in http://staging-jenkins2-qe-playground.usersys.redhat.com/view/update/job/DFG-upgrades-updates-16-from-passed_phase1-composable-ipv6/ job 5.
[3] On the same CI, job 4 passed completely but was using pacemaker-2.0.3-4.el8.1.x86_64 in the container.
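As a side note, a quick hypothetical loop along the same lines can confirm the pacemaker version inside every bundle container on a host, not just haproxy (container naming assumed from the podman output above):

  # Query the pacemaker package inside each pacemaker-managed bundle container.
  for c in $(sudo podman ps --format '{{.Names}}' | grep bundle-podman); do
      echo "== $c"
      sudo podman exec "$c" rpm -q pacemaker
  done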
marking verified in pacemaker-2.0.3-5.el8 as per comment#15
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:1609