Bug 1791841

Summary: restarting nodes in parallel while maintaining quorum creates an unexpected node shutdown
Product: Red Hat Enterprise Linux 8 Reporter: Michele Baldessari <michele>
Component: pacemaker    Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 8.1    CC: bdobreli, cluster-maint, lmiccini, phagara, pkomarov, sathlang
Target Milestone: pre-dev-freeze   
Target Release: 8.2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: pacemaker-2.0.3-5.el8 Doc Type: Bug Fix
Doc Text:
Cause: When a non-DC node leaves the cluster, the DC clears its transient node attributes from the CIB.
Consequence: If no DC was elected at the time a node left, its transient node attributes would remain in effect when it later rejoined the cluster -- including shutting the node down again if it had been scheduled for shutdown previously.
Fix: If no DC is elected when a node leaves, all cluster nodes will remove the leaving node's transient node attributes from the CIB.
Result: No matter when a node leaves the cluster, its transient node attributes will be cleared from the CIB, avoiding a double shutdown when no DC is elected.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-04-28 15:38:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1791135    

Description Michele Baldessari 2020-01-16 15:20:36 UTC
We have a cluster with 9 full pcmk nodes (no remotes), hosting the usual OSP resources (a bunch of bundles and a few VIPs). The nodes are:
controller-0, controller-1, controller-2
messaging-0, messaging-1, messaging-2
database-0, database-1, database-2

Now when we trigger a minor update we basically do the following pseudo-operations:
update() {
  pcs cluster stop
  bunch of container updates
  pcs cluster start
}

for i in 0 1 2; do
  A) update() on controller-$i
  B) update() on messaging-$i
  C) update() on database-$i
done

Note that A, B, and C happen in parallel. We do this because we do not lose quorum and
because it speeds things up considerably (about 2 hours less in the update process than if we did it serially on each node; NB: when we do this serially, no problem is observed).
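To make the pseudo-operations concrete, here is a minimal bash sketch of that parallel scheme; the ssh invocations and the update_containers placeholder are assumptions standing in for the real update tooling:

update() {
  # Hypothetical helper mirroring the pseudo-operations above;
  # update_containers is a placeholder for the real container update steps.
  local node="$1"
  ssh "$node" sudo pcs cluster stop      # take the node out of the cluster
  ssh "$node" sudo update_containers     # placeholder: bunch of container updates
  ssh "$node" sudo pcs cluster start     # rejoin the cluster
}

for i in 0 1 2; do
  # One node per role at a time, so 6 of the 9 nodes stay up and quorum is kept.
  update "controller-$i" &
  update "messaging-$i" &
  update "database-$i" &
  wait    # finish all three before moving on to the next batch
done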

What we observe when we do this update() in parallel is that the cluster ends up
in an unexpected state. Here is one timeline we reconstructed on one environment [1] (we obtain the same result when fencing is configured, see [2]):

1. At the following time, database-1 (which had just started and rejoined the cluster correctly) observed node messaging-1 coming back up:
Jan 15 04:22:20 database-1 pacemaker-based[249477]: notice: Node messaging-1 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node messaging-1 state is now member
Jan 15 04:22:20 database-1 pacemaker-attrd[249480]: notice: Node messaging-1 state is now member
Jan 15 04:22:20 database-1 corosync[249448]:  [QUORUM] Members[9]: 1 2 3 4 5 6 7 8 9
Jan 15 04:22:20 database-1 corosync[249448]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jan 15 04:22:20 database-1 pacemakerd[249473]: notice: Node messaging-1 state is now member
Jan 15 04:22:20 database-1 pacemaker-controld[249482]: notice: Node messaging-1 state is now member

2. The DC (messaging-2) tells database-1 to shut itself down. At this point database-1 stops itself and needs manual intervention:
Jan 15 04:22:20 database-1 pacemaker-attrd[249480]: notice: Detected another attribute writer (messaging-2), starting new election
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node rabbitmq-bundle-1 state is now lost
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node rabbitmq-bundle-2 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node galera-bundle-0 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node redis-bundle-1 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node redis-bundle-2 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node galera-bundle-2 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node galera-bundle-1 state is now lost
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node ovn-dbs-bundle-1 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node redis-bundle-0 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node ovn-dbs-bundle-2 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node ovn-dbs-bundle-0 state is now member
Jan 15 04:22:20 database-1 pacemaker-fenced[249478]: notice: Node rabbitmq-bundle-0 state is now member
Jan 15 04:22:20 database-1 pacemaker-attrd[249480]: notice: Recorded new attribute writer: messaging-1 (was database-1)
Jan 15 04:22:20 database-1 pacemaker-controld[249482]: notice: State transition S_PENDING -> S_NOT_DC
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: notice: Result of probe operation for galera-bundle-0 on database-1: 7 (not running)
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: notice: Result of probe operation for galera-bundle-2 on database-1: 7 (not running)
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: notice: Result of probe operation for rabbitmq-bundle-0 on database-1: 7 (not running)
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: notice: Result of probe operation for rabbitmq-bundle-2 on database-1: 7 (not running)
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: notice: Result of probe operation for redis-bundle-0 on database-1: 7 (not running)
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: error: We didn't ask to be shut down, yet our DC is telling us to.
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: notice: State transition S_NOT_DC -> S_STOPPING
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: notice: Stopped 0 recurring operations at shutdown... waiting (3 remaining)
Jan 15 04:22:22 database-1 pacemaker-controld[249482]: notice: Disconnected from the executor

3. On the DC (messaging-2) we see the following:
Jan 15 04:22:20 messaging-2 pacemaker-controld[14317]: notice: Node messaging-1 state is now member
Jan 15 04:22:20 messaging-2 pacemakerd[14311]: notice: Node messaging-1 state is now member
Jan 15 04:22:20 messaging-2 pacemaker-attrd[14315]: notice: Detected another attribute writer (database-1), starting new election
Jan 15 04:22:20 messaging-2 pacemaker-attrd[14315]: notice: Recorded new attribute writer: messaging-1 (was messaging-2)
Jan 15 04:22:20 messaging-2 pacemaker-attrd[14315]: notice: Recorded local node as attribute writer (was unset)
Jan 15 04:22:22 messaging-2 pacemaker-schedulerd[14316]: notice: Scheduling shutdown of node database-1
Jan 15 04:22:22 messaging-2 pacemaker-schedulerd[14316]: notice:  * Shutdown database-1
Jan 15 04:22:22 messaging-2 pacemaker-schedulerd[14316]: notice:  * Start      galera-bundle-1     (      controller-0 )   due to unrunnable galera-bundle-podman-1 start (blocked)
Jan 15 04:22:22 messaging-2 pacemaker-schedulerd[14316]: notice:  * Start      galera:1            (   galera-bundle-1 )   due to unrunnable galera-bundle-podman-1 start (blocked)

It's not entirely obvious to us why messaging-2 (the DC) decides to just shut down database-1 right after messaging-1 has joined the cluster, so this seems unexpected.

versions:
pacemaker-2.0.2-3.el8_1.2.x86_64
corosync-3.0.2-3.el8_1.1.x86_64

Additional Info:
We can reproduce this situation fairly reliably, although it takes ~14-15 hours for the job to reproduce, so not exactly super fast ;)

Comment 2 Ken Gaillot 2020-01-21 22:55:04 UTC
You always find the fun cases. :)

Pacemaker coordinates node shutdown via a special transient node attribute named (appropriately enough) "shutdown". If it's nonzero, the DC schedules the node for shutdown. When the node leaves the cluster, the attribute gets cleared.

Usually.

The DC is the one that clears a shutting-down node's transient node attributes from the CIB. What happened here is that database-1 left the cluster immediately after the DC (messaging-1 at the time) did, so the rest of the cluster was still in the process of electing a new DC when database-1 left -- meaning there was no one to clear its attributes from the CIB. When it later rejoined, its shutdown attribute was still nonzero, so it got scheduled for shutdown again.
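For reference (not part of the fix), the attribute in question can be inspected with crm_attribute against the transient ("reboot") lifetime; a minimal sketch, assuming database-1 is the affected node:

# Query database-1's transient "shutdown" attribute; a nonzero value means
# the node is still marked for shutdown in the CIB status section.
crm_attribute --node database-1 --name shutdown --lifetime reboot --query

# If it is stale, it can be cleared manually (normally the cluster does this itself).
crm_attribute --node database-1 --name shutdown --lifetime reboot --delete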

My first idea for a fix would be for every node to trigger the clearing if there is no DC when a node leaves. It's inefficient, but we already do the same thing when the DC itself leaves.

Workarounds in the meantime would be to either serialize updates, or query which node is the DC beforehand (via crmadmin -D) and save its update for last.
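A rough sketch of that second workaround, reusing the update() helper from the description; the awk parsing of crmadmin's output is an assumption, since the exact wording of the DC line varies between versions:

# Ask the cluster which node is currently the DC and keep its update for last.
dc_node=$(sudo crmadmin -D | awk '{print $NF}')   # last field is the DC's node name

for node in controller-{0,1,2} messaging-{0,1,2} database-{0,1,2}; do
  [ "$node" = "$dc_node" ] && continue            # skip the DC for now
  update "$node"
done

update "$dc_node"                                 # update the former DC last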

Comment 7 Ken Gaillot 2020-02-13 16:15:01 UTC
Fixed upstream by commit 01b463bd

Comment 15 Sofer Athlan-Guyot 2020-02-27 12:07:07 UTC
Hi, so I've run a parallel role update of OSP16 using the same OSP version that was usually failing with the previous pacemaker version [1].

The run was successful with respect to the current issue [2][3].

Note that I cannot move this bz to VERIFIED myself, but it is verified.

I'll expand a bit more and show that the right version of pacemaker was used:

# After the update the cluster is fine

[heat-admin@controller-1 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Cluster Summary:
* Stack: corosync
* Current DC: messaging-0 (version 2.0.3-5.el8-4b1f869f0f) - partition with quorum
* Last updated: Thu Feb 27 11:48:35 2020
* Last change:  Thu Feb 27 00:58:41 2020 by redis-bundle-2 via crm_attribute on controller-2
* 21 nodes configured
* 57 resource instances configured

Node List:
* Online: [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
* GuestOnline: [ galera-bundle-0@database-0 galera-bundle-1@database-1 galera-bundle-2@database-2 ovn-dbs-bundle-0@controller-1 ovn-dbs-bundle-1@controller-0 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@messaging-0 rabbitmq-bundle-1@messaging-1 rabbitmq-bundle-2@messaging-2 redis-bundle-0@controller-1 redis-bundle-1@controller-0 redis-bundle-2@controller-2 ]

Full List of Resources:
* Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest]:
* galera-bundle-0   (ocf::heartbeat:galera):        Master database-0
* galera-bundle-1   (ocf::heartbeat:galera):        Master database-1
* galera-bundle-2   (ocf::heartbeat:galera):        Master database-2
* Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
* rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster):      Started messaging-0
* rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster):      Started messaging-1
* rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster):      Started messaging-2
* Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]:
* redis-bundle-0    (ocf::heartbeat:redis): Slave controller-1
* redis-bundle-1    (ocf::heartbeat:redis): Slave controller-0
* redis-bundle-2    (ocf::heartbeat:redis): Master controller-2
* ip-192.168.24.33    (ocf::heartbeat:IPaddr2):       Started controller-2
* ip-2620.52.0.13b8.5054.ff.fe3e.1    (ocf::heartbeat:IPaddr2):       Started controller-2
* ip-fd00.fd00.fd00.2000..33c (ocf::heartbeat:IPaddr2):       Started controller-2
* ip-fd00.fd00.fd00.2000..10d (ocf::heartbeat:IPaddr2):       Started controller-2
* ip-fd00.fd00.fd00.3000..da  (ocf::heartbeat:IPaddr2):       Started controller-2
* ip-fd00.fd00.fd00.4000..1f9 (ocf::heartbeat:IPaddr2):       Started controller-2
* Container bundle set: haproxy-bundle [cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest]:
* haproxy-bundle-podman-0   (ocf::heartbeat:podman):        Started controller-1
* haproxy-bundle-podman-1   (ocf::heartbeat:podman):        Started controller-0
* haproxy-bundle-podman-2   (ocf::heartbeat:podman):        Started controller-2
* Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
* ovn-dbs-bundle-0  (ocf::ovn:ovndb-servers):       Master controller-1
* ovn-dbs-bundle-1  (ocf::ovn:ovndb-servers):       Slave controller-0
* ovn-dbs-bundle-2  (ocf::ovn:ovndb-servers):       Slave controller-2
* ip-fd00.fd00.fd00.2000..d4  (ocf::heartbeat:IPaddr2):       Started controller-1
* stonith-fence_ipmilan-5254000e7624  (stonith:fence_ipmilan):        Started database-0
* stonith-fence_ipmilan-5254008ddd24  (stonith:fence_ipmilan):        Started database-2
* stonith-fence_ipmilan-5254008d88d6  (stonith:fence_ipmilan):        Started messaging-0
* stonith-fence_ipmilan-5254006c8596  (stonith:fence_ipmilan):        Started messaging-0
* stonith-fence_ipmilan-525400d32f45  (stonith:fence_ipmilan):        Started messaging-1
* stonith-fence_ipmilan-525400979c13  (stonith:fence_ipmilan):        Started database-2
* stonith-fence_ipmilan-52540017c322  (stonith:fence_ipmilan):        Started messaging-2
* stonith-fence_ipmilan-525400a7bb1c  (stonith:fence_ipmilan):        Started messaging-1
* stonith-fence_ipmilan-525400619c69  (stonith:fence_ipmilan):        Started database-2
* Container bundle: openstack-cinder-backup [cluster.common.tag/rhosp16-openstack-cinder-backup:pcmklatest]:
* openstack-cinder-backup-podman-0  (ocf::heartbeat:podman):        Started controller-1
* Container bundle: openstack-cinder-volume [cluster.common.tag/rhosp16-openstack-cinder-volume:pcmklatest]:
* openstack-cinder-volume-podman-0  (ocf::heartbeat:podman):        Started controller-2


Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled



# pacemaker is updated in the container:

[heat-admin@controller-1 ~]$ sudo podman ps | grep hotfix
1b8a25822eb1  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-redis:20200130.1-hotfix          /bin/bash /usr/lo...  11 hours ago  Up 11 hours ago         redis-bundle-podman-0
ba22ff49e131  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-northd:20200130.1-hotfix     /bin/bash /usr/lo...  11 hours ago  Up 11 hours ago         ovn-dbs-bundle-podman-0
aa8f631b5910  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-backup:20200130.1-hotfix  /bin/bash /usr/lo...  11 hours ago  Up 11 hours ago         openstack-cinder-backup-podman-0
a7665faeb2f8  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:20200130.1-hotfix        /bin/bash /usr/lo...  11 hours ago  Up 11 hours ago         haproxy-bundle-podman-0
[heat-admin@controller-1 ~]$ echo and has the right version
and has the right version
[heat-admin@controller-1 ~]$ sudo podman exec -ti haproxy-bundle-podman-0 rpm -qa | grep ^pacemaker
pacemaker-cluster-libs-2.0.3-5.el8.x86_64
pacemaker-schemas-2.0.3-5.el8.noarch
pacemaker-cli-2.0.3-5.el8.x86_64
pacemaker-2.0.3-5.el8.x86_64
pacemaker-libs-2.0.3-5.el8.x86_64
pacemaker-remote-2.0.3-5.el8.x86_64


# on the host too
[heat-admin@controller-1 ~]$ rpm -qa | grep ^pacemaker
pacemaker-2.0.3-5.el8.x86_64
pacemaker-cli-2.0.3-5.el8.x86_64
pacemaker-libs-2.0.3-5.el8.x86_64
pacemaker-remote-2.0.3-5.el8.x86_64
pacemaker-schemas-2.0.3-5.el8.noarch
pacemaker-cluster-libs-2.0.3-5.el8.x86_64

Thanks.

[1] the current version of OSP16 is failing on some other things
[2] got a transient error during the last step of VM migration, but it's completely unrelated; see http://staging-jenkins2-qe-playground.usersys.redhat.com/view/update/job/DFG-upgrades-updates-16-from-passed_phase1-composable-ipv6/ job 5
[3] on the same CI, job 4 passed completely, but it was using pacemaker-2.0.3-4.el8.1.x86_64 in the container.

Comment 16 Patrik Hagara 2020-02-27 12:38:24 UTC
Marking verified in pacemaker-2.0.3-5.el8 as per comment #15.

Comment 18 errata-xmlrpc 2020-04-28 15:38:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:1609