Bug 1791841
Summary: | restarting nodes in parallel while maintaining quorum creates an unexpected node shutdown | |
---|---|---|---
Product: | Red Hat Enterprise Linux 8 | Reporter: | Michele Baldessari <michele>
Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot>
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | 8.1 | CC: | bdobreli, cluster-maint, lmiccini, phagara, pkomarov, sathlang
Target Milestone: | pre-dev-freeze | |
Target Release: | 8.2 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | pacemaker-2.0.3-5.el8 | Doc Type: | Bug Fix
Doc Text: |
Cause: When a non-DC node leaves the cluster, the DC clears its transient node attributes from the CIB.
Consequence: If no DC was elected at the time a node left, its transient node attributes would remain in effect when it later rejoined the cluster -- including shutting the node down again if it had been scheduled for shutdown previously.
Fix: If no DC is elected when a node leaves, all cluster nodes will remove the leaving node's transient node attributes from the CIB.
Result: No matter when a node leaves the cluster, its transient node attributes will be cleared from the CIB, avoiding a double shutdown when no DC is elected.
|
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2020-04-28 15:38:28 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1791135 | |
Description
Michele Baldessari
2020-01-16 15:20:36 UTC
You always find the fun cases. :)

Pacemaker coordinates node shutdown via a special transient node attribute named (appropriately enough) "shutdown". If it's nonzero, the DC schedules the node for shutdown. When the node leaves the cluster, the attribute gets cleared. Usually. The DC is the one that clears a shutting-down node's transient node attributes from the CIB.

What happened here is that database-1 left the cluster immediately after the DC (messaging-1 at the time) did, so the rest of the cluster was still in the process of electing a new DC when database-1 left -- meaning there was no one to clear its attributes from the CIB. When it later rejoined, its shutdown attribute was still nonzero, so it got scheduled for shutdown again.

My first idea for a fix would be for every node to trigger the clearing if there is no DC when a node leaves. It's inefficient, but we already do the same thing when the DC itself leaves.

Workarounds in the meantime would be to either serialize updates, or query which node is the DC beforehand (via crmadmin -D) and save its updates for last.

Fixed upstream by commit 01b463bd

Hi, so I've run a //role update of OSP16 using the same version that was usually failing with the previous pacemaker version[1]. The run was successful relative to the current issue[2][3]. Note that I cannot move this bz to VERIFIED, but it is.
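The "save the DC for last" workaround mentioned above can be sketched as a small shell helper. This is a minimal sketch, not part of the fix: `crmadmin -D` and `pcs cluster stop`/`pcs cluster start` are real commands, but the `order_dc_last` function, the output parsing, and the node names are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: restart cluster nodes one at a time, saving the current DC for
# last, so a DC is always available to clear the leaving node's transient
# attributes. crmadmin -D prints a line such as
#   "Designated Controller is: messaging-1"
# so the node name is the last whitespace-separated field.
dc_node=$(crmadmin -D | awk '{print $NF}')

# Reorder a node list so the given DC comes last (hypothetical helper).
order_dc_last() {
    dc="$1"; shift
    for n in "$@"; do
        [ "$n" = "$dc" ] || echo "$n"
    done
    echo "$dc"
}

for node in $(order_dc_last "$dc_node" controller-0 messaging-1 database-1); do
    # Serialized restart: stop and start pacemaker/corosync on one node
    # before touching the next, keeping quorum throughout.
    pcs cluster stop "$node"
    pcs cluster start "$node"
done
```

The key point is the ordering, not the restart commands: as long as the DC goes down last, every other node's departure is observed by a live DC that can clear its "shutdown" attribute.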
I'll expand a bit more and show that the right version of pacemaker was used:

```
# After the update the cluster is fine
[heat-admin@controller-1 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: messaging-0 (version 2.0.3-5.el8-4b1f869f0f) - partition with quorum
  * Last updated: Thu Feb 27 11:48:35 2020
  * Last change: Thu Feb 27 00:58:41 2020 by redis-bundle-2 via crm_attribute on controller-2
  * 21 nodes configured
  * 57 resource instances configured

Node List:
  * Online: [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
  * GuestOnline: [ galera-bundle-0@database-0 galera-bundle-1@database-1 galera-bundle-2@database-2 ovn-dbs-bundle-0@controller-1 ovn-dbs-bundle-1@controller-0 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@messaging-0 rabbitmq-bundle-1@messaging-1 rabbitmq-bundle-2@messaging-2 redis-bundle-0@controller-1 redis-bundle-1@controller-0 redis-bundle-2@controller-2 ]

Full List of Resources:
  * Container bundle set: galera-bundle [cluster.common.tag/rhosp16-openstack-mariadb:pcmklatest]:
    * galera-bundle-0 (ocf::heartbeat:galera): Master database-0
    * galera-bundle-1 (ocf::heartbeat:galera): Master database-1
    * galera-bundle-2 (ocf::heartbeat:galera): Master database-2
  * Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started messaging-0
    * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started messaging-1
    * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started messaging-2
  * Container bundle set: redis-bundle [cluster.common.tag/rhosp16-openstack-redis:pcmklatest]:
    * redis-bundle-0 (ocf::heartbeat:redis): Slave controller-1
    * redis-bundle-1 (ocf::heartbeat:redis): Slave controller-0
    * redis-bundle-2 (ocf::heartbeat:redis): Master controller-2
  * ip-192.168.24.33 (ocf::heartbeat:IPaddr2): Started controller-2
  * ip-2620.52.0.13b8.5054.ff.fe3e.1 (ocf::heartbeat:IPaddr2): Started controller-2
  * ip-fd00.fd00.fd00.2000..33c (ocf::heartbeat:IPaddr2): Started controller-2
  * ip-fd00.fd00.fd00.2000..10d (ocf::heartbeat:IPaddr2): Started controller-2
  * ip-fd00.fd00.fd00.3000..da (ocf::heartbeat:IPaddr2): Started controller-2
  * ip-fd00.fd00.fd00.4000..1f9 (ocf::heartbeat:IPaddr2): Started controller-2
  * Container bundle set: haproxy-bundle [cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-1
    * haproxy-bundle-podman-1 (ocf::heartbeat:podman): Started controller-0
    * haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started controller-2
  * Container bundle set: ovn-dbs-bundle [cluster.common.tag/rhosp16-openstack-ovn-northd:pcmklatest]:
    * ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master controller-1
    * ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave controller-0
    * ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave controller-2
  * ip-fd00.fd00.fd00.2000..d4 (ocf::heartbeat:IPaddr2): Started controller-1
  * stonith-fence_ipmilan-5254000e7624 (stonith:fence_ipmilan): Started database-0
  * stonith-fence_ipmilan-5254008ddd24 (stonith:fence_ipmilan): Started database-2
  * stonith-fence_ipmilan-5254008d88d6 (stonith:fence_ipmilan): Started messaging-0
  * stonith-fence_ipmilan-5254006c8596 (stonith:fence_ipmilan): Started messaging-0
  * stonith-fence_ipmilan-525400d32f45 (stonith:fence_ipmilan): Started messaging-1
  * stonith-fence_ipmilan-525400979c13 (stonith:fence_ipmilan): Started database-2
  * stonith-fence_ipmilan-52540017c322 (stonith:fence_ipmilan): Started messaging-2
  * stonith-fence_ipmilan-525400a7bb1c (stonith:fence_ipmilan): Started messaging-1
  * stonith-fence_ipmilan-525400619c69 (stonith:fence_ipmilan): Started database-2
  * Container bundle: openstack-cinder-backup [cluster.common.tag/rhosp16-openstack-cinder-backup:pcmklatest]:
    * openstack-cinder-backup-podman-0 (ocf::heartbeat:podman): Started controller-1
  * Container bundle: openstack-cinder-volume [cluster.common.tag/rhosp16-openstack-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

# pacemaker is updated in the container:
[heat-admin@controller-1 ~]$ sudo podman ps | grep hotfix
1b8a25822eb1  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-redis:20200130.1-hotfix          /bin/bash /usr/lo...  11 hours ago  Up 11 hours ago  redis-bundle-podman-0
ba22ff49e131  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-northd:20200130.1-hotfix     /bin/bash /usr/lo...  11 hours ago  Up 11 hours ago  ovn-dbs-bundle-podman-0
aa8f631b5910  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-backup:20200130.1-hotfix  /bin/bash /usr/lo...  11 hours ago  Up 11 hours ago  openstack-cinder-backup-podman-0
a7665faeb2f8  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:20200130.1-hotfix        /bin/bash /usr/lo...  11 hours ago  Up 11 hours ago  haproxy-bundle-podman-0

[heat-admin@controller-1 ~]$ echo and has the right version
and has the right version

[heat-admin@controller-1 ~]$ sudo podman exec -ti haproxy-bundle-podman-0 rpm -qa | grep ^pacemaker
pacemaker-cluster-libs-2.0.3-5.el8.x86_64
pacemaker-schemas-2.0.3-5.el8.noarch
pacemaker-cli-2.0.3-5.el8.x86_64
pacemaker-2.0.3-5.el8.x86_64
pacemaker-libs-2.0.3-5.el8.x86_64
pacemaker-remote-2.0.3-5.el8.x86_64

# on the host too
[heat-admin@controller-1 ~]$ rpm -qa | grep ^pacemaker
pacemaker-2.0.3-5.el8.x86_64
pacemaker-cli-2.0.3-5.el8.x86_64
pacemaker-libs-2.0.3-5.el8.x86_64
pacemaker-remote-2.0.3-5.el8.x86_64
pacemaker-schemas-2.0.3-5.el8.noarch
pacemaker-cluster-libs-2.0.3-5.el8.x86_64
```

Thanks.
[1] current version of osp16 is failing on some other things
[2] got a transient error during the last step of vm migration but it's completely unrelated, in http://staging-jenkins2-qe-playground.usersys.redhat.com/view/update/job/DFG-upgrades-updates-16-from-passed_phase1-composable-ipv6/ job 5
[3] on the same CI, job 4 passed completely but was using pacemaker-2.0.3-4.el8.1.x86_64 in the container

Marking verified in pacemaker-2.0.3-5.el8 as per comment#15.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:1609
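For anyone hitting this on an unfixed build, the stale transient "shutdown" attribute described above can be inspected and cleared by hand with crm_attribute (transient attributes have lifetime "reboot"). A minimal sketch; the node name is illustrative and these commands of course require a running cluster:

```shell
# Query the transient shutdown attribute on the rejoining node.
# A missing attribute or a value of 0 means no shutdown is pending;
# a leftover nonzero epoch timestamp is the stale value that caused
# the unexpected second shutdown.
crm_attribute --node database-1 --name shutdown --lifetime reboot --query

# Clear a stale value manually before (or instead of) letting the node
# be scheduled for shutdown again.
crm_attribute --node database-1 --name shutdown --lifetime reboot --delete
```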