Description of problem:

Rebooting controller nodes according to [1] gets stuck for 20 minutes while running 'pcs cluster stop'. This issue appears to be a regression that showed up after the switch to RHEL 7.6 overcloud images.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/director_installation_and_usage/#sect-Rebooting-Controller

Version-Release number of selected component (if applicable):
2018-10-25.3 puddle

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud with 3 controllers
2. SSH to one of the controllers
3. time sudo pcs cluster stop

Actual results:

[heat-admin@controller-2 ~]$ time sudo pcs cluster stop
Stopping Cluster (pacemaker)...
Stopping Cluster (corosync)...

real    20m1.692s
user    0m0.173s
sys     0m0.086s

Expected results:
Response time in a matter of seconds.

Additional info:
Attaching sosreport.
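While the stop hangs, the pacemaker log on the DC shows the transition waiting on an in-flight resource operation that never completes (the exact messages appear in the analysis below). A quick check, as a sketch assuming the default cluster log location:

# run on the current DC while 'pcs cluster stop' is hanging
sudo grep 'In-flight rsc op' /var/log/cluster/corosync.log | tail -n 5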
It is suspicious that it's only the notify actions, and only on one machine, that aren't being seen to complete. Not sure I buy into comment #6, because the lrmd (local and remote) don't subscribe to cib updates. There also just aren't that many updates for the crmd to be bogged down with.

Hard to triage though; the sos reports seem to be from different incidents: Oct 30 14:23 on controller-1 and Oct 29 22:41 on controller-2.

Can we get consistent logs please?
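For pulling a consistent set in one go, something like the stock ansible fetch module should work (a sketch; the 'controller' inventory group name is the one used in the verification steps later in this bug):

(undercloud) [stack@undercloud-0 ~]$ ansible controller -b -m fetch -a 'src=/var/log/cluster/corosync.log dest=corosync-logs/'

This lands each node's log under corosync-logs/<hostname>/var/log/cluster/corosync.log, so the copies are unambiguous per host and from the same moment.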
Would it be possible to reproduce this on a 3-node 7.5 cluster and attach /var/log/cluster/corosync.log from each of the nodes? There appears to be a pre-existing bug in how the cluster picks which host gets a remote command when the remote connection is moving, but for whatever reason the 7.5 packages cleanly recovered from the fallout, whereas the 7.6 packages do not. The 7.5 logs would help me figure out what's different.

Using both the 7.5 and 7.6 packages to run crm_simulate --save-graph on controller-2's pe-input-79.bz2 input from the 3-node reproducer, the cluster wrongly routes the pre-notify action for bundle instance 0 through its ultimate destination (controller-1) rather than its current location (controller-0). For example (note the router_node):

<rsc_op id="205" operation="notify" operation_key="rabbitmq_pre_notify_stop_0" internal_operation_key="rabbitmq:0_pre_notify_stop_0" on_node="rabbitmq-bundle-0" on_node_uuid="rabbitmq-bundle-0" router_node="controller-1">

So there will likely be two bugs to fix here: the command routing issue (which appears to be longstanding), and the handling of a bad routing (which may be a regression).
# crm_mon -x ~/pe-input-79.bz2 -R
[...]
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest]
   Replica[0]
     redis-bundle-docker-0    (ocf::heartbeat:docker):    Started controller-0
     redis-bundle-0    (ocf::pacemaker:remote):    Started controller-0
     redis    (ocf::heartbeat:redis):    Master redis-bundle-0

# crm_simulate -Sx ~/pe-input-79.bz2 -G 79.xml
# grep redis_pre_notify_demote_0.*bundle-0 79.xml
<rsc_op id="230" operation="notify" operation_key="redis_pre_notify_demote_0" internal_operation_key="redis:0_pre_notify_demote_0" on_node="redis-bundle-0" on_node_uuid="redis-bundle-0" router_node="controller-1">

Specifically: router_node="controller-1"

See also:

Oct 31 11:54:02 [23528] controller-2 crmd: error: print_synapse: [Action 215]: In-flight rsc op redis_pre_notify_demote_0 on redis-bundle-0 (priority: 0, waiting: none)
Oct 31 11:51:32 [23525] controller-1 crmd: info: synthesize_lrmd_failure: Faking redis_notify_0 result (0) on redis-bundle-0

Also, this appears to be specific to notifications, because router_node is set correctly for the stop and demote actions:

<rsc_op id="138" operation="stop" operation_key="redis_stop_0" internal_operation_key="redis:0_stop_0" on_node="redis-bundle-0" on_node_uuid="redis-bundle-0" router_node="controller-0">
<rsc_op id="137" operation="demote" operation_key="redis_demote_0" internal_operation_key="redis:0_demote_0" on_node="redis-bundle-0" on_node_uuid="redis-bundle-0" router_node="controller-0"/>

Originally this input would have resulted in a borked transition. That was fixed some time prior to d28399b, but at that point notifications do not appear to be sent out at all. Notifications started being sent as of 31d4f2e01d47312ff21da845f0fc6816630c00ec, but with the wrong router_node set.
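To eyeball every routing decision in the saved graph at once (file name from the crm_simulate run above), a plain grep is enough; the notify actions are the ones whose router_node disagrees with the connection's current host:

# each remote operation with its routing decision
grep 'router_node=' 79.xml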
Seems the logic in get_router_node() is insufficient: for notify actions it apparently picks the remote connection's ultimate destination rather than its current host, which matches the wrong router_node values above.
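To locate that routing decision in a source checkout (function name from this comment, repo from the fix below), a sketch:

# inside a clone of the pacemaker source tree
git grep -n 'get_router_node'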
Fixed in https://github.com/ClusterLabs/pacemaker/commit/a21fd00
The difference from 7.5 to 7.6 is a side effect of unrelated bug fixes improving fail-safe checking of faked results. In 7.5 in this situation, the faked result would be unconditionally processed. In 7.6, we first check whether the local daemon has resource info for the result being faked; since the wrong node is being consulted, that info doesn't exist, and the faked result is not processed.

It's debatable whether the new approach is better, but I tend to think highlighting problems like this is better than papering them over. Of course I wish it had turned up in testing, but we'll use this case as a regression test going forward.

Note that even in 7.5, the notify action is not being called when it should be; instead, the wrong router node successfully fakes the result. So I'll try to get this into 7.5.z as well as 7.6.z.
This has been fixed in the upstream 1.1 branch by commit 2ba3fffc (backport of master branch commit a21fd00)
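For anyone building from source, a quick check that a checkout carries the backport (repo URL from the comment above, commit hash from this one):

git clone https://github.com/ClusterLabs/pacemaker.git
cd pacemaker
# lists every branch containing the 1.1 backport commit
git branch -a --contains 2ba3fffc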
QA: To reproduce, configure a cluster with at least three cluster nodes, and a guest node and/or bundle. Run "pcs cluster stop" on one of the cluster nodes that is running the guest and/or bundle. (I expect the problem would also appear if you ban the guest node and/or bundle from one of the cluster nodes running it.) Before the fix, the operation will not complete, and the node will eventually force-exit after a timeout. After the fix, shutdown proceeds normally.
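A minimal sketch of both triggers, assuming the bundle and node names seen elsewhere in this bug (any guest node or bundle should do):

# trigger 1: stop the cluster on a node currently hosting a guest/bundle
[heat-admin@controller-2 ~]$ time sudo pcs cluster stop

# trigger 2 (expected but not confirmed above): ban the bundle off one of
# its current hosts instead of stopping the whole node
[heat-admin@controller-2 ~]$ sudo pcs resource ban redis-bundle controller-2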
Verified. Initial state:

(undercloud) [stack@undercloud-0 ~]$ ansible controller -m shell -b -a 'rpm -qa|grep pacemaker'
 [WARNING]: Found both group and host with same name: undercloud
 [WARNING]: Consider using the yum, dnf or zypper module rather than running rpm. If you need to use command because yum, dnf or zypper is insufficient you can add warn=False to this command task or set command_warnings=False in ansible.cfg to get rid of this message.

controller-0 | SUCCESS | rc=0 >>
pacemaker-cli-1.1.20-1.el7.x86_64
pacemaker-remote-1.1.20-1.el7.x86_64
pacemaker-1.1.20-1.el7.x86_64
pacemaker-cluster-libs-1.1.20-1.el7.x86_64
puppet-pacemaker-0.7.2-0.20181008172520.9a4bc2d.el7ost.noarch
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
pacemaker-libs-1.1.20-1.el7.x86_64

controller-2 | SUCCESS | rc=0 >>
pacemaker-cli-1.1.20-1.el7.x86_64
pacemaker-remote-1.1.20-1.el7.x86_64
pacemaker-1.1.20-1.el7.x86_64
pacemaker-cluster-libs-1.1.20-1.el7.x86_64
puppet-pacemaker-0.7.2-0.20181008172520.9a4bc2d.el7ost.noarch
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
pacemaker-libs-1.1.20-1.el7.x86_64

controller-1 | SUCCESS | rc=0 >>
pacemaker-cli-1.1.20-1.el7.x86_64
pacemaker-remote-1.1.20-1.el7.x86_64
pacemaker-1.1.20-1.el7.x86_64
pacemaker-cluster-libs-1.1.20-1.el7.x86_64
puppet-pacemaker-0.7.2-0.20181008172520.9a4bc2d.el7ost.noarch
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
pacemaker-libs-1.1.20-1.el7.x86_64

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -m shell -b -a 'pcs status'
controller-1 | SUCCESS | rc=0 >>
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.20-1.el7-1642a7f847) - partition with quorum
Last updated: Mon Feb 25 06:57:02 2019
Last change: Mon Feb 25 06:52:20 2019 by hacluster via crmd on controller-2

12 nodes configured
37 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-2 galera-bundle-1@controller-0 galera-bundle-2@controller-1 rabbitmq-bundle-0@controller-2 rabbitmq-bundle-1@controller-0 rabbitmq-bundle-2@controller-1 redis-bundle-0@controller-2 redis-bundle-1@controller-0 redis-bundle-2@controller-1 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):    Started controller-2
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):    Started controller-0
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):    Started controller-1
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0    (ocf::heartbeat:galera):    Master controller-2
   galera-bundle-1    (ocf::heartbeat:galera):    Master controller-0
   galera-bundle-2    (ocf::heartbeat:galera):    Master controller-1
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0    (ocf::heartbeat:redis):    Slave controller-2
   redis-bundle-1    (ocf::heartbeat:redis):    Master controller-0
   redis-bundle-2    (ocf::heartbeat:redis):    Slave controller-1
 ip-192.168.24.14    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-10.0.0.101    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.1.12    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.1.21    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.3.23    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.4.30    (ocf::heartbeat:IPaddr2):    Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0    (ocf::heartbeat:docker):    Started controller-2
   haproxy-bundle-docker-1    (ocf::heartbeat:docker):    Started controller-0
   haproxy-bundle-docker-2    (ocf::heartbeat:docker):    Started controller-1
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0    (ocf::heartbeat:docker):    Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -m shell -b -a 'cat /etc/*release*'
 [WARNING]: Found both group and host with same name: undercloud

controller-1 | SUCCESS | rc=0 >>
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.6"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.6 (Maipo)"

# check the time of 'pcs cluster stop':

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -m shell -b -a 'time pcs cluster stop'
 [WARNING]: Found both group and host with same name: undercloud

controller-1 | SUCCESS | rc=0 >>
Stopping Cluster (pacemaker)...
Stopping Cluster (corosync)...

real    0m44.985s
user    0m0.200s
sys     0m0.123s

(undercloud) [stack@undercloud-0 ~]$ ansible controller-2 -m shell -b -a 'pcs status'
 [WARNING]: Found both group and host with same name: undercloud

controller-2 | SUCCESS | rc=0 >>
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.20-1.el7-1642a7f847) - partition with quorum
Last updated: Mon Feb 25 07:01:34 2019
Last change: Mon Feb 25 06:52:20 2019 by hacluster via crmd on controller-2

12 nodes configured
37 resources configured

Online: [ controller-0 controller-2 ]
OFFLINE: [ controller-1 ]
GuestOnline: [ galera-bundle-0@controller-2 galera-bundle-1@controller-0 rabbitmq-bundle-0@controller-2 rabbitmq-bundle-1@controller-0 redis-bundle-0@controller-2 redis-bundle-1@controller-0 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):    Started controller-2
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):    Started controller-0
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):    Stopped
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0    (ocf::heartbeat:galera):    Master controller-2
   galera-bundle-1    (ocf::heartbeat:galera):    Master controller-0
   galera-bundle-2    (ocf::heartbeat:galera):    Stopped
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0    (ocf::heartbeat:redis):    Slave controller-2
   redis-bundle-1    (ocf::heartbeat:redis):    Master controller-0
   redis-bundle-2    (ocf::heartbeat:redis):    Stopped
 ip-192.168.24.14    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-10.0.0.101    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.1.12    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.1.21    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.3.23    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.4.30    (ocf::heartbeat:IPaddr2):    Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0    (ocf::heartbeat:docker):    Started controller-2
   haproxy-bundle-docker-1    (ocf::heartbeat:docker):    Started controller-0
   haproxy-bundle-docker-2    (ocf::heartbeat:docker):    Stopped
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0    (ocf::heartbeat:docker):    Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -m shell -b -a 'time pcs cluster start'
 [WARNING]: Found both group and host with same name: undercloud

controller-1 | SUCCESS | rc=0 >>
Starting Cluster (corosync)...
Starting Cluster (pacemaker)...

real    0m1.052s
user    0m0.198s
sys     0m0.064s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2129