Description of problem:

Rebooting controller nodes according to [1] gets stuck for 20 minutes while running 'pcs cluster stop'. This issue appears to be a regression that showed up after the switch to RHEL 7.6 overcloud images.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/director_installation_and_usage/#sect-Rebooting-Controller

Version-Release number of selected component (if applicable):
2018-10-25.3 puddle

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud with 3 controllers
2. SSH to one of the controllers
3. time sudo pcs cluster stop

Actual results:

[heat-admin@controller-2 ~]$ time sudo pcs cluster stop
Stopping Cluster (pacemaker)...
Stopping Cluster (corosync)...

real    20m1.692s
user    0m0.173s
sys     0m0.086s

Expected results:
Response time in a matter of seconds.

Additional info:
Attaching sosreport.
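While the stop hangs, the pacemaker log on the DC shows the transition waiting on an in-flight resource operation that never completes (the exact messages appear in the analysis below). A quick check, as a sketch assuming the default cluster log location:

# run on the current DC while 'pcs cluster stop' is hanging
sudo grep 'In-flight rsc op' /var/log/cluster/corosync.log | tail -n 5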
It is suspicious that it's only the notify actions, and only on one machine, that aren't being seen to complete. Not sure I buy into comment #6, because the lrmd (local and remote) don't subscribe to cib updates. There also just aren't that many updates for the crmd to be bogged down with.

Hard to triage though; the sos reports seem to be from different incidents: Oct 30 14:23 on controller-1 and Oct 29 22:41 on controller-2.

Can we get consistent logs please?
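For pulling a consistent set in one go, something like the stock ansible fetch module should work (a sketch; the 'controller' inventory group name is the one used in the verification steps later in this bug):

(undercloud) [stack@undercloud-0 ~]$ ansible controller -b -m fetch -a 'src=/var/log/cluster/corosync.log dest=corosync-logs/'

This lands each node's log under corosync-logs/<hostname>/var/log/cluster/corosync.log, so the copies are unambiguous per host and from the same moment.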
Would it be possible to reproduce this on a 3-node 7.5 cluster and attach /var/log/cluster/corosync.log from each of the nodes? There appears to be a pre-existing bug in how the cluster picks which host gets a remote command when the remote connection is moving, but for whatever reason the 7.5 packages cleanly recovered from the fallout, whereas the 7.6 packages do not. The 7.5 logs would help me figure out what's different.

Using both the 7.5 and 7.6 packages to run crm_simulate --save-graph on controller-2's pe-input-79.bz2 input from the 3-node reproducer, the cluster wrongly routes the pre-notify action for bundle instance 0 through its ultimate destination (controller-1) rather than its current location (controller-0). For example (note the router_node):

<rsc_op id="205" operation="notify" operation_key="rabbitmq_pre_notify_stop_0" internal_operation_key="rabbitmq:0_pre_notify_stop_0" on_node="rabbitmq-bundle-0" on_node_uuid="rabbitmq-bundle-0" router_node="controller-1">

So there will likely be two bugs to fix here: the command routing issue (which appears to be longstanding), and the handling of a bad routing (which may be a regression).
# crm_mon -x ~/pe-input-79.bz2 -R
[...]
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest]
   Replica[0]
     redis-bundle-docker-0    (ocf::heartbeat:docker):    Started controller-0
     redis-bundle-0    (ocf::pacemaker:remote):    Started controller-0
     redis    (ocf::heartbeat:redis):    Master redis-bundle-0

# crm_simulate -Sx ~/pe-input-79.bz2 -G 79.xml
# grep redis_pre_notify_demote_0.*bundle-0 79.xml
<rsc_op id="230" operation="notify" operation_key="redis_pre_notify_demote_0" internal_operation_key="redis:0_pre_notify_demote_0" on_node="redis-bundle-0" on_node_uuid="redis-bundle-0" router_node="controller-1">

Specifically: router_node="controller-1"

See also:

Oct 31 11:54:02 [23528] controller-2 crmd: error: print_synapse: [Action 215]: In-flight rsc op redis_pre_notify_demote_0 on redis-bundle-0 (priority: 0, waiting: none)
Oct 31 11:51:32 [23525] controller-1 crmd: info: synthesize_lrmd_failure: Faking redis_notify_0 result (0) on redis-bundle-0

Also, this appears to be specific to notifications, because router_node is set correctly for the stop and demote actions:

<rsc_op id="138" operation="stop" operation_key="redis_stop_0" internal_operation_key="redis:0_stop_0" on_node="redis-bundle-0" on_node_uuid="redis-bundle-0" router_node="controller-0">
<rsc_op id="137" operation="demote" operation_key="redis_demote_0" internal_operation_key="redis:0_demote_0" on_node="redis-bundle-0" on_node_uuid="redis-bundle-0" router_node="controller-0"/>

Originally this input would have resulted in a borked transition. That was fixed some time prior to d28399b, but at that point notifications do not appear to be sent out at all. Notifications started being sent as of 31d4f2e01d47312ff21da845f0fc6816630c00ec, but with the wrong router_node set.
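To eyeball every routing decision in the saved graph at once (file name from the crm_simulate run above), a plain grep is enough; the notify actions are the ones whose router_node disagrees with the connection's current host:

# each remote operation with its routing decision
grep 'router_node=' 79.xml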
Seems the logic in get_router_node() is insufficient: for notify actions it apparently picks the remote connection's ultimate destination rather than its current host, which matches the wrong router_node values above.
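To locate that routing decision in a source checkout (function name from this comment, repo from the fix below), a sketch:

# inside a clone of the pacemaker source tree
git grep -n 'get_router_node'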
Fixed in https://github.com/ClusterLabs/pacemaker/commit/a21fd00
The difference from 7.5 to 7.6 is a side effect of unrelated bug fixes improving fail-safe checking of faked results. In 7.5 in this situation, the faked result would be unconditionally processed. In 7.6, we first check whether the local daemon has resource info for the result being faked; since the wrong node is being consulted, that info doesn't exist, and the faked result is not processed.

It's debatable whether the new approach is better, but I tend to think highlighting problems like this is better than papering them over. Of course I wish it had turned up in testing, but we'll use this case as a regression test going forward.

Note that even in 7.5, the notify action is not being called when it should be; instead, the wrong router node successfully fakes the result. So I'll try to get this into 7.5.z as well as 7.6.z.
This has been fixed in the upstream 1.1 branch by commit 2ba3fffc (backport of master branch commit a21fd00)
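For anyone building from source, a quick check that a checkout carries the backport (repo URL from the comment above, commit hash from this one):

git clone https://github.com/ClusterLabs/pacemaker.git
cd pacemaker
# lists every branch containing the 1.1 backport commit
git branch -a --contains 2ba3fffc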
QA: To reproduce, configure a cluster with at least three cluster nodes, and a guest node and/or bundle. Run "pcs cluster stop" on one of the cluster nodes that is running the guest and/or bundle. (I expect the problem would also appear if you ban the guest node and/or bundle from one of the cluster nodes running it.) Before the fix, the operation will not complete, and the node will eventually force-exit after a timeout. After the fix, shutdown proceeds normally.
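A minimal sketch of both triggers, assuming the bundle and node names seen elsewhere in this bug (any guest node or bundle should do):

# trigger 1: stop the cluster on a node currently hosting a guest/bundle
[heat-admin@controller-2 ~]$ time sudo pcs cluster stop

# trigger 2 (expected but not confirmed above): ban the bundle off one of
# its current hosts instead of stopping the whole node
[heat-admin@controller-2 ~]$ sudo pcs resource ban redis-bundle controller-2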
Verified. Initial state:

(undercloud) [stack@undercloud-0 ~]$ ansible controller -m shell -b -a 'rpm -qa|grep pacemaker'
 [WARNING]: Found both group and host with same name: undercloud
 [WARNING]: Consider using the yum, dnf or zypper module rather than running rpm. If you need to use command because yum, dnf or zypper is insufficient you can add warn=False to this command task or set command_warnings=False in ansible.cfg to get rid of this message.

controller-0 | SUCCESS | rc=0 >>
pacemaker-cli-1.1.20-1.el7.x86_64
pacemaker-remote-1.1.20-1.el7.x86_64
pacemaker-1.1.20-1.el7.x86_64
pacemaker-cluster-libs-1.1.20-1.el7.x86_64
puppet-pacemaker-0.7.2-0.20181008172520.9a4bc2d.el7ost.noarch
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
pacemaker-libs-1.1.20-1.el7.x86_64

controller-2 | SUCCESS | rc=0 >>
pacemaker-cli-1.1.20-1.el7.x86_64
pacemaker-remote-1.1.20-1.el7.x86_64
pacemaker-1.1.20-1.el7.x86_64
pacemaker-cluster-libs-1.1.20-1.el7.x86_64
puppet-pacemaker-0.7.2-0.20181008172520.9a4bc2d.el7ost.noarch
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
pacemaker-libs-1.1.20-1.el7.x86_64

controller-1 | SUCCESS | rc=0 >>
pacemaker-cli-1.1.20-1.el7.x86_64
pacemaker-remote-1.1.20-1.el7.x86_64
pacemaker-1.1.20-1.el7.x86_64
pacemaker-cluster-libs-1.1.20-1.el7.x86_64
puppet-pacemaker-0.7.2-0.20181008172520.9a4bc2d.el7ost.noarch
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
pacemaker-libs-1.1.20-1.el7.x86_64

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -m shell -b -a 'pcs status'
controller-1 | SUCCESS | rc=0 >>
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.20-1.el7-1642a7f847) - partition with quorum
Last updated: Mon Feb 25 06:57:02 2019
Last change: Mon Feb 25 06:52:20 2019 by hacluster via crmd on controller-2

12 nodes configured
37 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-2 galera-bundle-1@controller-0 galera-bundle-2@controller-1 rabbitmq-bundle-0@controller-2 rabbitmq-bundle-1@controller-0 rabbitmq-bundle-2@controller-1 redis-bundle-0@controller-2 redis-bundle-1@controller-0 redis-bundle-2@controller-1 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):    Started controller-2
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):    Started controller-0
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):    Started controller-1
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0    (ocf::heartbeat:galera):    Master controller-2
   galera-bundle-1    (ocf::heartbeat:galera):    Master controller-0
   galera-bundle-2    (ocf::heartbeat:galera):    Master controller-1
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0    (ocf::heartbeat:redis):    Slave controller-2
   redis-bundle-1    (ocf::heartbeat:redis):    Master controller-0
   redis-bundle-2    (ocf::heartbeat:redis):    Slave controller-1
 ip-192.168.24.14    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-10.0.0.101    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.1.12    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.1.21    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.3.23    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.4.30    (ocf::heartbeat:IPaddr2):    Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0    (ocf::heartbeat:docker):    Started controller-2
   haproxy-bundle-docker-1    (ocf::heartbeat:docker):    Started controller-0
   haproxy-bundle-docker-2    (ocf::heartbeat:docker):    Started controller-1
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0    (ocf::heartbeat:docker):    Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -m shell -b -a 'cat /etc/*release*'
 [WARNING]: Found both group and host with same name: undercloud

controller-1 | SUCCESS | rc=0 >>
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.6"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.6 (Maipo)"

# check the time of 'pcs cluster stop':

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -m shell -b -a 'time pcs cluster stop'
 [WARNING]: Found both group and host with same name: undercloud

controller-1 | SUCCESS | rc=0 >>
Stopping Cluster (pacemaker)...
Stopping Cluster (corosync)...

real    0m44.985s
user    0m0.200s
sys     0m0.123s

(undercloud) [stack@undercloud-0 ~]$ ansible controller-2 -m shell -b -a 'pcs status'
 [WARNING]: Found both group and host with same name: undercloud

controller-2 | SUCCESS | rc=0 >>
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.20-1.el7-1642a7f847) - partition with quorum
Last updated: Mon Feb 25 07:01:34 2019
Last change: Mon Feb 25 06:52:20 2019 by hacluster via crmd on controller-2

12 nodes configured
37 resources configured

Online: [ controller-0 controller-2 ]
OFFLINE: [ controller-1 ]
GuestOnline: [ galera-bundle-0@controller-2 galera-bundle-1@controller-0 rabbitmq-bundle-0@controller-2 rabbitmq-bundle-1@controller-0 redis-bundle-0@controller-2 redis-bundle-1@controller-0 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):    Started controller-2
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):    Started controller-0
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):    Stopped
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0    (ocf::heartbeat:galera):    Master controller-2
   galera-bundle-1    (ocf::heartbeat:galera):    Master controller-0
   galera-bundle-2    (ocf::heartbeat:galera):    Stopped
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0    (ocf::heartbeat:redis):    Slave controller-2
   redis-bundle-1    (ocf::heartbeat:redis):    Master controller-0
   redis-bundle-2    (ocf::heartbeat:redis):    Stopped
 ip-192.168.24.14    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-10.0.0.101    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.1.12    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.1.21    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.3.23    (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-172.17.4.30    (ocf::heartbeat:IPaddr2):    Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0    (ocf::heartbeat:docker):    Started controller-2
   haproxy-bundle-docker-1    (ocf::heartbeat:docker):    Started controller-0
   haproxy-bundle-docker-2    (ocf::heartbeat:docker):    Stopped
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0    (ocf::heartbeat:docker):    Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -m shell -b -a 'time pcs cluster start'
 [WARNING]: Found both group and host with same name: undercloud

controller-1 | SUCCESS | rc=0 >>
Starting Cluster (corosync)...
Starting Cluster (pacemaker)...

real    0m1.052s
user    0m0.198s
sys     0m0.064s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2129