Bug 1644076 - Rebooting controller nodes hangs for 20 minutes when running 'pcs cluster stop'
Summary: Rebooting controller nodes hangs for 20 minutes when running 'pcs cluster stop'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: pre-dev-freeze
Target Release: 7.7
Assignee: Ken Gaillot
QA Contact: pkomarov
URL:
Whiteboard:
Depends On:
Blocks: 1646346 1646347 1654601
 
Reported: 2018-10-29 23:06 UTC by Marius Cornea
Modified: 2019-08-06 12:54 UTC
CC List: 15 users

Fixed In Version: pacemaker-1.1.20-1.el7
Doc Type: Bug Fix
Doc Text:
Cause: Pacemaker now verifies a node has resource information for a resource before synthesizing an internal result for an action on that resource.
Consequence: This uncovered a pre-existing bug where clone notify actions on a Pacemaker Remote node would be routed through the wrong cluster node when the Pacemaker Remote connection is moving, causing the cluster to indefinitely loop trying to move the connection.
Fix: Notify actions are now routed through the proper cluster node when a Pacemaker Remote connection is moving.
Result: Notify actions are completed successfully, and the Pacemaker Remote connection is able to move successfully.
Clone Of:
Clones: 1646346 1646347 1654601
Environment:
Last Closed: 2019-08-06 12:53:44 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID                                   Last Updated
Red Hat Knowledge Base (Solution) 3703011   2018-11-19 18:13:42 UTC
Red Hat Product Errata RHBA-2019:2129       2019-08-06 12:54:11 UTC

Description Marius Cornea 2018-10-29 23:06:57 UTC
Description of problem:

Rebooting controller nodes according to [1] gets stuck for 20 minutes while running 'pcs cluster stop'. This issue appears to be a regression showing up after the switch to RHEL 7.6 overcloud images.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/director_installation_and_usage/#sect-Rebooting-Controller
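
For context, the per-controller reboot flow in [1] is roughly the following (a sketch assembled from the commands used elsewhere in this bug; see the linked guide for the authoritative steps):

# On the controller being rebooted
sudo pcs cluster stop      # the step that hangs for ~20 minutes
sudo reboot

# After the node comes back up
sudo pcs cluster start
sudo pcs status            # confirm the node and its resources have recovered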

Version-Release number of selected component (if applicable):
2018-10-25.3 puddle

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud with 3 controllers
2. SSH to one of the controllers
3. time sudo pcs cluster stop

Actual results:
[heat-admin@controller-2 ~]$ time sudo pcs cluster stop
Stopping Cluster (pacemaker)...
Stopping Cluster (corosync)...

real    20m1.692s
user    0m0.173s
sys     0m0.086s


Expected results:
The command should complete in a matter of seconds.

Additional info:
Attaching sosreport.

Comment 7 Andrew Beekhof 2018-10-31 02:00:31 UTC
It is suspicious that it's only the notify actions, and only on one machine, that aren't being seen to complete.

Not sure I buy into comment #6, because the lrmd (local and remote) don't subscribe to cib updates. There also just aren't that many updates for the crmd to be bogged down with, either.

Hard to triage though; the sos reports seem to be from different incidents:
  Oct 30 14:23 on controller-1 
and
  Oct 29 22:41 on controller-2

Can we get consistent logs please?

Comment 13 Ken Gaillot 2018-10-31 22:00:27 UTC
Would it be possible to reproduce this on a 3-node 7.5 cluster and attach /var/log/cluster/corosync.log from each of the nodes?
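
One way to gather those logs across the controllers, reusing the ansible pattern from the verification further below (hypothetical destination directory):

# Fetch corosync.log from every controller into ./corosync-logs/<hostname>/...
ansible controller -b -m fetch -a 'src=/var/log/cluster/corosync.log dest=corosync-logs/'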

There appears to be a pre-existing bug in how the cluster picks which host gets a remote command when the remote connection is moving, but for whatever reason the 7.5 packages cleanly recovered from the fallout, whereas the 7.6 packages do not. The 7.5 logs would help me figure out what's different.

Running crm_simulate --save-graph with both the 7.5 and 7.6 packages against controller-2's pe-input-79.bz2 input from the 3-node reproducer, the cluster wrongly routes the pre-notify action for bundle instance 0 through its ultimate destination (controller-1) rather than its current location (controller-0). For example (note the router_node):

<rsc_op id="205" operation="notify" operation_key="rabbitmq_pre_notify_stop_0" internal_operation_key="rabbitmq:0_pre_notify_stop_0" on_node="rabbitmq-bundle-0" on_node_uuid="rabbitmq-bundle-0" router_node="controller-1">

So, there will likely be two bugs to fix here: the command routing issue (which appears to be longstanding), and the handling of bad routing (which may be a regression).

Comment 16 Andrew Beekhof 2018-11-01 08:19:04 UTC
# crm_mon -x ~/pe-input-79.bz2 -R
[...]
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest]
   Replica[0]
      redis-bundle-docker-0	(ocf::heartbeat:docker):	Started controller-0
      redis-bundle-0	(ocf::pacemaker:remote):	Started controller-0
      redis	(ocf::heartbeat:redis):	Master redis-bundle-0

# crm_simulate -Sx ~/pe-input-79.bz2 -G 79.xml

# grep redis_pre_notify_demote_0.*bundle-0 79.xml 
      <rsc_op id="230" operation="notify" operation_key="redis_pre_notify_demote_0" internal_operation_key="redis:0_pre_notify_demote_0" on_node="redis-bundle-0" on_node_uuid="redis-bundle-0" router_node="controller-1">


Specifically: router_node="controller-1"

See also:

Oct 31 11:54:02 [23528] controller-2       crmd:    error: print_synapse:	[Action  215]: In-flight rsc op redis_pre_notify_demote_0         on redis-bundle-0 (priority: 0, waiting: none)

Oct 31 11:51:32 [23525] controller-1       crmd:     info: synthesize_lrmd_failure:	Faking redis_notify_0 result (0) on redis-bundle-0


Also, this appears to be specific to notifications because router_node is set correctly for the stop and demote actions:

      <rsc_op id="138" operation="stop" operation_key="redis_stop_0" internal_operation_key="redis:0_stop_0" on_node="redis-bundle-0" on_node_uuid="redis-bundle-0" router_node="controller-0">
        <rsc_op id="137" operation="demote" operation_key="redis_demote_0" internal_operation_key="redis:0_demote_0" on_node="redis-bundle-0" on_node_uuid="redis-bundle-0" router_node="controller-0"/>


Originally this input would have resulted in a borked transition.
That was fixed some time prior to d28399b, but notifications don't appear to be sent out at that point.

The notifications started being sent as of 31d4f2e01d47312ff21da845f0fc6816630c00ec, but they had the wrong router_node set.

Comment 17 Andrew Beekhof 2018-11-01 08:22:09 UTC
Seems the logic in get_router_node() is insufficient

Comment 18 Andrew Beekhof 2018-11-01 11:09:10 UTC
Fixed in https://github.com/ClusterLabs/pacemaker/commit/a21fd00

Comment 19 Ken Gaillot 2018-11-01 19:31:42 UTC
The difference from 7.5 to 7.6 is a side effect of unrelated bug fixes improving fail-safe checking of faked results. In 7.5 in this situation, the faked result would be unconditionally processed. In 7.6, we first check whether the local daemon has resource info for the result being faked; since the wrong node is being consulted, that info doesn't exist, and the fake result is not processed.

It's debatable whether the new approach is better, but I tend to think highlighting problems like this is better than papering them over. Of course I wish it had turned up in testing, but we'll use this case as a regression test going forward.

Note that even in 7.5, the notify action is not being called when it should be; instead, the wrong router node is successfully faking the result. So, I'll try to get this into 7.5.z as well as 7.6.z.

Comment 22 Ken Gaillot 2018-11-06 01:17:25 UTC
This has been fixed in the upstream 1.1 branch by commit 2ba3fffc (backport of master branch commit a21fd00)
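
To check which upstream releases carry these commits, one option is to query the tags in a local clone of the upstream repository (a sketch; commit IDs abbreviated as above):

git clone https://github.com/ClusterLabs/pacemaker.git
cd pacemaker
git tag --contains 2ba3fffc    # release tags that include the 1.1-branch backport
git tag --contains a21fd00     # release tags that include the original master-branch fix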

Comment 26 Ken Gaillot 2019-02-04 20:04:21 UTC
QA: To reproduce, configure a cluster with at least three cluster nodes, and a guest node and/or bundle. Run "pcs cluster stop" on one of the cluster nodes that is running the guest and/or bundle. (I expect the problem would also appear if you ban the guest node and/or bundle from one of the cluster nodes running it.)

Before the fix, the operation will not complete, and the node will eventually force-exit after a timeout. After the fix, shutdown proceeds normally.
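
Sketched as commands against a deployment like the ones above (hypothetical node and resource names; any cluster node hosting a guest node or bundle replica will do):

# Find a cluster node currently hosting a bundle replica / guest node
sudo pcs status

# On that node, time the shutdown; before the fix this hangs until the force-exit
# timeout (~20 minutes above), after the fix it returns within seconds
time sudo pcs cluster stop

# Per the note above, banning the bundle or guest node from the node running it
# may also trigger the connection move, e.g.:
sudo pcs resource ban redis-bundle controller-1

# Rejoin the node to the cluster afterwards
sudo pcs cluster start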

Comment 27 pkomarov 2019-02-25 07:02:46 UTC
Verified. Initial state:

(undercloud) [stack@undercloud-0 ~]$ ansible controller -m shell -b -a 'rpm -qa|grep pacemaker'
 [WARNING]: Found both group and host with same name: undercloud

 [WARNING]: Consider using the yum, dnf or zypper module rather than running rpm.  If you need to use command because
yum, dnf or zypper is insufficient you can add warn=False to this command task or set command_warnings=False in
ansible.cfg to get rid of this message.

controller-0 | SUCCESS | rc=0 >>
pacemaker-cli-1.1.20-1.el7.x86_64
pacemaker-remote-1.1.20-1.el7.x86_64
pacemaker-1.1.20-1.el7.x86_64
pacemaker-cluster-libs-1.1.20-1.el7.x86_64
puppet-pacemaker-0.7.2-0.20181008172520.9a4bc2d.el7ost.noarch
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
pacemaker-libs-1.1.20-1.el7.x86_64

controller-2 | SUCCESS | rc=0 >>
pacemaker-cli-1.1.20-1.el7.x86_64
pacemaker-remote-1.1.20-1.el7.x86_64
pacemaker-1.1.20-1.el7.x86_64
pacemaker-cluster-libs-1.1.20-1.el7.x86_64
puppet-pacemaker-0.7.2-0.20181008172520.9a4bc2d.el7ost.noarch
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
pacemaker-libs-1.1.20-1.el7.x86_64

controller-1 | SUCCESS | rc=0 >>
pacemaker-cli-1.1.20-1.el7.x86_64
pacemaker-remote-1.1.20-1.el7.x86_64
pacemaker-1.1.20-1.el7.x86_64
pacemaker-cluster-libs-1.1.20-1.el7.x86_64
puppet-pacemaker-0.7.2-0.20181008172520.9a4bc2d.el7ost.noarch
ansible-pacemaker-1.0.4-0.20180827141254.0e4d7c0.el7ost.noarch
pacemaker-libs-1.1.20-1.el7.x86_64



(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -m shell -b -a 'pcs status'

controller-1 | SUCCESS | rc=0 >>
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.20-1.el7-1642a7f847) - partition with quorum
Last updated: Mon Feb 25 06:57:02 2019
Last change: Mon Feb 25 06:52:20 2019 by hacluster via crmd on controller-2

12 nodes configured
37 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-2 galera-bundle-1@controller-0 galera-bundle-2@controller-1 rabbitmq-bundle-0@controller-2 rabbitmq-bundle-1@controller-0 rabbitmq-bundle-2@controller-1 redis-bundle-0@controller-2 redis-bundle-1@controller-0 redis-bundle-2@controller-1 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Started controller-2
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Started controller-0
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Started controller-1
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Master controller-2
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-0
   galera-bundle-2	(ocf::heartbeat:galera):	Master controller-1
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0	(ocf::heartbeat:redis):	Slave controller-2
   redis-bundle-1	(ocf::heartbeat:redis):	Master controller-0
   redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-1
 ip-192.168.24.14	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-10.0.0.101	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.1.12	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.1.21	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.3.23	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.4.30	(ocf::heartbeat:IPaddr2):	Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Started controller-2
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Started controller-0
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Started controller-1
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0	(ocf::heartbeat:docker):	Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -m shell -b -a 'cat /etc/*release*'
 [WARNING]: Found both group and host with same name: undercloud

controller-1 | SUCCESS | rc=0 >>
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.6"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.6 (Maipo)"

# Check the time of 'pcs cluster stop':

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -m shell -b -a 'time pcs cluster stop '
 [WARNING]: Found both group and host with same name: undercloud

controller-1 | SUCCESS | rc=0 >>
Stopping Cluster (pacemaker)...
Stopping Cluster (corosync)...
real	0m44.985s
user	0m0.200s
sys	0m0.123s

(undercloud) [stack@undercloud-0 ~]$ ansible controller-2 -m shell -b -a 'pcs status'
 [WARNING]: Found both group and host with same name: undercloud

controller-2 | SUCCESS | rc=0 >>
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.20-1.el7-1642a7f847) - partition with quorum
Last updated: Mon Feb 25 07:01:34 2019
Last change: Mon Feb 25 06:52:20 2019 by hacluster via crmd on controller-2

12 nodes configured
37 resources configured

Online: [ controller-0 controller-2 ]
OFFLINE: [ controller-1 ]
GuestOnline: [ galera-bundle-0@controller-2 galera-bundle-1@controller-0 rabbitmq-bundle-0@controller-2 rabbitmq-bundle-1@controller-0 redis-bundle-0@controller-2 redis-bundle-1@controller-0 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Started controller-2
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Started controller-0
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Stopped
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Master controller-2
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-0
   galera-bundle-2	(ocf::heartbeat:galera):	Stopped
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0	(ocf::heartbeat:redis):	Slave controller-2
   redis-bundle-1	(ocf::heartbeat:redis):	Master controller-0
   redis-bundle-2	(ocf::heartbeat:redis):	Stopped
 ip-192.168.24.14	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-10.0.0.101	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.1.12	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.1.21	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.3.23	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.4.30	(ocf::heartbeat:IPaddr2):	Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Started controller-2
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Started controller-0
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Stopped
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0	(ocf::heartbeat:docker):	Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -m shell -b -a 'time pcs cluster start '
 [WARNING]: Found both group and host with same name: undercloud

controller-1 | SUCCESS | rc=0 >>
Starting Cluster (corosync)...
Starting Cluster (pacemaker)...
real	0m1.052s
user	0m0.198s
sys	0m0.064s

Comment 29 errata-xmlrpc 2019-08-06 12:53:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2129

