Cause:
Communication between pcsd components times out if there is no response in 30 seconds.
Consequence:
Actions performed by pcsd that take longer than 30 seconds never succeed.
Fix:
Remove the timeout for internal pcsd communication.
Result:
Actions taking a long time work properly.
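The cause/fix pair above can be illustrated with a minimal Python sketch. This is not the pcsd code; the names `long_action` and `call_with_timeout` are hypothetical stand-ins for an internal pcsd request and its timeout handling:

```python
# Hypothetical sketch (not pcsd code): a fixed timeout on internal
# communication makes any action longer than the timeout report failure,
# even though the action itself eventually completes; removing the
# timeout (timeout=None) lets the caller wait for the real result.
import concurrent.futures
import time

def long_action(duration):
    """Stand-in for an operation such as stopping a cluster node."""
    time.sleep(duration)
    return "done"

def call_with_timeout(duration, timeout):
    """Forward the action and wait at most `timeout` seconds (None = no limit)."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(long_action, duration)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            return "error: timed out"

# Fixed timeout shorter than the action: caller reports failure.
print(call_with_timeout(0.2, timeout=0.05))
# No timeout (the fix): caller waits and gets the real result.
print(call_with_timeout(0.2, timeout=None))
```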
This seems to be a pacemaker/pcs thing - here is what happens on 16.1 / RHEL 8.2:
[root@overcloud-controller-0 ~]# pcs cluster stop --all
overcloud-controller-2: Error connecting to overcloud-controller-2 - (HTTP error: 500)
overcloud-controller-1: Error connecting to overcloud-controller-1 - (HTTP error: 500)
overcloud-controller-0: Error connecting to overcloud-controller-0 - (HTTP error: 500)
Error: unable to stop all nodes
overcloud-controller-2: Error connecting to overcloud-controller-2 - (HTTP error: 500)
overcloud-controller-1: Error connecting to overcloud-controller-1 - (HTTP error: 500)
overcloud-controller-0: Error connecting to overcloud-controller-0 - (HTTP error: 500)
[root@overcloud-controller-0 ~]#
Despite the errors, the cluster gets stopped correctly a few seconds later:
[root@overcloud-controller-0 ~]# pcs status
Error: error running crm_mon, is pacemaker running?
Error: cluster is not available on this node
stop [--all | <node>... ] [--request-timeout=<seconds>]
Stop a cluster on specified node(s). If no nodes are specified then
stop a cluster on the local node. If --all is specified then stop
a cluster on all nodes. If the cluster is running resources which take
long time to stop then the stop request may time out before the cluster
actually stops. In that case you should consider setting
--request-timeout to a suitable value.
I've tried passing --request-timeout=120, but that didn't help.
pcs-0.10.4-6.el8.x86_64
pacemaker-remote-2.0.3-5.el8.x86_64
pacemaker-cluster-libs-2.0.3-5.el8.x86_64
pacemaker-cli-2.0.3-5.el8.x86_64
ansible-pacemaker-1.0.4-0.20200324105818.5847167.el8ost.noarch
pacemaker-libs-2.0.3-5.el8.x86_64
puppet-pacemaker-0.8.1-0.20200428133428.d501b27.el8ost.noarch
pacemaker-2.0.3-5.el8.x86_64
We had a look with Luca this morning and we think this is an error in pcs shipped in RHEL8.2 (pcs-0.10.4-6.el8.x86_64). This does not reproduce with pcs shipped in RHEL8.1 (pcs-0.10.2-4.el8.x86_64). We are not sure yet if it's a default configuration change or if it's a regression.
When running "pcs cluster stop --all", pcs internally connects to the pcsd servers on all nodes to forward the stop command. This logs an unexpected error on the command line...:
overcloud-controller-2: Error connecting to overcloud-controller-2 - (HTTP error: 500)
overcloud-controller-1: Error connecting to overcloud-controller-1 - (HTTP error: 500)
overcloud-controller-0: Error connecting to overcloud-controller-0 - (HTTP error: 500)
... and the real error is detailed in the pcsd log:
E, [2020-05-12T06:46:44.103 #00000] ERROR -- : Cannot connect to ruby daemon (message: 'HTTP 599: Empty reply from server'). Is it running?
E, [2020-05-12T06:46:44.104 #00000] ERROR -- : 500 POST /remote/cluster_stop (172.17.0.42) 30002.90ms
When downgrading all cluster nodes to pcs 0.10.2-4, the error does not reproduce.
Note: We also observe that stopping a cluster locally with "pcs cluster stop" does not produce the error, so regular minor update operations are not impacted.
We're going to try to reproduce on an empty pacemaker cluster, and if successful we'll file a RHEL bug for tracking the issue.
I confirm this is a pcs/pcsd issue and the two lines from pcsd.log describe it.
It cannot happen with RHEL 8.1 packages, because the ruby daemon in question has been added in 8.2 packages (bz1783106). Also, it cannot happen when stopping a cluster locally with "pcs cluster stop", because there is no pcsd involved (as long as the pcs command is run by root).
It is expected that increasing --request-timeout has no effect in this case: connections to pcsd running on the cluster nodes were established successfully, and responses were received in less than a minute, which is the default timeout. Connections between the two local pcs daemons (python to ruby) are not governed by the --request-timeout value.
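Why raising --request-timeout cannot help can be sketched as two chained hops, each with its own timeout. This is a hypothetical illustration, not the pcsd implementation; `INNER_TIMEOUT`, `ruby_daemon_work`, and the other names are invented for the sketch:

```python
# Hypothetical sketch: --request-timeout only bounds the outer
# (node-to-node) hop. The inner python->ruby hop has its own fixed
# timeout, so the inner hop fails no matter how large the outer
# timeout is set.
import concurrent.futures
import time

INNER_TIMEOUT = 0.05  # stands in for the fixed 30 s python->ruby timeout

def ruby_daemon_work():
    time.sleep(0.2)  # stopping the cluster takes longer than INNER_TIMEOUT
    return "cluster stopped"

def python_daemon_handler():
    # Inner hop: forwarded to the ruby daemon with a hardcoded timeout.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        try:
            return pool.submit(ruby_daemon_work).result(timeout=INNER_TIMEOUT)
        except concurrent.futures.TimeoutError:
            return "HTTP 500"  # what the remote caller sees

def pcs_cluster_stop(request_timeout):
    # Outer hop: bounded by --request-timeout, which is plenty here.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(python_daemon_handler).result(timeout=request_timeout)

# The same failure regardless of the outer timeout value:
print(pcs_cluster_stop(request_timeout=120))
```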
So far, I haven't been able to reproduce this. We will investigate this further. Does this happen all the time or just randomly here and there?
We have a reliable reproducer. This issue happens every time when stopping a cluster on a node takes more than 30 seconds.
Reproducer:
# pcs resource create test delay startdelay=1 stopdelay=35
# pcs cluster stop --all
Fixed reproducer:
# pcs resource create test delay startdelay=1 stopdelay=35 op stop timeout=40
# pcs cluster stop --all
The timeout must be longer than the stopdelay, otherwise the node with the resource gets fenced (which is a correct course of action, since the resource didn't stop in time).
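The rule behind the fixed reproducer can be stated as a one-line check. A trivial hypothetical sketch (`stop_outcome` is not a pacemaker function), assuming the behavior described above:

```python
# Hypothetical sketch of the reproducer rule: the stop operation timeout
# must exceed the resource's stopdelay, otherwise pacemaker treats the
# stop as failed and fences the node.
def stop_outcome(stopdelay, stop_timeout):
    if stop_timeout > stopdelay:
        return "stopped cleanly"
    return "stop timed out -> node fenced"

print(stop_outcome(35, 40))  # op stop timeout=40 > stopdelay=35
print(stop_outcome(35, 20))  # timeout too short for the resource
```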
Created attachment 1698193 [details]
bzaf_auto_verification_output_06202020064418
Spec executed successfully.
Verifying bug as VERIFIED.
Verification output is attached to the comment.
Generated by bzaf 0.0.1.dev49
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (pcs bug fix and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2020:4617