Bug 1833506

Summary: pcs cluster stop --all throws errors and doesn't seem to honor the request-timeout option
Product: Red Hat Enterprise Linux 8
Reporter: Pavan <pkesavar>
Component: pcs
Assignee: Tomas Jelinek <tojeline>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Priority: urgent
Version: 8.2
CC: aschultz, cfeist, cluster-maint, dciabrin, idevat, jbadiapa, jschluet, lmiccini, mlisik, mpospisi, nhostako, omular, pkomarov, tojeline
Target Milestone: rc
Keywords: Regression, ZStream
Target Release: 8.2
Flags: pm-rhel: mirror+
Hardware: x86_64
OS: Linux
Fixed In Version: pcs-0.10.6-1.el8
Doc Type: Bug Fix
Doc Text:
  Cause: Communication between pcsd components times out if there is no response in 30 seconds.
  Consequence: Actions performed by pcsd taking longer than 30 seconds never succeed.
  Fix: Remove the timeout for internal pcsd communication.
  Result: Actions taking a long time work properly.
Clones: 1838084 (view as bug list)
Last Closed: 2020-11-04 02:28:18 UTC
Type: Bug
Bug Blocks: 1837038, 1838084
Attachments:
  proposed fix
  bzaf_auto_verification_output_06202020064418

Comment 3 Luca Miccini 2020-05-11 15:09:56 UTC
Seems to be a pacemaker/pcs thing - here is what happens on 16.1 / RHEL 8.2:

[root@overcloud-controller-0 ~]# pcs cluster stop --all
overcloud-controller-2: Error connecting to overcloud-controller-2 - (HTTP error: 500)
overcloud-controller-1: Error connecting to overcloud-controller-1 - (HTTP error: 500)
overcloud-controller-0: Error connecting to overcloud-controller-0 - (HTTP error: 500)
Error: unable to stop all nodes
overcloud-controller-2: Error connecting to overcloud-controller-2 - (HTTP error: 500)
overcloud-controller-1: Error connecting to overcloud-controller-1 - (HTTP error: 500)
overcloud-controller-0: Error connecting to overcloud-controller-0 - (HTTP error: 500)
[root@overcloud-controller-0 ~]# 

Despite the errors, the cluster gets stopped correctly a few seconds later:

[root@overcloud-controller-0 ~]# pcs status
Error: error running crm_mon, is pacemaker running?
  Error: cluster is not available on this node



    stop [--all | <node>... ] [--request-timeout=<seconds>]
        Stop a cluster on specified node(s). If no nodes are specified then
        stop a cluster on the local node. If --all is specified then stop
        a cluster on all nodes. If the cluster is running resources which take
        long time to stop then the stop request may time out before the cluster
        actually stops. In that case you should consider setting
        --request-timeout to a suitable value.

I've tried passing --request-timeout=120, but that didn't help.


pcs-0.10.4-6.el8.x86_64
pacemaker-remote-2.0.3-5.el8.x86_64
pacemaker-cluster-libs-2.0.3-5.el8.x86_64
pacemaker-cli-2.0.3-5.el8.x86_64
ansible-pacemaker-1.0.4-0.20200324105818.5847167.el8ost.noarch
pacemaker-libs-2.0.3-5.el8.x86_64
puppet-pacemaker-0.8.1-0.20200428133428.d501b27.el8ost.noarch
pacemaker-2.0.3-5.el8.x86_64

Comment 4 Damien Ciabrini 2020-05-12 07:16:26 UTC
We had a look with Luca this morning and we think this is an error in pcs shipped in RHEL8.2 (pcs-0.10.4-6.el8.x86_64). This does not reproduce with pcs shipped in RHEL8.1 (pcs-0.10.2-4.el8.x86_64). We are not sure yet if it's a default configuration change or if it's a regression. 

When running "pcs cluster stop --all", pcs internally connects to the pcsd servers on all nodes to forward the stop command. This logs an unexpected error at the command line...:
overcloud-controller-2: Error connecting to overcloud-controller-2 - (HTTP error: 500)
overcloud-controller-1: Error connecting to overcloud-controller-1 - (HTTP error: 500)
overcloud-controller-0: Error connecting to overcloud-controller-0 - (HTTP error: 500)

... and the real error is detailed in the pcsd log:
E, [2020-05-12T06:46:44.103 #00000]    ERROR -- : Cannot connect to ruby daemon (message: 'HTTP 599: Empty reply from server'). Is it running?                              
E, [2020-05-12T06:46:44.104 #00000]    ERROR -- : 500 POST /remote/cluster_stop (172.17.0.42) 30002.90ms                                                                    


When downgrading all cluster nodes to pcs 0.10.2-4, the same error doesn't reproduce.

Note: We also observe that stopping a cluster locally with "pcs cluster stop" does not produce the error, so regular minor update operations are not impacted.


We're going to try to reproduce on an empty pacemaker cluster, and if successful we'll file a RHEL bug for tracking the issue.
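
A minimal sketch of this failure mode (hypothetical names and scaled-down timings, not pcsd code): one local daemon forwards the request to another local daemon but gives up after a fixed timeout, so an action that would eventually succeed is reported back as a 500, matching the ~30002 ms entry in the log above.

# Hypothetical sketch, not pcsd code: a fixed timeout on an internal hop
# turns a slow but otherwise successful operation into an error for the caller.
import concurrent.futures
import time

INTERNAL_TIMEOUT = 3  # stands in for the real 30-second internal limit


def inner_daemon_cluster_stop(stop_duration):
    # stand-in for the local daemon that actually performs the cluster stop
    time.sleep(stop_duration)
    return 200, "cluster stopped"


def outer_daemon_proxy(stop_duration):
    # stand-in for the daemon that forwards the request and enforces the limit
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(inner_daemon_cluster_stop, stop_duration)
    try:
        result = future.result(timeout=INTERNAL_TIMEOUT)
    except concurrent.futures.TimeoutError:
        # roughly what surfaces as "HTTP 599" internally and
        # "Error connecting to <node> - (HTTP error: 500)" at the command line
        result = 500, "Error connecting to node"
    pool.shutdown(wait=False)
    return result


print(outer_daemon_proxy(stop_duration=1))  # (200, 'cluster stopped')
print(outer_daemon_proxy(stop_duration=5))  # (500, ...) after ~3 seconds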

Comment 5 Tomas Jelinek 2020-05-12 08:26:03 UTC
I confirm this is a pcs/pcsd issue and the two lines from pcsd.log describe it.

It cannot happen with RHEL 8.1 packages, because the ruby daemon in question was only added in the 8.2 packages (bz1783106). Also, it cannot happen when stopping a cluster locally with "pcs cluster stop", because there is no pcsd involved (as long as the pcs command is run by root).

It is expected that increasing --request-timeout has no effect in this case: connections to pcsd running on the cluster nodes were established successfully and responses were received in less than a minute, which is the default timeout. Connections between the two local pcs daemons (python to ruby) are not affected by the --request-timeout value.

So far, I haven't been able to reproduce this. We will investigate this further. Does this happen all the time or just randomly here and there?
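
To put numbers on the two layers described above, a tiny sketch (hypothetical constants, not pcs code) of why raising --request-timeout cannot help: it only bounds the outer node-to-node request, while the internal python-to-ruby hop has its own fixed limit, so the smaller of the two always wins.

# Hypothetical sketch, not pcs code: the effective time budget for a
# long-running action is the smaller of the two timeouts involved.
OUTER_DEFAULT = 60  # --request-timeout default for node-to-node requests (seconds)
INNER_FIXED = 30    # fixed internal python -> ruby timeout


def effective_limit(request_timeout=OUTER_DEFAULT):
    # --request-timeout only changes the outer bound
    return min(request_timeout, INNER_FIXED)


print(effective_limit())                     # 30
print(effective_limit(request_timeout=120))  # still 30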

Comment 6 Tomas Jelinek 2020-05-12 08:40:06 UTC
We have a reliable reproducer. This issue happens every time stopping a cluster on a node takes more than 30 seconds.

Reproducer:
# pcs resource create test delay startdelay=1 stopdelay=35
# pcs cluster stop --all

Comment 7 Tomas Jelinek 2020-05-12 09:00:37 UTC
Fixed reproducer:
# pcs resource create test delay startdelay=1 stopdelay=35 op stop timeout=40
# pcs cluster stop --all

The timeout must be longer than the stopdelay, otherwise the node with the resource gets fenced (which is a correct course of action, since the resource didn't stop in time).
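
The rule in the last paragraph as a toy function (hypothetical, not pacemaker logic): the stop completes cleanly only while stopdelay stays within the stop operation's timeout; otherwise the failed stop leads to fencing.

# Hypothetical sketch of the rule above, not pacemaker code.
def stop_outcome(stopdelay, op_stop_timeout):
    if stopdelay <= op_stop_timeout:
        return "resource stops cleanly"
    # the stop action exceeds its timeout, so the stop is treated as failed
    return "stop times out -> node gets fenced"


print(stop_outcome(stopdelay=35, op_stop_timeout=40))  # reproducer above: clean stop
print(stop_outcome(stopdelay=35, op_stop_timeout=20))  # timeout too short: fencing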

Comment 8 Tomas Jelinek 2020-05-12 10:50:54 UTC
*** Bug 1834696 has been marked as a duplicate of this bug. ***

Comment 9 Tomas Jelinek 2020-05-14 15:09:29 UTC
Created attachment 1688511 [details]
proposed fix

Reproducer / test in comment 7
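
The attached patch itself is not reproduced in this report; as a rough sketch of the direction described in the Doc Text (drop the fixed timeout on internal pcsd communication; hypothetical names only, not the actual change):

# Hypothetical sketch of the fix direction, not the actual patch:
# the internal hop no longer enforces a hard 30-second limit, so the outer
# daemon simply waits for the inner daemon to finish and relays its result.
import time


def inner_daemon_cluster_stop(stop_duration):
    time.sleep(stop_duration)
    return 200, "cluster stopped"


def outer_daemon_proxy(stop_duration):
    # before the fix: give up after 30 seconds and return 500
    # after the fix (sketched here): no client-side timeout on this hop
    return inner_daemon_cluster_stop(stop_duration)


print(outer_daemon_proxy(stop_duration=5))  # (200, 'cluster stopped'), however long it takes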

Comment 12 Michele Baldessari 2020-05-18 18:43:22 UTC
*** Bug 1837038 has been marked as a duplicate of this bug. ***

Comment 17 Miroslav Lisik 2020-06-11 14:22:40 UTC
Test:

[root@r8-node-01 pcs]# rpm -q pcs
pcs-0.10.6-1.el8.x86_64

[root@r8-node-01 pcs]# pcs resource create test ocf:heartbeat:Delay startdelay=1 stopdelay=35 op stop timeout=40
[root@r8-node-01 pcs]# pcs resource
  * test        (ocf::heartbeat:Delay): Started r8-node-01
[root@r8-node-01 pcs]# time pcs cluster stop --all
r8-node-02: Stopping Cluster (pacemaker)...
r8-node-01: Stopping Cluster (pacemaker)...
r8-node-02: Stopping Cluster (corosync)...
r8-node-01: Stopping Cluster (corosync)...

real    0m39.025s
user    0m0.575s
sys     0m0.065s

Comment 22 pkomarov 2020-06-20 10:44:21 UTC
Created attachment 1698193 [details]
bzaf_auto_verification_output_06202020064418

Spec executed successfully
Verifying bug as VERIFIED
Verification output is attached to this comment

Generated by bzaf 0.0.1.dev49

Comment 25 errata-xmlrpc 2020-11-04 02:28:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pcs bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4617