Bug 1834696

Summary: new pcs' shorter operation timeout can make global cluster stop fail
Product: Red Hat Enterprise Linux 8
Component: pcs
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Reporter: Damien Ciabrini <dciabrin>
Assignee: Tomas Jelinek <tojeline>
QA Contact: cluster-qe <cluster-qe>
CC: cfeist, cluster-maint, idevat, mlisik, mpospisi, omular, tojeline
Target Milestone: rc
Target Release: 8.0
Flags: pm-rhel: mirror+
Type: Bug
Last Closed: 2020-05-12 10:50:54 UTC

Description Damien Ciabrini 2020-05-12 09:13:14 UTC
Description of problem:
As reported in [1], when we try to stop our entire OpenStack cluster
with "pcs cluster stop --all" using pcs 0.10.4 on RHEL 8.2, the cluster
is stopped but pcs returns an error at the command line:

[root@rhel1 ~]# pcs cluster stop --all
rhel2: Stopping Cluster (pacemaker)...
rhel1: Error connecting to rhel1 - (HTTP error: 500)
rhel3: Error connecting to rhel3 - (HTTP error: 500)
Error: unable to stop all nodes
rhel1: Error connecting to rhel1 - (HTTP error: 500)
rhel3: Error connecting to rhel3 - (HTTP error: 500)
rhel1: Not stopping cluster - node is unreachable
rhel3: Not stopping cluster - node is unreachable
rhel2: Stopping Cluster (corosync)...
Error: unable to stop all nodes
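
To see what is actually left running on each node after the failed global
stop, the service state can be checked directly. A quick sketch, assuming
ssh access to the nodes of this reproducer:

    for node in rhel1 rhel2 rhel3; do
        echo "== $node"
        ssh "$node" 'systemctl is-active pacemaker corosync'
    done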

The HTTP error is detailed in the pcsd log:

[root@rhel1 ~]# tail -f /var/log/pcsd/pcsd.log 
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : SRWT Node: rhel3 Request: get_configs
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : SRWT Node: rhel2 Request: get_configs
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : Connecting to: https://rhel3:2224/remote/get_configs?cluster_name=ratester
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : Connecting to: https://rhel2:2224/remote/get_configs?cluster_name=ratester
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : Connecting to: https://rhel1:2224/remote/get_configs?cluster_name=ratester
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : Config files sync finished
I, [2020-05-12T04:50:37.364 #00000]     INFO -- : 200 GET /remote/get_configs?cluster_name=ratester (172.22.38.234) 3.27ms
I, [2020-05-12T04:50:55.454 #00000]     INFO -- : 200 GET /remote/get_configs?cluster_name=ratester (172.22.38.235) 3.44ms
E, [2020-05-12T04:51:18.333 #00000]    ERROR -- : Cannot connect to ruby daemon (message: 'HTTP 599: Empty reply from server'). Is it running?
E, [2020-05-12T04:51:18.335 #00000]    ERROR -- : 500 POST /remote/cluster_stop (172.22.38.233) 30004.61ms

It looks like pcsd returns an error whenever the cluster stop operation takes more than 30s.
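
The ~30s cutoff is easy to pull out of the pcsd log shown above (same log
path as in the transcript):

    # The POST to /remote/cluster_stop fails after roughly 30000 ms, right
    # after the "Empty reply from server" error from the ruby daemon.
    grep -E 'cluster_stop|Empty reply' /var/log/pcsd/pcsd.log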

This is a change compared to the previous behaviour in e.g. RHEL 8.1, and it prevents us from stopping our cluster (which is fairly big) globally in a single command.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1833506

Version-Release number of selected component (if applicable):
pcs-0.10.4-6.el8.x86_64


How reproducible:
Always


Steps to Reproduce:
1. create an empty cluster <rhel1,rhel2,rhel3>:

    pcs cluster setup ratester --force rhel1 rhel2 rhel3
    pcs cluster start --all
    pcs property set stonith-enabled=false

2. create a dummy resource and force its stop timeout to be 100s

    pcs resource create dummy ocf:pacemaker:Dummy --disabled
    pcs resource update dummy meta op stop timeout=100s
    pcs resource enable dummy

3. change file /usr/lib/ocf/resource.d/pacemaker/Dummy on all nodes to add a 40s delay in the stop operation (a one-liner for applying this on every node is sketched after these steps):
    dummy_stop() {
        dummy_monitor --force
>>>        sleep 40

4. try to stop the cluster globally

    pcs cluster stop --all
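
For step 3, a rough one-liner for applying the same edit on every node. This
assumes the 'dummy_monitor --force' call shown above appears only in
dummy_stop in your resource-agents version (check first) and relies on GNU
sed; indentation of the added line does not matter to the shell:

    # Append "sleep 40" after the dummy_monitor call in dummy_stop on all nodes.
    for node in rhel1 rhel2 rhel3; do
        ssh "$node" "sed -i '/dummy_monitor --force/a sleep 40' /usr/lib/ocf/resource.d/pacemaker/Dummy"
    done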


Actual results:
The global cluster stop operation finishes with an error due to a timeout in pcsd


Expected results:
Cluster stop should finish successfully
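
Until this is sorted out, one way to stop everything without going through
pcsd's /remote/cluster_stop handler at all is to stop the services directly
over ssh. A rough sketch only, not validated against the fix being discussed
in bug 1833506 (node names from the reproducer above):

    # Stop pacemaker on all nodes in parallel so resources are not shuffled
    # around while nodes leave one by one, then stop corosync everywhere.
    for node in rhel1 rhel2 rhel3; do
        ssh "$node" 'systemctl stop pacemaker' &
    done
    wait
    for node in rhel1 rhel2 rhel3; do
        ssh "$node" 'systemctl stop corosync'
    done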

Comment 1 Tomas Jelinek 2020-05-12 10:50:54 UTC
While I was working on and commenting on the original bz1833506, Damien created this one. Since the discussion is happening in the original bz, I'm closing this one as a duplicate.

*** This bug has been marked as a duplicate of bug 1833506 ***