Bug 1834696

Summary: new pcs' shorter operation timeout can make global cluster stop fail
Product: Red Hat Enterprise Linux 8
Component: pcs
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Reporter: Damien Ciabrini <dciabrin>
Assignee: Tomas Jelinek <tojeline>
QA Contact: cluster-qe <cluster-qe>
CC: cfeist, cluster-maint, idevat, mlisik, mpospisi, omular, tojeline
Target Milestone: rc
Target Release: 8.0
Flags: pm-rhel: mirror+
Type: Bug
Last Closed: 2020-05-12 10:50:54 UTC

Description Damien Ciabrini 2020-05-12 09:13:14 UTC
Description of problem:
As reported in [1], when we try to stop our entire OpenStack cluster
with "pcs cluster stop --all" using pcs 0.10.4 on RHEL 8.2, the cluster
is stopped but pcs returns an error at the command line:

[root@rhel1 ~]# pcs cluster stop --all
rhel2: Stopping Cluster (pacemaker)...
rhel1: Error connecting to rhel1 - (HTTP error: 500)
rhel3: Error connecting to rhel3 - (HTTP error: 500)
Error: unable to stop all nodes
rhel1: Error connecting to rhel1 - (HTTP error: 500)
rhel3: Error connecting to rhel3 - (HTTP error: 500)
rhel1: Not stopping cluster - node is unreachable
rhel3: Not stopping cluster - node is unreachable
rhel2: Stopping Cluster (corosync)...
Error: unable to stop all nodes
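
To see what is actually left running on each node after the failed global
stop, the service state can be checked directly. A quick sketch, assuming
ssh access to the nodes of this reproducer:

    for node in rhel1 rhel2 rhel3; do
        echo "== $node"
        ssh "$node" 'systemctl is-active pacemaker corosync'
    done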

The HTTP error is detailed in the pcsd log:

[root@rhel1 ~]# tail -f /var/log/pcsd/pcsd.log 
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : SRWT Node: rhel3 Request: get_configs
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : SRWT Node: rhel2 Request: get_configs
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : Connecting to: https://rhel3:2224/remote/get_configs?cluster_name=ratester
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : Connecting to: https://rhel2:2224/remote/get_configs?cluster_name=ratester
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : Connecting to: https://rhel1:2224/remote/get_configs?cluster_name=ratester
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : Config files sync finished
I, [2020-05-12T04:50:37.364 #00000]     INFO -- : 200 GET /remote/get_configs?cluster_name=ratester (172.22.38.234) 3.27ms
I, [2020-05-12T04:50:55.454 #00000]     INFO -- : 200 GET /remote/get_configs?cluster_name=ratester (172.22.38.235) 3.44ms
E, [2020-05-12T04:51:18.333 #00000]    ERROR -- : Cannot connect to ruby daemon (message: 'HTTP 599: Empty reply from server'). Is it running?
E, [2020-05-12T04:51:18.335 #00000]    ERROR -- : 500 POST /remote/cluster_stop (172.22.38.233) 30004.61ms

It looks like pcsd returns an error whenever the cluster stop operation takes more than 30s.
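
The ~30s cutoff is easy to pull out of the pcsd log shown above (same log
path as in the transcript):

    # The POST to /remote/cluster_stop fails after roughly 30000 ms, right
    # after the "Empty reply from server" error from the ruby daemon.
    grep -E 'cluster_stop|Empty reply' /var/log/pcsd/pcsd.log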

This is a change compared to the previous behaviour in e.g. RHEL 8.1, and it prevents us from stopping our cluster (which is fairly big) globally in a single command.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1833506

Version-Release number of selected component (if applicable):
pcs-0.10.4-6.el8.x86_64


How reproducible:
Always


Steps to Reproduce:
1. create an empty cluster <rhel1,rhel2,rhel3>:

    pcs cluster setup ratester --force rhel1 rhel2 rhel3
    pcs cluster start --all
    pcs property set stonith-enabled=false

2. create a dummy resource and force its stop timeout to be 100s

    pcs resource create dummy ocf:pacemaker:Dummy --disabled
    pcs resource update dummy meta op stop timeout=100s
    pcs resource enable dummy

3. change file /usr/lib/ocf/resource.d/pacemaker/Dummy on all nodes to add a 40s delay in the stop operation (a one-liner for applying this on every node is sketched after these steps):
    dummy_stop() {
        dummy_monitor --force
>>>        sleep 40

4. try to stop the cluster globally

    pcs cluster stop --all
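
For step 3, a rough one-liner for applying the same edit on every node. This
assumes the 'dummy_monitor --force' call shown above appears only in
dummy_stop in your resource-agents version (check first) and relies on GNU
sed; indentation of the added line does not matter to the shell:

    # Append "sleep 40" after the dummy_monitor call in dummy_stop on all nodes.
    for node in rhel1 rhel2 rhel3; do
        ssh "$node" "sed -i '/dummy_monitor --force/a sleep 40' /usr/lib/ocf/resource.d/pacemaker/Dummy"
    done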


Actual results:
The global cluster stop operation finishes with an error due to a timeout in pcsd


Expected results:
Cluster stop should finish successfully
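
Until this is sorted out, one way to stop everything without going through
pcsd's /remote/cluster_stop handler at all is to stop the services directly
over ssh. A rough sketch only, not validated against the fix being discussed
in bug 1833506 (node names from the reproducer above):

    # Stop pacemaker on all nodes in parallel so resources are not shuffled
    # around while nodes leave one by one, then stop corosync everywhere.
    for node in rhel1 rhel2 rhel3; do
        ssh "$node" 'systemctl stop pacemaker' &
    done
    wait
    for node in rhel1 rhel2 rhel3; do
        ssh "$node" 'systemctl stop corosync'
    done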

Comment 1 Tomas Jelinek 2020-05-12 10:50:54 UTC
While I was working on and commenting on the original bz1833506, Damien created this one. Since the discussion is happening in the original bz, I'm closing this one as a duplicate.

*** This bug has been marked as a duplicate of bug 1833506 ***