1834696 – new pcs' shorter operation timeout can make global cluster stop fail


Bug 1834696 - new pcs' shorter operation timeout can make global cluster stop fail

Keywords:
Status:	CLOSED DUPLICATE of bug 1833506
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	pcs
Sub Component:
Version:	8.2
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	8.0
Assignee:	Tomas Jelinek
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-05-12 09:13 UTC by Damien Ciabrini
Modified:	2022-10-11 15:54 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-05-12 10:50:54 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:
Flags:	pm-rhel: mirror+

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHELPLAN-42973	0	None	None	None	2022-10-11 15:54:00 UTC

Description Damien Ciabrini 2020-05-12 09:13:14 UTC

Description of problem:
As reported in [1], when we try to stop our entire OpenStack cluster
with "pcs cluster stop --all" using pcs 0.10.4 on RHEL 8.2, the cluster
is stopped but pcs returns an error on the command line:

[root@rhel1 ~]# pcs cluster stop --all
rhel2: Stopping Cluster (pacemaker)...
rhel1: Error connecting to rhel1 - (HTTP error: 500)
rhel3: Error connecting to rhel3 - (HTTP error: 500)
Error: unable to stop all nodes
rhel1: Error connecting to rhel1 - (HTTP error: 500)
rhel3: Error connecting to rhel3 - (HTTP error: 500)
rhel1: Not stopping cluster - node is unreachable
rhel3: Not stopping cluster - node is unreachable
rhel2: Stopping Cluster (corosync)...
Error: unable to stop all nodes

The HTTP error is detailed in the pcsd logs:

[root@rhel1 ~]# tail -f /var/log/pcsd/pcsd.log 
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : SRWT Node: rhel3 Request: get_configs
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : SRWT Node: rhel2 Request: get_configs
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : Connecting to: https://rhel3:2224/remote/get_configs?cluster_name=ratester
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : Connecting to: https://rhel2:2224/remote/get_configs?cluster_name=ratester
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : Connecting to: https://rhel1:2224/remote/get_configs?cluster_name=ratester
I, [2020-05-12T04:50:19.041 #00017]     INFO -- : Config files sync finished
I, [2020-05-12T04:50:37.364 #00000]     INFO -- : 200 GET /remote/get_configs?cluster_name=ratester (172.22.38.234) 3.27ms
I, [2020-05-12T04:50:55.454 #00000]     INFO -- : 200 GET /remote/get_configs?cluster_name=ratester (172.22.38.235) 3.44ms
E, [2020-05-12T04:51:18.333 #00000]    ERROR -- : Cannot connect to ruby daemon (message: 'HTTP 599: Empty reply from server'). Is it running?
E, [2020-05-12T04:51:18.335 #00000]    ERROR -- : 500 POST /remote/cluster_stop (172.22.38.233) 30004.61ms

The failing POST /remote/cluster_stop request above took 30004.61ms, so it looks like pcsd returns an error whenever a cluster operation takes more than 30s.
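To check whether other requests hit the same limit, the log can be filtered by request duration. The following is a minimal sketch, not part of pcs: the function name slow_pcsd_requests is hypothetical, and it assumes that request lines end with a "<duration>ms" field as in the excerpt above.

```shell
# Hypothetical helper: print pcsd request lines that took at least the given
# number of milliseconds (default: 30000). Reads pcsd.log content on stdin.
# Assumption: request lines end with a "<duration>ms" field, as in the excerpt.
slow_pcsd_requests() {
    awk -v limit="${1:-30000}" '
        $NF ~ /^[0-9.]+ms$/ {
            ms = $NF              # copy the last field, e.g. "30004.61ms"
            sub(/ms$/, "", ms)    # strip the unit from the copy only
            if (ms + 0 >= limit) print
        }'
}
```

For example, `slow_pcsd_requests < /var/log/pcsd/pcsd.log` would list only requests of 30s or longer, and `slow_pcsd_requests 1000 < /var/log/pcsd/pcsd.log` lowers the threshold to 1s.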

This is a change from the previous behaviour (e.g. in RHEL 8.1), and it prevents us from stopping our cluster globally in a single command, as the cluster is fairly big.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1833506

Version-Release number of selected component (if applicable):
pcs-0.10.4-6.el8.x86_64


How reproducible:
Always


Steps to Reproduce:
1. create an empty cluster <rhel1,rhel2,rhel3>:

    pcs cluster setup ratester --force rhel1 rhel2 rhel3
    pcs cluster start --all
    pcs property set stonith-enabled=false

2. create a dummy resource and force its stop timeout to be 100s

    pcs resource create dummy ocf:pacemaker:Dummy --disabled
    pcs resource update dummy meta op stop timeout=100s
    pcs resource enable dummy

3. change file /usr/lib/ocf/resource.d/pacemaker/Dummy on all nodes to add a 40s delay in the stop operation (the line marked >>> is the addition):
    dummy_stop() {
        dummy_monitor --force
>>>        sleep 40

4. try to stop the cluster globally

    pcs cluster stop --all
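The manual edit in step 3 could also be scripted. The following is a minimal sketch under stated assumptions: the function name add_stop_delay is hypothetical, it relies on GNU sed, and it assumes dummy_stop() in the agent file starts with the "dummy_monitor --force" line shown in step 3 and ends with a closing brace in the first column.

```shell
# Hypothetical helper: insert "sleep <seconds>" right after the
# "dummy_monitor --force" line inside dummy_stop() of the given agent file.
# Assumptions: GNU sed (for -i and "\n" in the replacement) and an agent
# layout matching the snippet in step 3.
add_stop_delay() {
    agent=$1
    delay=${2:-40}
    # Restrict the substitution to the body of dummy_stop() and reuse the
    # indentation of the matched line for the new sleep line.
    sed -i "/^dummy_stop() {/,/^}/ s/^\([[:space:]]*\)dummy_monitor --force\$/&\n\1sleep ${delay}/" "$agent"
}
```

It would be run on every node before step 4, e.g. `add_stop_delay /usr/lib/ocf/resource.d/pacemaker/Dummy 40`. Running it twice adds a second sleep line, so keep a copy of the original agent file to restore afterwards.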


Actual results:
The global cluster stop operation finishes with an error due to a timeout in pcsd


Expected results:
Cluster stop should finish successfully

Comment 1 Tomas Jelinek 2020-05-12 10:50:54 UTC

While I was working and commenting on the original bz1833506, Damien created this one. Since there is a discussion in the original bz, I'm closing this one as a duplicate.

*** This bug has been marked as a duplicate of bug 1833506 ***
