Bug 1833506
Summary: pcs cluster stop --all throws errors and doesn't seem to honor the request-timeout option

Product: Red Hat Enterprise Linux 8
Reporter: Pavan <pkesavar>
Component: pcs
Assignee: Tomas Jelinek <tojeline>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Priority: urgent
Version: 8.2
CC: aschultz, cfeist, cluster-maint, dciabrin, idevat, jbadiapa, jschluet, lmiccini, mlisik, mpospisi, nhostako, omular, pkomarov, tojeline
Target Milestone: rc
Keywords: Regression, ZStream
Target Release: 8.2
Flags: pm-rhel: mirror+
Hardware: x86_64
OS: Linux
Fixed In Version: pcs-0.10.6-1.el8
Doc Type: Bug Fix
Doc Text:
  Cause: Communication between pcsd components times out if there is no response in 30 seconds.
  Consequence: Actions performed by pcsd taking longer than 30 seconds never succeed.
  Fix: Remove the timeout for internal pcsd communication.
  Result: Actions taking a long time work properly.
Clones: 1838084 (view as bug list)
Last Closed: 2020-11-04 02:28:18 UTC
Type: Bug
Bug Blocks: 1837038, 1838084
Comment 3
Luca Miccini
2020-05-11 15:09:56 UTC
We had a look with Luca this morning and we think this is an error in pcs shipped in RHEL 8.2 (pcs-0.10.4-6.el8.x86_64). It does not reproduce with pcs shipped in RHEL 8.1 (pcs-0.10.2-4.el8.x86_64). We are not sure yet whether it is a default configuration change or a regression.

When running "pcs cluster stop --all", pcs internally connects to the pcsd servers on all nodes to forward the stop command. This logs an unexpected error on the command line:

    overcloud-controller-2: Error connecting to overcloud-controller-2 - (HTTP error: 500)
    overcloud-controller-1: Error connecting to overcloud-controller-1 - (HTTP error: 500)
    overcloud-controller-0: Error connecting to overcloud-controller-0 - (HTTP error: 500)

The real error is detailed in the pcsd log:

    E, [2020-05-12T06:46:44.103 #00000] ERROR -- : Cannot connect to ruby daemon (message: 'HTTP 599: Empty reply from server'). Is it running?
    E, [2020-05-12T06:46:44.104 #00000] ERROR -- : 500 POST /remote/cluster_stop (172.17.0.42) 30002.90ms

When downgrading all cluster nodes to pcs 0.10.2-4, the error does not reproduce.

Note: We also observe that stopping a cluster locally with "pcs cluster stop" does not produce the error, so regular minor update operations are not impacted. We are going to try to reproduce this on an empty Pacemaker cluster and, if successful, file a RHEL bug to track the issue.

I confirm this is a pcs/pcsd issue and the two lines from pcsd.log describe it. It cannot happen with RHEL 8.1 packages, because the ruby daemon in question was added in the 8.2 packages (bz1783106). It also cannot happen when stopping a cluster locally with "pcs cluster stop", because no pcsd is involved (as long as the pcs command is run by root). It is expected that increasing --request-timeout has no effect in this case: connections to pcsd running on the cluster nodes were established successfully, and responses were received in less than a minute, which is the default timeout.
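The two timeout layers described above can be sketched as a toy model (this is not pcs source code; the function and its parameters are illustrative, with the numbers taken from this report: a 60-second default --request-timeout, a 30-second internal python-to-ruby timeout, and a 35-second resource stop):

```python
# Toy model of the two timeout layers in this bug -- NOT pcs code.
def cluster_stop(stop_seconds, request_timeout=60, internal_timeout=30):
    """Return the HTTP status the pcs client would see for one node."""
    # The remote pcsd answers the client well within --request-timeout,
    # so request_timeout is never the limiting factor here; raising it
    # (the reporter's attempted workaround) changes nothing.
    if internal_timeout is not None and stop_seconds > internal_timeout:
        # The ruby daemon connection dies ("HTTP 599: Empty reply from
        # server"), which pcsd surfaces to the client as a 500.
        return 500
    return 200

# pcs-0.10.4 behaviour: a 35s stop fails even with a huge client timeout.
assert cluster_stop(35) == 500
assert cluster_stop(35, request_timeout=600) == 500

# pcs-0.10.6 fix: the internal timeout is removed, long stops succeed.
assert cluster_stop(35, internal_timeout=None) == 200
```

This matches the log above: the request fails after roughly 30002 ms regardless of the client-side setting.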
Connections between the two local pcs daemons (python to ruby) are not affected by the --request-timeout value. So far, I haven't been able to reproduce this; we will investigate further. Does this happen all the time, or just randomly here and there?

We have a reliable reproducer. The issue happens every time stopping a cluster on a node takes more than 30 seconds.

Reproducer:

    # pcs resource create test delay startdelay=1 stopdelay=35
    # pcs cluster stop --all

Fixed reproducer:

    # pcs resource create test delay startdelay=1 stopdelay=35 op stop timeout=40
    # pcs cluster stop --all

The stop timeout must be longer than the stopdelay, otherwise the node running the resource gets fenced (which is the correct course of action, since the resource did not stop in time).

*** Bug 1834696 has been marked as a duplicate of this bug. ***

Created attachment 1688511 [details]: proposed fix

Reproducer / test in comment 7.

*** Bug 1837038 has been marked as a duplicate of this bug. ***

Test:

    [root@r8-node-01 pcs]# rpm -q pcs
    pcs-0.10.6-1.el8.x86_64
    [root@r8-node-01 pcs]# pcs resource create test ocf:heartbeat:Delay startdelay=1 stopdelay=35 op stop timeout=40
    [root@r8-node-01 pcs]# pcs resource
      * test (ocf::heartbeat:Delay): Started r8-node-01
    [root@r8-node-01 pcs]# time pcs cluster stop --all
    r8-node-02: Stopping Cluster (pacemaker)...
    r8-node-01: Stopping Cluster (pacemaker)...
    r8-node-02: Stopping Cluster (corosync)...
    r8-node-01: Stopping Cluster (corosync)...

    real 0m39.025s
    user 0m0.575s
    sys 0m0.065s

Created attachment 1698193 [details]
bzaf_auto_verification_output_06202020064418
Spec executed successfully
Verifying bug as VERIFIED
Verification output is attached to comment
Generated by bzaf 0.0.1.dev49
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pcs bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4617