Bug 1948620
Summary: crm_mon not able to detect resource-status when used in stop-operation of a Resource Agent during node shutdown

| Field | Value |
| --- | --- |
| Product | Red Hat Enterprise Linux 8 |
| Component | pacemaker |
| Version | 8.4 |
| Target Release | 8.5 |
| Target Milestone | rc |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | urgent |
| Keywords | Regression, Triaged, ZStream |
| Reporter | Klaus Wenninger <kwenning> |
| Assignee | Klaus Wenninger <kwenning> |
| QA Contact | cluster-qe <cluster-qe> |
| Docs Contact | Steven J. Levine <slevine> |
| CC | cfeist, cluster-maint, huali, kgaillot, lmiksik, msmazova, nwahl, phagara, sbradley, slevine, tnagata |
| Hardware | All |
| OS | All |
| Fixed In Version | pacemaker-2.1.0-1.el8 |
| Target Upstream Version | 2.1.0 |
| Type | Bug |
| Doc Type | Bug Fix |
| Clones | 1951098 (view as bug list) |
| Bug Blocks | 1951098 |
| Last Closed | 2021-11-09 18:44:49 UTC |

Doc Text:

.The `ocf:heartbeat:pgsql` resource agent and some third-party agents no longer fail to stop during a shutdown process

In the RHEL 8.4 GA release, Pacemaker's `crm_mon` command-line tool was modified to display a "shutting down" message rather than the usual cluster information when Pacemaker starts to shut down. As a consequence, shutdown progress, such as the stopping of resources, could not be monitored. In this situation, resource agents that parse `crm_mon` output in their stop operation (such as the `ocf:heartbeat:pgsql` agent distributed with the resource-agents package, or some custom or third-party agents) could fail to stop, leading to cluster problems. This bug has been fixed, and the described problem no longer occurs.
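The failure mode described in the Doc Text above (agents scraping `crm_mon` text output in their stop action) can be illustrated with a minimal, hypothetical sketch. This is not the actual `ocf:heartbeat:pgsql` code, and the `crm_mon` calls are stubbed with canned output from the transcripts in this report so the sketch runs without a cluster:

```shell
#!/bin/sh
# Hypothetical helper in a resource agent's stop path: report the state of
# a resource as shown in crm_mon text output. The crm_mon invocations are
# stubbed with canned samples so this runs without a cluster.

crm_mon_pre_fix() {   # pacemaker-2.0.5: only the shutdown banner
    echo "Pacemaker daemons shutting down ..."
}

crm_mon_post_fix() {  # pacemaker-2.1.0: banner plus the full status
    cat <<'EOF'
Pacemaker daemons shutting down ...
Cluster Summary:
  * Stack: corosync
Full List of Resources:
  * dummy (ocf::pacemaker:Dummy): Stopping virt-539
EOF
}

resource_state() {    # $1 = crm_mon command, $2 = resource id
    "$1" | grep "$2" | sed -n 's/.*): *\([A-Za-z]*\).*/\1/p'
}

resource_state crm_mon_pre_fix dummy    # no output: state cannot be determined
resource_state crm_mon_post_fix dummy   # prints "Stopping"
```

On the pre-fix output the helper finds no resource line at all, so an agent polling it during its stop operation can never confirm the resource's state, which is how the pgsql agent's stop came to fail.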
Description
Klaus Wenninger
2021-04-12 16:02:02 UTC

Fixed upstream by commit 49ebe4c.

Note that this causes the ocf:heartbeat:pgsql agent to fail to stop when the cluster is shutting down, because it parses the output of crm_mon, which is why the severity is elevated and a z-stream is requested. No other agents distributed with RHEL are known to be affected, but third-party and custom agents could potentially be affected as well. Also, the issue was introduced in 8.4, so no earlier z-streams are necessary.

See also the upstream bug for the issue: https://bugs.clusterlabs.org/show_bug.cgi?id=5471

---

(In reply to Ken Gaillot from comment #1)
> Fixed upstream by commit 49ebe4c

That fix doesn't catch cases where the Resource Agent is using the XML output of crm_mon. Upstream pull request #2343 should add the same behavior for the XML output.

---

(In reply to Klaus Wenninger from comment #4)
> That fix doesn't catch cases where the Resource Agent is using the XML output
> of crm_mon. Upstream pull request #2343 should add the same behavior for the
> XML output.

46d6edd has now been merged upstream to complete the fix.

QA: In addition to the reproducer in the description, crm_mon should also be run with the --output-as=xml option to verify that the XML output is also fixed.

---

before fix
-----------

> [root@virt-044 ~]# rpm -q pacemaker
> pacemaker-2.0.5-9.el8.x86_64

Set up a cluster with a resource that takes long enough to stop.
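For reference, a resource matching the transcripts below could be created with something like the following. This is a hypothetical setup sketch, not part of the original report; it assumes a running two-node cluster and uses the resource name and attribute shown in the `pcs resource config` output:

```shell
# Hypothetical reproducer setup (assumes a running cluster).
# op_sleep=10 makes the Dummy agent's start/stop actions sleep for 10
# seconds, leaving a window in which the resource is observably stopping.
pcs resource create dummy ocf:pacemaker:Dummy op_sleep=10
```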
> [root@virt-044 ~]# pcs status
> Cluster name: STSRHTS5688
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-044 (version 2.0.5-9.el8-ba59be7122) - partition with quorum
>   * Last updated: Tue Jun 15 15:12:05 2021
>   * Last change: Tue Jun 15 14:34:33 2021 by root via cibadmin on virt-044
>   * 2 nodes configured
>   * 3 resource instances configured
> Node List:
>   * Online: [ virt-044 virt-045 ]
> Full List of Resources:
>   * fence-virt-044 (stonith:fence_xvm): Started virt-044
>   * fence-virt-045 (stonith:fence_xvm): Started virt-045
>   * dummy (ocf::pacemaker:Dummy): Started virt-044
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

> [root@virt-044 ~]# pcs resource config | grep Dummy -A1
>  Resource: dummy (class=ocf provider=pacemaker type=Dummy)
>   Attributes: op_sleep=10

Stop the cluster and run `pcs status` while the resource is stopping.

> [root@virt-044 ~]# pcs cluster stop --all &>/dev/null & sleep 5; pcs status
> [1] 498489
> Error: error running crm_mon, is pacemaker running?
>   crm_mon: Error: cluster is not available on this node
>   Pacemaker daemons shutting down ...

Status shows only `Pacemaker daemons shutting down` and does not show the resource stopping.

Start the cluster, wait for resource `dummy` to start, then stop the cluster again and test the XML output.

> [root@virt-044 ~]# pcs cluster start --all
> virt-045: Starting Cluster...
> virt-044: Starting Cluster...
> [1]+ Done
> [root@virt-044 ~]# pcs status
> Cluster name: STSRHTS5688
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-044 (version 2.0.5-9.el8-ba59be7122) - partition with quorum
>   * Last updated: Tue Jun 15 15:18:02 2021
>   * Last change: Tue Jun 15 14:45:42 2021 by root via cibadmin on virt-044
>   * 2 nodes configured
>   * 3 resource instances configured
> Node List:
>   * Online: [ virt-044 virt-045 ]
> Full List of Resources:
>   * fence-virt-044 (stonith:fence_xvm): Started virt-044
>   * fence-virt-045 (stonith:fence_xvm): Started virt-045
>   * dummy (ocf::pacemaker:Dummy): Started virt-044
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

> [root@virt-044 ~]# pcs cluster stop --all &>/dev/null & sleep 5; crm_mon -1 --output-as=xml
> [1] 498771
> <pacemaker-result api-version="2.3" request="crm_mon -1 --output-as=xml">
>   <status code="102" message="Not connected">
>     <errors>
>       <error>crm_mon: Error: cluster is not available on this node</error>
>     </errors>
>   </status>
> </pacemaker-result>

XML output shows an error and does not show the resource stopping.

after fix
----------

> [root@virt-539 ~]# rpm -q pacemaker
> pacemaker-2.1.0-2.el8.x86_64

Set up a cluster with a resource that takes long enough to stop.
> [root@virt-539 ~]# pcs status
> Cluster name: STSRHTS20356
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-548 (version 2.1.0-2.el8-7c3f660707) - partition with quorum
>   * Last updated: Tue Jun 15 15:13:55 2021
>   * Last change: Tue Jun 15 14:26:48 2021 by root via cibadmin on virt-539
>   * 2 nodes configured
>   * 3 resource instances configured
> Node List:
>   * Online: [ virt-539 virt-548 ]
> Full List of Resources:
>   * fence-virt-539 (stonith:fence_xvm): Started virt-539
>   * fence-virt-548 (stonith:fence_xvm): Started virt-548
>   * dummy (ocf::pacemaker:Dummy): Started virt-539
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

> [root@virt-539 ~]# pcs resource config | grep Dummy -A1
>  Resource: dummy (class=ocf provider=pacemaker type=Dummy)
>   Attributes: op_sleep=10

Stop the cluster and run `pcs status` while the resource is stopping.

> [root@virt-539 ~]# pcs cluster stop --all &>/dev/null & sleep 5; pcs status
> [1] 142374
> Cluster name: STSRHTS20356
> Pacemaker daemons shutting down ...
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-548 (version 2.1.0-2.el8-7c3f660707) - partition with quorum
>   * Last updated: Tue Jun 15 15:14:24 2021
>   * Last change: Tue Jun 15 14:26:48 2021 by root via cibadmin on virt-539
>   * 2 nodes configured
>   * 3 resource instances configured
> Node List:
>   * Online: [ virt-539 virt-548 ]
> Full List of Resources:
>   * fence-virt-539 (stonith:fence_xvm): Started virt-548
>   * fence-virt-548 (stonith:fence_xvm): Started virt-548
>   * dummy (ocf::pacemaker:Dummy): Stopping virt-539
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: inactive/disabled
>   pcsd: active/enabled

Status shows `Pacemaker daemons shutting down` and also shows the resource stopping.

Start the cluster, wait for resource `dummy` to start, then stop the cluster again and test the XML output.

> [root@virt-539 ~]# pcs cluster start --all
> virt-539: Starting Cluster...
> virt-548: Starting Cluster...
> [root@virt-539 ~]# pcs status
> Cluster name: STSRHTS20356
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-548 (version 2.1.0-2.el8-7c3f660707) - partition with quorum
>   * Last updated: Tue Jun 15 15:21:07 2021
>   * Last change: Tue Jun 15 15:15:59 2021 by root via cibadmin on virt-539
>   * 2 nodes configured
>   * 3 resource instances configured
> Node List:
>   * Online: [ virt-539 virt-548 ]
> Full List of Resources:
>   * fence-virt-539 (stonith:fence_xvm): Started virt-539
>   * fence-virt-548 (stonith:fence_xvm): Started virt-548
>   * dummy (ocf::pacemaker:Dummy): Started virt-539
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

> [root@virt-539 ~]# pcs cluster stop --all &>/dev/null & sleep 5; crm_mon -1 --output-as=xml
> [1] 143514
> <pacemaker-result api-version="2.9" request="crm_mon -1 --output-as=xml">
>   <summary>
>     <stack type="corosync"/>
>     <current_dc present="true" version="2.1.0-2.el8-7c3f660707" name="virt-548" id="2" with_quorum="true"/>
>     <last_update time="Tue Jun 15 15:26:28 2021"/>
>     <last_change time="Tue Jun 15 15:15:59 2021" user="root" client="cibadmin" origin="virt-539"/>
>     <nodes_configured number="2"/>
>     <resources_configured number="3" disabled="0" blocked="0"/>
>     <cluster_options stonith-enabled="true" symmetric-cluster="true" no-quorum-policy="stop" maintenance-mode="false" stop-all-resources="false"/>
>   </summary>
>   <nodes>
>     <node name="virt-539" id="1" online="true" standby="false" standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="true" expected_up="false" is_dc="false" resources_running="1" type="member"/>
>     <node name="virt-548" id="2" online="true" standby="false" standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="true" expected_up="false" is_dc="true" resources_running="1" type="member"/>
>   </nodes>
>   <resources>
>     <resource id="fence-virt-539" resource_agent="stonith:fence_xvm" role="Stopped" active="false" orphaned="false" blocked="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="0"/>
>     <resource id="fence-virt-548" resource_agent="stonith:fence_xvm" role="Started" active="true" orphaned="false" blocked="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1">
>       <node name="virt-548" id="2" cached="true"/>
>     </resource>
>     <resource id="dummy" resource_agent="ocf::pacemaker:Dummy" role="Stopping" active="true" orphaned="false" blocked="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1">
>       <node name="virt-539" id="1" cached="true"/>
>     </resource>
>   </resources>
>   <node_history>
>     <node name="virt-548">
>       <resource_history id="fence-virt-548" orphan="false" migration-threshold="1000000">
>         <operation_history call="14" task="start" rc="0" rc_text="ok" last-rc-change="Tue Jun 15 15:19:31 2021" exec-time="49ms" queue-time="0ms"/>
>         <operation_history call="15" task="monitor" rc="0" rc_text="ok" interval="60000ms" last-rc-change="Tue Jun 15 15:19:31 2021" exec-time="59ms" queue-time="0ms"/>
>       </resource_history>
>     </node>
>     <node name="virt-539">
>       <resource_history id="fence-virt-539" orphan="false" migration-threshold="1000000">
>         <operation_history call="15" task="monitor" rc="0" rc_text="ok" interval="60000ms" last-rc-change="Tue Jun 15 15:19:31 2021" exec-time="41ms" queue-time="0ms"/>
>         <operation_history call="19" task="stop" rc="0" rc_text="ok" last-rc-change="Tue Jun 15 15:26:24 2021" exec-time="0ms" queue-time="0ms"/>
>       </resource_history>
>       <resource_history id="dummy" orphan="false" migration-threshold="1000000">
>         <operation_history call="17" task="monitor" rc="0" rc_text="ok" interval="10000ms" last-rc-change="Tue Jun 15 15:19:52 2021" exec-time="10028ms" queue-time="0ms"/>
>         <operation_history call="-1" task="stop" rc="193" rc_text="unknown" last-rc-change="Tue Jun 15 15:26:24 2021" exec-time="0ms" queue-time="0ms"/>
>       </resource_history>
>     </node>
>   </node_history>
>   <status code="0" message="OK"/>
> </pacemaker-result>

XML output shows the cluster shutting down and the resource stopping.

Marking verified in pacemaker-2.1.0-2.el8.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:4267
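With the fix in place, the XML output stays complete during shutdown, so a stop operation can read both the overall status code and a resource's role from it. A minimal, hypothetical sketch (not the actual pgsql code; a real agent would more robustly use an XML parser such as xmllint, while sed keeps this dependency-free). The `crm_mon` call is replaced by a canned, trimmed sample from the transcript above so the sketch runs without a cluster:

```shell
#!/bin/sh
# Canned, trimmed crm_mon --output-as=xml sample (pacemaker-2.1.0, during
# shutdown) standing in for a live crm_mon invocation.
xml='<pacemaker-result api-version="2.9" request="crm_mon -1 --output-as=xml">
  <resources>
    <resource id="dummy" resource_agent="ocf::pacemaker:Dummy" role="Stopping" active="true"/>
  </resources>
  <status code="0" message="OK"/>
</pacemaker-result>'

# Overall status code: 0 means crm_mon produced usable data; the pre-fix
# version instead reported code 102 ("Not connected") during shutdown.
code=$(printf '%s\n' "$xml" | sed -n 's/.*<status code="\([0-9]*\)".*/\1/p')

# Role of the "dummy" resource, taken from its <resource> element.
role=$(printf '%s\n' "$xml" | sed -n 's/.*id="dummy"[^>]*role="\([A-Za-z]*\)".*/\1/p')

echo "status=$code role=$role"   # prints: status=0 role=Stopping
```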