Bug 1948620
Summary: crm_mon not able to detect resource-status when used in stop-operation of a Resource Agent during node shutdown

| Field | Value |
| --- | --- |
| Product | Red Hat Enterprise Linux 8 |
| Component | pacemaker |
| Version | 8.4 |
| Target Release | 8.5 |
| Target Milestone | rc |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | urgent |
| Keywords | Regression, Triaged, ZStream |
| Reporter | Klaus Wenninger <kwenning> |
| Assignee | Klaus Wenninger <kwenning> |
| QA Contact | cluster-qe <cluster-qe> |
| Docs Contact | Steven J. Levine <slevine> |
| CC | cfeist, cluster-maint, huali, kgaillot, lmiksik, msmazova, nwahl, phagara, sbradley, slevine, tnagata |
| Hardware | All |
| OS | All |
| Fixed In Version | pacemaker-2.1.0-1.el8 |
| Target Upstream Version | 2.1.0 |
| Type | Bug |
| Doc Type | Bug Fix |
| Clones | 1951098 (view as bug list) |
| Bug Blocks | 1951098 |
| Last Closed | 2021-11-09 18:44:49 UTC |

Doc Text:

.The `ocf:heartbeat:pgsql` resource agent and some third-party agents no longer fail to stop during a shutdown process

In the RHEL 8.4 GA release, Pacemaker's `crm_mon` command-line tool was modified to display a "shutting down" message rather than the usual cluster information when Pacemaker starts to shut down. As a consequence, shutdown progress, such as the stopping of resources, could not be monitored. In this situation, resource agents that parse `crm_mon` output in their stop operation (such as the `ocf:heartbeat:pgsql` agent distributed with the resource-agents package, or some custom or third-party agents) could fail to stop, leading to cluster problems. This bug has been fixed, and the described problem no longer occurs.
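The failure mode described in the Doc Text above (agents scraping `crm_mon` text output in their stop action) can be illustrated with a minimal, hypothetical sketch. This is not the actual `ocf:heartbeat:pgsql` code, and the `crm_mon` calls are stubbed with canned output from the transcripts in this report so the sketch runs without a cluster:

```shell
#!/bin/sh
# Hypothetical helper in a resource agent's stop path: report the state of
# a resource as shown in crm_mon text output. The crm_mon invocations are
# stubbed with canned samples so this runs without a cluster.

crm_mon_pre_fix() {   # pacemaker-2.0.5: only the shutdown banner
    echo "Pacemaker daemons shutting down ..."
}

crm_mon_post_fix() {  # pacemaker-2.1.0: banner plus the full status
    cat <<'EOF'
Pacemaker daemons shutting down ...
Cluster Summary:
  * Stack: corosync
Full List of Resources:
  * dummy (ocf::pacemaker:Dummy): Stopping virt-539
EOF
}

resource_state() {    # $1 = crm_mon command, $2 = resource id
    "$1" | grep "$2" | sed -n 's/.*): *\([A-Za-z]*\).*/\1/p'
}

resource_state crm_mon_pre_fix dummy    # no output: state cannot be determined
resource_state crm_mon_post_fix dummy   # prints "Stopping"
```

On the pre-fix output the helper finds no resource line at all, so an agent polling it during its stop operation can never confirm the resource's state, which is how the pgsql agent's stop came to fail.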
Description
Klaus Wenninger
2021-04-12 16:02:02 UTC

Fixed upstream by commit 49ebe4c.

Note that this causes the ocf:heartbeat:pgsql agent to fail to stop when the cluster is shutting down, because it parses the output of crm_mon, which is why the severity is elevated and a z-stream is requested. No other agents distributed with RHEL are known to be affected, but third-party and custom agents could potentially be affected as well. Also, the issue was introduced in 8.4, so no earlier z-streams are necessary.

See also the upstream bug for the issue: https://bugs.clusterlabs.org/show_bug.cgi?id=5471

---

(In reply to Ken Gaillot from comment #1)
> Fixed upstream by commit 49ebe4c

That fix doesn't catch cases where the Resource Agent is using the XML output of crm_mon. Upstream pull request #2343 should add the same behavior for the XML output.

---

(In reply to Klaus Wenninger from comment #4)
> That fix doesn't catch cases where the Resource Agent is using the XML output
> of crm_mon. Upstream pull request #2343 should add the same behavior for the
> XML output.

46d6edd has now been merged upstream to complete the fix.

QA: In addition to the reproducer in the description, crm_mon should also be run with the --output-as=xml option to verify that the XML output is also fixed.

---

before fix
-----------

> [root@virt-044 ~]# rpm -q pacemaker
> pacemaker-2.0.5-9.el8.x86_64

Set up a cluster with a resource that takes long enough to stop.
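For reference, a resource matching the transcripts below could be created with something like the following. This is a hypothetical setup sketch, not part of the original report; it assumes a running two-node cluster and uses the resource name and attribute shown in the `pcs resource config` output:

```shell
# Hypothetical reproducer setup (assumes a running cluster).
# op_sleep=10 makes the Dummy agent's start/stop actions sleep for 10
# seconds, leaving a window in which the resource is observably stopping.
pcs resource create dummy ocf:pacemaker:Dummy op_sleep=10
```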
> [root@virt-044 ~]# pcs status
> Cluster name: STSRHTS5688
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-044 (version 2.0.5-9.el8-ba59be7122) - partition with quorum
>   * Last updated: Tue Jun 15 15:12:05 2021
>   * Last change: Tue Jun 15 14:34:33 2021 by root via cibadmin on virt-044
>   * 2 nodes configured
>   * 3 resource instances configured
> Node List:
>   * Online: [ virt-044 virt-045 ]
> Full List of Resources:
>   * fence-virt-044 (stonith:fence_xvm): Started virt-044
>   * fence-virt-045 (stonith:fence_xvm): Started virt-045
>   * dummy (ocf::pacemaker:Dummy): Started virt-044
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

> [root@virt-044 ~]# pcs resource config | grep Dummy -A1
>  Resource: dummy (class=ocf provider=pacemaker type=Dummy)
>   Attributes: op_sleep=10

Stop the cluster and run `pcs status` while the resource is stopping.

> [root@virt-044 ~]# pcs cluster stop --all &>/dev/null & sleep 5; pcs status
> [1] 498489
> Error: error running crm_mon, is pacemaker running?
>   crm_mon: Error: cluster is not available on this node
>   Pacemaker daemons shutting down ...

Status shows only `Pacemaker daemons shutting down` and does not show the resource stopping.

Start the cluster, wait for resource `dummy` to start, then stop the cluster again and test the XML output.

> [root@virt-044 ~]# pcs cluster start --all
> virt-045: Starting Cluster...
> virt-044: Starting Cluster...
> [1]+ Done
> [root@virt-044 ~]# pcs status
> Cluster name: STSRHTS5688
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-044 (version 2.0.5-9.el8-ba59be7122) - partition with quorum
>   * Last updated: Tue Jun 15 15:18:02 2021
>   * Last change: Tue Jun 15 14:45:42 2021 by root via cibadmin on virt-044
>   * 2 nodes configured
>   * 3 resource instances configured
> Node List:
>   * Online: [ virt-044 virt-045 ]
> Full List of Resources:
>   * fence-virt-044 (stonith:fence_xvm): Started virt-044
>   * fence-virt-045 (stonith:fence_xvm): Started virt-045
>   * dummy (ocf::pacemaker:Dummy): Started virt-044
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

> [root@virt-044 ~]# pcs cluster stop --all &>/dev/null & sleep 5; crm_mon -1 --output-as=xml
> [1] 498771
> <pacemaker-result api-version="2.3" request="crm_mon -1 --output-as=xml">
>   <status code="102" message="Not connected">
>     <errors>
>       <error>crm_mon: Error: cluster is not available on this node</error>
>     </errors>
>   </status>
> </pacemaker-result>

XML output shows an error and does not show the resource stopping.

after fix
----------

> [root@virt-539 ~]# rpm -q pacemaker
> pacemaker-2.1.0-2.el8.x86_64

Set up a cluster with a resource that takes long enough to stop.
> [root@virt-539 ~]# pcs status
> Cluster name: STSRHTS20356
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-548 (version 2.1.0-2.el8-7c3f660707) - partition with quorum
>   * Last updated: Tue Jun 15 15:13:55 2021
>   * Last change: Tue Jun 15 14:26:48 2021 by root via cibadmin on virt-539
>   * 2 nodes configured
>   * 3 resource instances configured
> Node List:
>   * Online: [ virt-539 virt-548 ]
> Full List of Resources:
>   * fence-virt-539 (stonith:fence_xvm): Started virt-539
>   * fence-virt-548 (stonith:fence_xvm): Started virt-548
>   * dummy (ocf::pacemaker:Dummy): Started virt-539
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

> [root@virt-539 ~]# pcs resource config | grep Dummy -A1
>  Resource: dummy (class=ocf provider=pacemaker type=Dummy)
>   Attributes: op_sleep=10

Stop the cluster and run `pcs status` while the resource is stopping.

> [root@virt-539 ~]# pcs cluster stop --all &>/dev/null & sleep 5; pcs status
> [1] 142374
> Cluster name: STSRHTS20356
> Pacemaker daemons shutting down ...
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-548 (version 2.1.0-2.el8-7c3f660707) - partition with quorum
>   * Last updated: Tue Jun 15 15:14:24 2021
>   * Last change: Tue Jun 15 14:26:48 2021 by root via cibadmin on virt-539
>   * 2 nodes configured
>   * 3 resource instances configured
> Node List:
>   * Online: [ virt-539 virt-548 ]
> Full List of Resources:
>   * fence-virt-539 (stonith:fence_xvm): Started virt-548
>   * fence-virt-548 (stonith:fence_xvm): Started virt-548
>   * dummy (ocf::pacemaker:Dummy): Stopping virt-539
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: inactive/disabled
>   pcsd: active/enabled

Status shows `Pacemaker daemons shutting down` and also shows the resource stopping.

Start the cluster, wait for resource `dummy` to start, then stop the cluster again and test the XML output.

> [root@virt-539 ~]# pcs cluster start --all
> virt-539: Starting Cluster...
> virt-548: Starting Cluster...
> [root@virt-539 ~]# pcs status
> Cluster name: STSRHTS20356
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-548 (version 2.1.0-2.el8-7c3f660707) - partition with quorum
>   * Last updated: Tue Jun 15 15:21:07 2021
>   * Last change: Tue Jun 15 15:15:59 2021 by root via cibadmin on virt-539
>   * 2 nodes configured
>   * 3 resource instances configured
> Node List:
>   * Online: [ virt-539 virt-548 ]
> Full List of Resources:
>   * fence-virt-539 (stonith:fence_xvm): Started virt-539
>   * fence-virt-548 (stonith:fence_xvm): Started virt-548
>   * dummy (ocf::pacemaker:Dummy): Started virt-539
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

> [root@virt-539 ~]# pcs cluster stop --all &>/dev/null & sleep 5; crm_mon -1 --output-as=xml
> [1] 143514
> <pacemaker-result api-version="2.9" request="crm_mon -1 --output-as=xml">
>   <summary>
>     <stack type="corosync"/>
>     <current_dc present="true" version="2.1.0-2.el8-7c3f660707" name="virt-548" id="2" with_quorum="true"/>
>     <last_update time="Tue Jun 15 15:26:28 2021"/>
>     <last_change time="Tue Jun 15 15:15:59 2021" user="root" client="cibadmin" origin="virt-539"/>
>     <nodes_configured number="2"/>
>     <resources_configured number="3" disabled="0" blocked="0"/>
>     <cluster_options stonith-enabled="true" symmetric-cluster="true" no-quorum-policy="stop" maintenance-mode="false" stop-all-resources="false"/>
>   </summary>
>   <nodes>
>     <node name="virt-539" id="1" online="true" standby="false" standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="true" expected_up="false" is_dc="false" resources_running="1" type="member"/>
>     <node name="virt-548" id="2" online="true" standby="false" standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="true" expected_up="false" is_dc="true" resources_running="1" type="member"/>
>   </nodes>
>   <resources>
>     <resource id="fence-virt-539" resource_agent="stonith:fence_xvm" role="Stopped" active="false" orphaned="false" blocked="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="0"/>
>     <resource id="fence-virt-548" resource_agent="stonith:fence_xvm" role="Started" active="true" orphaned="false" blocked="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1">
>       <node name="virt-548" id="2" cached="true"/>
>     </resource>
>     <resource id="dummy" resource_agent="ocf::pacemaker:Dummy" role="Stopping" active="true" orphaned="false" blocked="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1">
>       <node name="virt-539" id="1" cached="true"/>
>     </resource>
>   </resources>
>   <node_history>
>     <node name="virt-548">
>       <resource_history id="fence-virt-548" orphan="false" migration-threshold="1000000">
>         <operation_history call="14" task="start" rc="0" rc_text="ok" last-rc-change="Tue Jun 15 15:19:31 2021" exec-time="49ms" queue-time="0ms"/>
>         <operation_history call="15" task="monitor" rc="0" rc_text="ok" interval="60000ms" last-rc-change="Tue Jun 15 15:19:31 2021" exec-time="59ms" queue-time="0ms"/>
>       </resource_history>
>     </node>
>     <node name="virt-539">
>       <resource_history id="fence-virt-539" orphan="false" migration-threshold="1000000">
>         <operation_history call="15" task="monitor" rc="0" rc_text="ok" interval="60000ms" last-rc-change="Tue Jun 15 15:19:31 2021" exec-time="41ms" queue-time="0ms"/>
>         <operation_history call="19" task="stop" rc="0" rc_text="ok" last-rc-change="Tue Jun 15 15:26:24 2021" exec-time="0ms" queue-time="0ms"/>
>       </resource_history>
>       <resource_history id="dummy" orphan="false" migration-threshold="1000000">
>         <operation_history call="17" task="monitor" rc="0" rc_text="ok" interval="10000ms" last-rc-change="Tue Jun 15 15:19:52 2021" exec-time="10028ms" queue-time="0ms"/>
>         <operation_history call="-1" task="stop" rc="193" rc_text="unknown" last-rc-change="Tue Jun 15 15:26:24 2021" exec-time="0ms" queue-time="0ms"/>
>       </resource_history>
>     </node>
>   </node_history>
>   <status code="0" message="OK"/>
> </pacemaker-result>

XML output shows the cluster shutting down and the resource stopping.

Marking verified in pacemaker-2.1.0-2.el8.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:4267
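With the fix in place, the XML output stays complete during shutdown, so a stop operation can read both the overall status code and a resource's role from it. A minimal, hypothetical sketch (not the actual pgsql code; a real agent would more robustly use an XML parser such as xmllint, while sed keeps this dependency-free). The `crm_mon` call is replaced by a canned, trimmed sample from the transcript above so the sketch runs without a cluster:

```shell
#!/bin/sh
# Canned, trimmed crm_mon --output-as=xml sample (pacemaker-2.1.0, during
# shutdown) standing in for a live crm_mon invocation.
xml='<pacemaker-result api-version="2.9" request="crm_mon -1 --output-as=xml">
  <resources>
    <resource id="dummy" resource_agent="ocf::pacemaker:Dummy" role="Stopping" active="true"/>
  </resources>
  <status code="0" message="OK"/>
</pacemaker-result>'

# Overall status code: 0 means crm_mon produced usable data; the pre-fix
# version instead reported code 102 ("Not connected") during shutdown.
code=$(printf '%s\n' "$xml" | sed -n 's/.*<status code="\([0-9]*\)".*/\1/p')

# Role of the "dummy" resource, taken from its <resource> element.
role=$(printf '%s\n' "$xml" | sed -n 's/.*id="dummy"[^>]*role="\([A-Za-z]*\)".*/\1/p')

echo "status=$code role=$role"   # prints: status=0 role=Stopping
```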