1948620 – crm_mon not able to detect resource-status when used in stop-operation of a Resource Agent during node shutdown

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1948620 - crm_mon not able to detect resource-status when used in stop-operation of a Resource Agent during node shutdown

Summary: crm_mon not able to detect resource-status when used in stop-operation of a R...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	8.4
Hardware:	All
OS:	All
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	8.5
Assignee:	Klaus Wenninger
QA Contact:	cluster-qe@redhat.com
Docs Contact:	Steven J. Levine
URL:
Whiteboard:
Depends On:
Blocks:	1951098
TreeView+	depends on / blocked

Reported:	2021-04-12 16:02 UTC by Klaus Wenninger
Modified:	2024-06-14 01:12 UTC (History)
CC List:	11 users (show)
Fixed In Version:	pacemaker-2.1.0-1.el8
Doc Type:	Bug Fix
Doc Text:	.The `ocf:heartbeat:pgsql` resource agent and some third-party agents no longer fail to stop during a shutdown process In the RHEL 8.4 GA release, Pacemaker's `crm_mon` command-line tool was modified to display a "shutting down" message rather than the usual cluster information when Pacemaker starts to shut down. As a consequence, shutdown progress, such as the stopping of resources, could not be monitored. In this situation, resource agents that parse `crm_mon` output in their stop operation (such as the `ocf:heartbeat:pgsql` agent distributed with the resource-agents package, or some custom or third-party agents) could fail to stop, leading to cluster problems. This bug has been fixed, and the described problem no longer occurs.
Clone Of:
Clones:	1951098 (view as bug list)
Environment:
Last Closed:	2021-11-09 18:44:49 UTC
Type:	Bug
Target Upstream Version:	2.1.0
Embargoed:
Dependent Products:
Flags:	pm-rhel: mirror+

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Cluster Labs	5471	None	None	None	2021-04-12 18:37:13 UTC
Red Hat Knowledge Base (Solution)	6034771	None	None	None	2021-05-12 02:01:29 UTC
Red Hat Product Errata	RHEA-2021:4267	None	None	None	2021-11-09 18:45:25 UTC

Description Klaus Wenninger 2021-04-12 16:02:02 UTC

Description of problem:

Due to checking the state pacemakerd is running in crm_mon isn't trying to connect to CIB anymore if pacemakerd is already shutting down the subdaemons

Version-Release number of selected component (if applicable):

2.0.5-8.el8

How reproducible:

100%

Steps to Reproduce:
1. Setup a cluster with at least 1 node and 1 resource. Resource should take a significant time to be shut down when running on a node.
2. Open 3 terminals going to the node where the sluggish resource is running
3. Start crm_mon in interactive-mode in one terminal
4. issue 'systemctl stop pacemaker' in the 2nd terminal
5. Once you see that that the sluggish resource is shutting down in the 1st terminal issue 'crm_mon -1'

Actual results:

crm_mon should give you 'Pacemaker daemons shutting down ...' to indicate that it isn't sure that CIB is accessible anymore due to shutdown and thus doesn't try.

Expected results:

crm_mon will still try to contact CIB and give you a result as all the daemons should still be running because the sluggish resource is delaying shutdown.

Additional info:

If additional info shown by crm_mon (depending on the mode it is used interactive vs. e.g. one-shot) is desirable is still under discussion.
Either some thing like 'Pacemaker daemons shutting down ... still trying to reach CIB' or 'Pacemaker is evacuating/shutting down resources from/on node' or ...

Comment 1 Ken Gaillot 2021-04-12 16:09:31 UTC

Fixed upstream by commit 49ebe4c

Note that this causes the ocf:heartbeat:pgsql agent to fail to stop when the cluster is shutting down, because it parses the output of crm_mon, which is why the severity is elevated and a z-stream is requested. No other agents distributed with RHEL are known to be affected, but third-party and custom agents could potentially be affected as well.

Comment 2 Ken Gaillot 2021-04-12 16:10:52 UTC

Also, the issue was introduced in 8.4, so no earlier z-streams are necessary

Comment 3 Klaus Wenninger 2021-04-12 16:16:43 UTC

See as well upstream-bug for the issue https://bugs.clusterlabs.org/show_bug.cgi?id=5471

Comment 4 Klaus Wenninger 2021-04-13 11:45:05 UTC

(In reply to Ken Gaillot from comment #1)
> Fixed upstream by commit 49ebe4c
> 

That fix doesn't catch cases where the Resource Agent is using xml-output of crm_mon.
Upstream Pull Request #2343 should add same behavior for xml-output.

Comment 5 Ken Gaillot 2021-04-14 14:14:12 UTC

(In reply to Klaus Wenninger from comment #4)
> (In reply to Ken Gaillot from comment #1)
> > Fixed upstream by commit 49ebe4c
> > 
> 
> That fix doesn't catch cases where the Resource Agent is using xml-output of
> crm_mon.
> Upstream Pull Request #2343 should add same behavior for xml-output.

46d6edd has now been merged upstream to complete the fix

Comment 6 Ken Gaillot 2021-04-14 17:23:32 UTC

QA: In addition to the reproducer in the description, crm_mon should also be run with the --output-as=xml option to verify that XML output is also fixed.

Comment 36 Markéta Smazová 2021-06-15 14:21:49 UTC

before fix
-----------

>   [root@virt-044 ~]# rpm -q pacemaker
>   pacemaker-2.0.5-9.el8.x86_64

Setup a cluster with a resource that takes long enough to stop.

>   [root@virt-044 ~]# pcs status
>   Cluster name: STSRHTS5688
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-044 (version 2.0.5-9.el8-ba59be7122) - partition with quorum
>     * Last updated: Tue Jun 15 15:12:05 2021
>     * Last change:  Tue Jun 15 14:34:33 2021 by root via cibadmin on virt-044
>     * 2 nodes configured
>     * 3 resource instances configured

>   Node List:
>     * Online: [ virt-044 virt-045 ]

>   Full List of Resources:
>     * fence-virt-044	(stonith:fence_xvm):	 Started virt-044
>     * fence-virt-045	(stonith:fence_xvm):	 Started virt-045
>     * dummy	(ocf::pacemaker:Dummy):	 Started virt-044

>   Daemon Status:
>     corosync: active/disabled
>     pacemaker: active/disabled
>     pcsd: active/enabled
>   [root@virt-044 ~]# pcs resource config | grep Dummy -A1
>    Resource: dummy (class=ocf provider=pacemaker type=Dummy)
>     Attributes: op_sleep=10

Stop the cluster and run `pcs status` while the resource is stopping. 

>   [root@virt-044 ~]# pcs cluster stop --all &>/dev/null & sleep 5; pcs status
>   [1] 498489
>   Error: error running crm_mon, is pacemaker running?
>     crm_mon: Error: cluster is not available on this node
>     Pacemaker daemons shutting down ...

Status shows only `Pacemaker daemons shutting down` and doesn't show resource stopping.

Start the cluster, wait for resource `dummy` to start, then stop the cluster again and test XML output.

>   [root@virt-044 ~]# pcs cluster start --all
>   virt-045: Starting Cluster...
>   virt-044: Starting Cluster...
>   [1]+  Done

>   [root@virt-044 ~]# pcs status
>   Cluster name: STSRHTS5688
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-044 (version 2.0.5-9.el8-ba59be7122) - partition with quorum
>     * Last updated: Tue Jun 15 15:18:02 2021
>     * Last change:  Tue Jun 15 14:45:42 2021 by root via cibadmin on virt-044
>     * 2 nodes configured
>     * 3 resource instances configured

>   Node List:
>     * Online: [ virt-044 virt-045 ]

>   Full List of Resources:
>     * fence-virt-044	(stonith:fence_xvm):	 Started virt-044
>     * fence-virt-045	(stonith:fence_xvm):	 Started virt-045
>     * dummy	(ocf::pacemaker:Dummy):	 Started virt-044

>   Daemon Status:
>     corosync: active/disabled
>     pacemaker: active/disabled
>     pcsd: active/enabled

>   [root@virt-044 ~]# pcs cluster stop --all &>/dev/null & sleep 5; crm_mon -1 --output-as=xml
>   [1] 498771
>   <pacemaker-result api-version="2.3" request="crm_mon -1 --output-as=xml">
>     <status code="102" message="Not connected">
>       <errors>
>         <error>crm_mon: Error: cluster is not available on this node</error>
>       </errors>
>     </status>
>   </pacemaker-result>

XML output shows error and doesn't show resource stopping.



after fix
----------

>   [root@virt-539 ~]# rpm -q pacemaker
>   pacemaker-2.1.0-2.el8.x86_64

Setup a cluster with a resource that takes long enough to stop.

>   [root@virt-539 ~]# pcs status
>   Cluster name: STSRHTS20356
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-548 (version 2.1.0-2.el8-7c3f660707) - partition with quorum
>     * Last updated: Tue Jun 15 15:13:55 2021
>     * Last change:  Tue Jun 15 14:26:48 2021 by root via cibadmin on virt-539
>     * 2 nodes configured
>     * 3 resource instances configured

>   Node List:
>     * Online: [ virt-539 virt-548 ]

>   Full List of Resources:
>     * fence-virt-539	(stonith:fence_xvm):	 Started virt-539
>     * fence-virt-548	(stonith:fence_xvm):	 Started virt-548
>     * dummy	(ocf::pacemaker:Dummy):	 Started virt-539

>   Daemon Status:
>     corosync: active/disabled
>     pacemaker: active/disabled
>     pcsd: active/enabled
>   [root@virt-539 ~]# pcs resource config | grep Dummy -A1
>    Resource: dummy (class=ocf provider=pacemaker type=Dummy)
>     Attributes: op_sleep=10

Stop the cluster and run `pcs status` while the resource is stopping. 

>   [root@virt-539 ~]# pcs cluster stop --all &>/dev/null & sleep 5; pcs status
>   [1] 142374
>   Cluster name: STSRHTS20356
>   Pacemaker daemons shutting down ...
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-548 (version 2.1.0-2.el8-7c3f660707) - partition with quorum
>     * Last updated: Tue Jun 15 15:14:24 2021
>     * Last change:  Tue Jun 15 14:26:48 2021 by root via cibadmin on virt-539
>     * 2 nodes configured
>     * 3 resource instances configured

>   Node List:
>     * Online: [ virt-539 virt-548 ]

>   Full List of Resources:
>     * fence-virt-539	(stonith:fence_xvm):	 Started virt-548
>     * fence-virt-548	(stonith:fence_xvm):	 Started virt-548
>     * dummy	(ocf::pacemaker:Dummy):	 Stopping virt-539

>   Daemon Status:
>     corosync: active/disabled
>     pacemaker: inactive/disabled
>     pcsd: active/enabled

Status shows `Pacemaker daemons shutting down` and also shows resource stopping.

Start the cluster, wait for resource `dummy` to start, then stop the cluster again and test XML output.

>   [root@virt-539 ~]# pcs cluster start --all
>   virt-539: Starting Cluster...
>   virt-548: Starting Cluster...

>   [root@virt-539 ~]# pcs status
>   Cluster name: STSRHTS20356
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-548 (version 2.1.0-2.el8-7c3f660707) - partition with quorum
>     * Last updated: Tue Jun 15 15:21:07 2021
>     * Last change:  Tue Jun 15 15:15:59 2021 by root via cibadmin on virt-539
>     * 2 nodes configured
>     * 3 resource instances configured

>   Node List:
>     * Online: [ virt-539 virt-548 ]

>   Full List of Resources:
>     * fence-virt-539	(stonith:fence_xvm):	 Started virt-539
>     * fence-virt-548	(stonith:fence_xvm):	 Started virt-548
>     * dummy	(ocf::pacemaker:Dummy):	 Started virt-539

>   Daemon Status:
>     corosync: active/disabled
>     pacemaker: active/disabled
>     pcsd: active/enabled

>   [root@virt-539 ~]# pcs cluster stop --all &>/dev/null & sleep 5; crm_mon -1 --output-as=xml
>   [1] 143514
>   <pacemaker-result api-version="2.9" request="crm_mon -1 --output-as=xml">
>     <summary>
>       <stack type="corosync"/>
>       <current_dc present="true" version="2.1.0-2.el8-7c3f660707" name="virt-548" id="2" with_quorum="true"/>
>       <last_update time="Tue Jun 15 15:26:28 2021"/>
>       <last_change time="Tue Jun 15 15:15:59 2021" user="root" client="cibadmin" origin="virt-539"/>
>       <nodes_configured number="2"/>
>       <resources_configured number="3" disabled="0" blocked="0"/>
>       <cluster_options stonith-enabled="true" symmetric-cluster="true" no-quorum-policy="stop" maintenance-mode="false" stop-all-resources="false"/>
>     </summary>
>     <nodes>
>       <node name="virt-539" id="1" online="true" standby="false" standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="true" expected_up="false" is_dc="false" resources_running="1" type="member"/>
>       <node name="virt-548" id="2" online="true" standby="false" standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="true" expected_up="false" is_dc="true" resources_running="1" type="member"/>
>     </nodes>
>     <resources>
>       <resource id="fence-virt-539" resource_agent="stonith:fence_xvm" role="Stopped" active="false" orphaned="false" blocked="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="0"/>
>       <resource id="fence-virt-548" resource_agent="stonith:fence_xvm" role="Started" active="true" orphaned="false" blocked="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1">
>         <node name="virt-548" id="2" cached="true"/>
>       </resource>
>       <resource id="dummy" resource_agent="ocf::pacemaker:Dummy" role="Stopping" active="true" orphaned="false" blocked="false" managed="true" failed="false" failure_ignored="false" nodes_running_on="1">
>         <node name="virt-539" id="1" cached="true"/>
>       </resource>
>     </resources>
>     <node_history>
>       <node name="virt-548">
>         <resource_history id="fence-virt-548" orphan="false" migration-threshold="1000000">
>           <operation_history call="14" task="start" rc="0" rc_text="ok" last-rc-change="Tue Jun 15 15:19:31 2021" exec-time="49ms" queue-time="0ms"/>
>           <operation_history call="15" task="monitor" rc="0" rc_text="ok" interval="60000ms" last-rc-change="Tue Jun 15 15:19:31 2021" exec-time="59ms" queue-time="0ms"/>
>         </resource_history>
>       </node>
>       <node name="virt-539">
>         <resource_history id="fence-virt-539" orphan="false" migration-threshold="1000000">
>           <operation_history call="15" task="monitor" rc="0" rc_text="ok" interval="60000ms" last-rc-change="Tue Jun 15 15:19:31 2021" exec-time="41ms" queue-time="0ms"/>
>           <operation_history call="19" task="stop" rc="0" rc_text="ok" last-rc-change="Tue Jun 15 15:26:24 2021" exec-time="0ms" queue-time="0ms"/>
>         </resource_history>
>         <resource_history id="dummy" orphan="false" migration-threshold="1000000">
>           <operation_history call="17" task="monitor" rc="0" rc_text="ok" interval="10000ms" last-rc-change="Tue Jun 15 15:19:52 2021" exec-time="10028ms" queue-time="0ms"/>
>           <operation_history call="-1" task="stop" rc="193" rc_text="unknown" last-rc-change="Tue Jun 15 15:26:24 2021" exec-time="0ms" queue-time="0ms"/>
>         </resource_history>
>       </node>
>     </node_history>
>     <status code="0" message="OK"/>
>   </pacemaker-result>

XML output shows cluster shutting down and resource stopping.


marking verified in pacemaker-2.1.0-2.el8

Comment 44 errata-xmlrpc 2021-11-09 18:44:49 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:4267

Note You need to log in before you can comment on or make changes to this bug.