Bug 2086230 - crm_mon API result does not validate against schema if fence event has exit-reason
Summary: crm_mon API result does not validate against schema if fence event has exit-reason
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.6
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 8.7
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-14 21:26 UTC by Reid Wahl
Modified: 2022-11-08 10:38 UTC
CC List: 6 users

Fixed In Version: pacemaker-2.1.3-1.el8
Doc Type: Bug Fix
Doc Text:
Cause: Pacemaker's XML schema for command-line tool output did not include the latest changes in possible output for fencing events.
Consequence: Some pcs commands could fail with an XML schema error.
Fix: The XML schema has been brought up to date.
Result: pcs commands do not fail with an XML schema error.
Clone Of:
Environment:
Last Closed: 2022-11-08 09:42:30 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments: none


Links
  * GitHub: ClusterLabs pacemaker pull request 2709, "Fix: schemas: Use fence-event-2.15.rng" (open; last updated 2022-05-14 22:17:05 UTC)
  * Red Hat Issue Tracker: KCSOPP-1546 (last updated 2022-06-22 19:04:55 UTC)
  * Red Hat Issue Tracker: RHELPLAN-122045 (last updated 2022-05-14 21:31:45 UTC)
  * Red Hat Knowledge Base (Solution): 6964927 (last updated 2022-06-27 19:53:57 UTC)
  * Red Hat Product Errata: RHBA-2022:7573 (last updated 2022-11-08 09:42:40 UTC)

Description Reid Wahl 2022-05-14 21:26:13 UTC
Description of problem:

If a fence_event element contains an exit-reason attribute, then the `crm_mon` output does not validate. This is because the crm_mon API schema was not updated when the fence-event schema was updated.

[reid@laptop api]$ ls fence-event*
fence-event-2.0.rng  fence-event-2.15.rng

[reid@laptop api]$ diff fence-event-2.*
21a22,24
>                 <attribute name="exit-reason"> <text /> </attribute>
>             </optional>
>             <optional>

[reid@laptop api]$ ls crm_mon*
crm_mon-2.0.rng  crm_mon-2.12.rng  crm_mon-2.13.rng  crm_mon-2.1.rng  crm_mon-2.2.rng  crm_mon-2.3.rng  crm_mon-2.4.rng  crm_mon-2.7.rng  crm_mon-2.8.rng  crm_mon-2.9.rng

[reid@laptop api]$ grep fence-event crm_mon-2.13.rng 
            <ref name="fence-event-list" />
    <define name="fence-event-list">
                <externalRef href="fence-event-2.0.rng" />

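For reference, the linked upstream fix (pull request 2709, "Fix: schemas: Use fence-event-2.15.rng") points the crm_mon API schema's fence-event-list at the newer fence-event schema. A minimal diff-style sketch of the relevant change; note that pacemaker normally introduces a new versioned crm_mon schema rather than editing an existing one, so the exact file carrying this change is not reproduced here:

```
-                <externalRef href="fence-event-2.0.rng" />
+                <externalRef href="fence-event-2.15.rng" />
```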

A customer reported that a rolling upgrade from 2.1.0-8 to 2.1.2-4 caused some pcs commands (e.g., standby/unstandby, config) to fail on the upgraded node, with an XML schema error.

It turns out the issue is triggered when a fencing failure is present in the history at the time of the rolling upgrade: the upgraded node ends up with a fencing failure that has an exit-reason in its history. The following code gave me a clue about how to reproduce the customer's symptoms, since the customer's fencing history contained the exit_reason shown below:

```
void
stonith__xe_get_result(xmlNode *xml, pcmk__action_result_t *result)
{
...
        /* @COMPAT Peers <=2.1.2 in rolling upgrades provide only a legacy
         * return code, not a full result, so check for that.
         */
        if (crm_element_value_int(xml, F_STONITH_RC, &rc) == 0) {
            if ((rc == pcmk_ok) || (rc == -EINPROGRESS)) {
                exit_status = CRM_EX_OK; 
            }
            execution_status = stonith__legacy2status(rc);
            exit_reason = pcmk_strerror(rc);

        } else {
            execution_status = PCMK_EXEC_ERROR;
            exit_reason = "Fencer reply contained neither a full result "
                          "nor a legacy return code (bug?)";
        }
```
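To check the validation failure directly (outside of pcs), one approach is to validate crm_mon's XML output against the installed API schema with xmllint. This is only a sketch: the schema path and filename under /usr/share/pacemaker/api/ are assumptions and vary by pacemaker version, so substitute whichever top-level schema is installed on the system:

```
# Dump cluster status as XML, then validate it against the pacemaker API schema.
# The .rng filename below is an assumption; check /usr/share/pacemaker/api/
# for the schema shipped with the installed pacemaker version.
crm_mon --one-shot --output-as=xml > /tmp/status.xml
xmllint --noout --relaxng /usr/share/pacemaker/api/api-result.rng /tmp/status.xml
```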

-----

Version-Release number of selected component (if applicable):

pacemaker-2.1.2-4.el8

-----

How reproducible:

Always

-----

Steps to Reproduce:

There may be a simpler way. This is the only way I've found to produce symptoms so far.

1. Cause a fencing failure in a cluster running pacemaker-2.1.0. (Or inject a failure, if there's a way to do so.)
2. Perform a rolling update to pacemaker-2.1.2 on one node. Put the node in standby -> stop and disable the cluster on the node -> update pacemaker -> reboot -> start the cluster on the node.
3. Run `pcs config` and `pcs node unstandby` on the upgraded node.

-----

Actual results:

Error: cannot load cluster status, xml does not conform to the schema

-----

Expected results:

No error

-----

Full demonstration:

[root@fastvm-rhel-8-0-23 pacemaker]# date && rpm -q pacemaker
Sat May 14 14:00:08 PDT 2022
pacemaker-2.1.0-8.el8.x86_64

[root@fastvm-rhel-8-0-24 pcs]# date && rpm -q pacemaker
Sat May 14 14:00:21 PDT 2022
pacemaker-2.1.0-8.el8.x86_64

# # Created a stonith device that contains the node in pcmk_host_list but is incapable of rebooting the node.
# # You can probably produce the same issue (with "No such device" instead of the "Fencer reply..." error message)
# # by having no stonith device associated with the node.
[root@fastvm-rhel-8-0-23 pacemaker]# pcs stonith create vmfence fence_vmware_rest pcmk_host_list=node2 <other options>

[root@fastvm-rhel-8-0-23 pacemaker]# pcs stonith fence node2
Error: unable to fence 'node2'

[root@fastvm-rhel-8-0-23 pacemaker]# pcs stonith confirm node2
WARNING: If node node2 is not powered off or it does have access to shared resources, data corruption and/or cluster failure may occur. Are you sure you want to continue? [y/N] y
Node: node2 confirmed fenced

[root@fastvm-rhel-8-0-23 pacemaker]# crm_mon --one-shot --inactive --exclude=all --include=fencing-failed
Failed Fencing Actions:
  * reboot of node2 failed: delegate=node1, client=stonith_admin.41436, origin=node1, last-failed='2022-05-14 14:03:20 -07:00' 

[root@fastvm-rhel-8-0-24 pcs]# pcs cluster start
Starting Cluster...
[root@fastvm-rhel-8-0-24 pcs]# pcs node standby
[root@fastvm-rhel-8-0-24 pcs]# pcs cluster stop
Stopping Cluster (pacemaker)...
Stopping Cluster (corosync)...
[root@fastvm-rhel-8-0-24 pcs]# pcs cluster disable

[root@fastvm-rhel-8-0-24 pcs]# yum -y update pacemaker pcs
...
Upgraded:
  pacemaker-2.1.2-4.el8.x86_64  pacemaker-cli-2.1.2-4.el8.x86_64  pacemaker-cluster-libs-2.1.2-4.el8.x86_64  pacemaker-libs-2.1.2-4.el8.x86_64  pacemaker-schemas-2.1.2-4.el8.noarch  pcs-0.10.12-6.el8.x86_64 

Complete!
[root@fastvm-rhel-8-0-24 pcs]# date && systemctl reboot
Sat May 14 14:07:39 PDT 2022

[root@fastvm-rhel-8-0-24 ~]# date && rpm -q pacemaker
Sat May 14 14:08:04 PDT 2022
pacemaker-2.1.2-4.el8.x86_64

[root@fastvm-rhel-8-0-24 ~]# pcs cluster start
Starting Cluster...

[root@fastvm-rhel-8-0-24 ~]# crm_mon --one-shot --inactive --exclude=all --include=fencing-failed
Failed Fencing Actions:
  * reboot of node2 failed (Fencer reply contained neither a full result nor a legacy return code (bug?)): delegate=node1, client=stonith_admin.41436, origin=node1, last-failed='2022-05-14 14:03:20 -07:00' 
  
[root@fastvm-rhel-8-0-24 ~]# pcs config
Cluster Name: testcluster
Error: cannot load cluster status, xml does not conform to the schema
[root@fastvm-rhel-8-0-24 ~]# pcs node unstandby
Error: cannot load cluster status, xml does not conform to the schema

Comment 1 Reid Wahl 2022-05-14 22:22:16 UTC
A workaround (in testing so far) is to simply clear the fencing history: `pcs stonith history cleanup`.
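A sketch of applying and confirming the workaround on the upgraded node, using the commands already shown above:

```
# Clear the stored fencing history, then retry the commands that failed.
pcs stonith history cleanup
pcs config
pcs node unstandby
```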

I flagged this as urgent because it is a regression. Feel free to lower the priority, since there seems to be a workaround.

Comment 5 Ken Gaillot 2022-05-19 23:15:17 UTC
Fixed upstream as of commit f4e5f094
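
To tell whether an installed build already carries the fix, compare the installed package against the Fixed In Version above (pacemaker-2.1.3-1.el8):

```
# Any build at or above pacemaker-2.1.3-1.el8 includes the schema update.
rpm -q pacemaker
```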

Comment 11 Markéta Smazová 2022-06-24 16:18:33 UTC
before fix
-----------

>   [root@virt-554 ~]# rpm -q pacemaker
>   pacemaker-2.1.2-4.el8.x86_64

Set up a cluster:

>   [root@virt-554 ~]# pcs status
>   Cluster name: STSRHTS26116
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-554 (version 2.1.2-4.el8-ada5c3b36e2) - partition with quorum
>     * Last updated: Fri Jun 24 16:54:59 2022
>     * Last change:  Fri Jun 24 16:35:22 2022 by root via cibadmin on virt-554
>     * 2 nodes configured
>     * 2 resource instances configured

>   Node List:
>     * Online: [ virt-554 virt-555 ]

>   Full List of Resources:
>     * fence-virt-554	(stonith:fence_xvm):	 Started virt-554
>     * fence-virt-555	(stonith:fence_xvm):	 Started virt-555

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Cause fencing to fail:

>   [root@virt-554 ~]# rm -f /etc/cluster/fence_xvm.key
>   [root@virt-554 ~]# pcs stonith fence virt-555
>   Error: unable to fence 'virt-555'
>   stonith_admin: Couldn't fence virt-555: Timer expired (Fencing did not complete within a total timeout based on the configured timeout and retries for any devices attempted)

>   [root@virt-554 ~]# pcs status
>   Cluster name: STSRHTS26116
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-554 (version 2.1.2-4.el8-ada5c3b36e2) - partition with quorum
>     * Last updated: Fri Jun 24 17:00:52 2022
>     * Last change:  Fri Jun 24 16:35:22 2022 by root via cibadmin on virt-554
>     * 2 nodes configured
>     * 2 resource instances configured

>   Node List:
>     * Online: [ virt-554 virt-555 ]

>   Full List of Resources:
>     * fence-virt-554	(stonith:fence_xvm):	 Started virt-555
>     * fence-virt-555	(stonith:fence_xvm):	 Started virt-555

>   Failed Resource Actions:
>     * fence-virt-554_start_0 on virt-554 'error' (1): call=20, status='Timed Out', exitreason='Fence agent did not complete in time', last-rc-change='Fri Jun 24 16:55:40 2022', queued=0ms, exec=20007ms
>     * fence-virt-555_start_0 on virt-554 'error' (1): call=24, status='Timed Out', exitreason='Fence agent did not complete in time', last-rc-change='Fri Jun 24 16:56:01 2022', queued=0ms, exec=28970ms

>   Failed Fencing Actions:
>     * reboot of virt-555 failed (Fencing did not complete within a total timeout based on the configured timeout and retries for any devices attempted): delegate=virt-554, client=pacemaker-controld.52140, origin=virt-554, last-failed='2022-06-24 16:57:32 +02:00'

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Run `pcs config`:

>   [root@virt-554 ~]# pcs config
>   Cluster Name: STSRHTS26116
>   Error: cannot load cluster status, xml does not conform to the schema


Result: Error: cannot load cluster status, xml does not conform to the schema



after fix
----------

>   [root@virt-550 ~]# rpm -q pacemaker
>   pacemaker-2.1.3-2.el8.x86_64

Set up a cluster:

>   [root@virt-550 ~]# pcs status
>   Cluster name: STSRHTS18729
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-550 (version 2.1.3-2.el8-da2fd79c89) - partition with quorum
>     * Last updated: Fri Jun 24 18:03:25 2022
>     * Last change:  Fri Jun 24 18:02:21 2022 by root via cibadmin on virt-550
>     * 2 nodes configured
>     * 2 resource instances configured

>   Node List:
>     * Online: [ virt-550 virt-551 ]

>   Full List of Resources:
>     * fence-virt-550	(stonith:fence_xvm):	 Started virt-550
>     * fence-virt-551	(stonith:fence_xvm):	 Started virt-551

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Cause fencing to fail:

>   [root@virt-550 ~]# rm -f /etc/cluster/fence_xvm.key
>   [root@virt-550 ~]# pcs stonith fence virt-551
>   Error: unable to fence 'virt-551'
>   stonith_admin: Couldn't fence virt-551: Timer expired

>   [root@virt-550 ~]# pcs status
>   Cluster name: STSRHTS18729
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-550 (version 2.1.3-2.el8-da2fd79c89) - partition with quorum
>     * Last updated: Fri Jun 24 18:06:35 2022
>     * Last change:  Fri Jun 24 18:02:21 2022 by root via cibadmin on virt-550
>     * 2 nodes configured
>     * 2 resource instances configured

>   Node List:
>     * Online: [ virt-550 virt-551 ]

>   Full List of Resources:
>     * fence-virt-550	(stonith:fence_xvm):	 Started virt-551
>     * fence-virt-551	(stonith:fence_xvm):	 Started virt-551

>   Failed Resource Actions:
>     * fence-virt-551_start_0 on virt-550 'error' (1): call=131, status='Timed Out', exitreason='Fence agent did not complete within 20s', last-rc-change='Fri Jun 24 18:04:14 2022', queued=0ms, exec=48126ms
>     * fence-virt-550_start_0 on virt-550 'error' (1): call=127, status='Timed Out', exitreason='Fence agent did not complete within 20s', last-rc-change='Fri Jun 24 18:03:54 2022', queued=0ms, exec=20006ms

>   Failed Fencing Actions:
>     * reboot of virt-551 failed: delegate=virt-550, client=pacemaker-controld.2891, origin=virt-550, last-failed='2022-06-24 18:06:05 +02:00' 
>     * reboot of virt-551 failed: delegate=virt-550, client=stonith_admin.8067, origin=virt-550, last-failed='2022-06-24 18:06:05 +02:00'

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Run `pcs config`:

>   [root@virt-550 ~]# pcs config
>   Cluster Name: STSRHTS18729
>   Corosync Nodes:
>    virt-550 virt-551
>   Pacemaker Nodes:
>    virt-550 virt-551

>   Resources:

>   Stonith Devices:
>     Resource: fence-virt-550 (class=stonith type=fence_xvm)
>       Attributes: fence-virt-550-instance_attributes
>         delay=5
>         pcmk_host_check=static-list
>         pcmk_host_list=virt-550
>         pcmk_host_map=virt-550:virt-550.cluster-qe.lab.eng.brq.redhat.com
>       Operations:
>         monitor: fence-virt-550-monitor-interval-60s
>           interval=60s
>     Resource: fence-virt-551 (class=stonith type=fence_xvm)
>       Attributes: fence-virt-551-instance_attributes
>         pcmk_host_check=static-list
>         pcmk_host_list=virt-551
>         pcmk_host_map=virt-551:virt-551.cluster-qe.lab.eng.brq.redhat.com
>       Operations:
>         monitor: fence-virt-551-monitor-interval-60s
>           interval=60s
>   Fencing Levels:

>   Location Constraints:
>   Ordering Constraints:
>   Colocation Constraints:
>   Ticket Constraints:

>   Alerts:
>    Alert: forwarder (path=/usr/tests/sts-rhel8.7/pacemaker/alerts/alert_forwarder.py)
>     Recipients:
>      Recipient: forwarder-recipient (value=http://do.not.start.xmlrpc)

>   Resources Defaults:
>     No defaults set
>   Operations Defaults:
>     No defaults set

>   Cluster Properties:
>    cluster-infrastructure: corosync
>    cluster-name: STSRHTS18729
>    dc-version: 2.1.3-2.el8-da2fd79c89
>    have-watchdog: false
>    last-lrm-refresh: 1656086241

>   Tags:
>    No tags defined

>   Quorum:
>    Options:


Result: pcs config is displayed without error


Marking verified in pacemaker-2.1.3-2.el8.

Comment 17 errata-xmlrpc 2022-11-08 09:42:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7573

