Bug 2086230

Summary: crm_mon API result does not validate against schema if fence event has exit-reason
Product: Red Hat Enterprise Linux 8
Reporter: Reid Wahl <nwahl>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Priority: urgent
Version: 8.6
CC: cluster-maint, hbiswas, kgaillot, msmazova, sbradley, slevine
Target Milestone: rc
Keywords: Regression, Triaged
Target Release: 8.7
Flags: pm-rhel: mirror+
Hardware: All
OS: Linux
Fixed In Version: pacemaker-2.1.3-1.el8
Doc Type: Bug Fix
Doc Text:
Cause: Pacemaker's XML schema for command-line tool output did not include the latest changes in possible output for fencing events.
Consequence: Some pcs commands could fail with an XML schema error.
Fix: The XML schema has been brought up to date.
Result: pcs commands do not fail with an XML schema error.
Last Closed: 2022-11-08 09:42:30 UTC
Type: Bug

Description Reid Wahl 2022-05-14 21:26:13 UTC
Description of problem:

If a fence_event element contains an exit-reason attribute, then `crm_mon`'s XML output does not validate against the API schema. This is because the crm_mon API schema was not updated when the fence-event schema was: fence-event-2.15 added an optional exit-reason attribute, but the newest crm_mon schema (crm_mon-2.13) still references fence-event-2.0.

[reid@laptop api]$ ls fence-event*
fence-event-2.0.rng  fence-event-2.15.rng

[reid@laptop api]$ diff fence-event-2.*
21a22,24
>                 <attribute name="exit-reason"> <text /> </attribute>
>             </optional>
>             <optional>

[reid@laptop api]$ ls crm_mon*
crm_mon-2.0.rng  crm_mon-2.12.rng  crm_mon-2.13.rng  crm_mon-2.1.rng  crm_mon-2.2.rng  crm_mon-2.3.rng  crm_mon-2.4.rng  crm_mon-2.7.rng  crm_mon-2.8.rng  crm_mon-2.9.rng

[reid@laptop api]$ grep fence-event crm_mon-2.13.rng 
            <ref name="fence-event-list" />
    <define name="fence-event-list">
                <externalRef href="fence-event-2.0.rng" />
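
To confirm the mismatch directly, here is a minimal sketch (the schema directory and filenames are assumptions based on where pacemaker-schemas installs the API schemas on RHEL 8; adjust to whatever versions are present on the node):

```
# Dump the live cluster status as XML
crm_mon --output-as=xml > /tmp/crm_mon.xml

# Validate it against the newest installed API result schema.
# With a fence event carrying an exit-reason in the fencing history,
# validation fails against the schemas shipped with 2.1.2.
xmllint --noout --relaxng \
    "$(ls /usr/share/pacemaker/api/api-result-*.rng | sort -V | tail -1)" \
    /tmp/crm_mon.xml
```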


A customer reported that a rolling upgrade from 2.1.0-8 to 2.1.2-4 caused some pcs commands (e.g., standby/unstandby, config) to fail on the upgraded node, with an XML schema error.

It turns out that the issue is triggered when a fencing failure is already in the history at the time of the rolling upgrade, because after the upgrade that failure carries an exit-reason. The following code gave me a clue on how to reproduce the customer's symptoms, as the customer's fencing history contained the exit_reason shown below:

```
void
stonith__xe_get_result(xmlNode *xml, pcmk__action_result_t *result)
{
...
        /* @COMPAT Peers <=2.1.2 in rolling upgrades provide only a legacy
         * return code, not a full result, so check for that.
         */
        if (crm_element_value_int(xml, F_STONITH_RC, &rc) == 0) {
            if ((rc == pcmk_ok) || (rc == -EINPROGRESS)) {
                exit_status = CRM_EX_OK; 
            }
            execution_status = stonith__legacy2status(rc);
            exit_reason = pcmk_strerror(rc);

        } else {
            execution_status = PCMK_EXEC_ERROR;
            exit_reason = "Fencer reply contained neither a full result "
                          "nor a legacy return code (bug?)";
        }
```

-----

Version-Release number of selected component (if applicable):

pacemaker-2.1.2-4.el8

-----

How reproducible:

Always

-----

Steps to Reproduce:

There may be a simpler way. This is the only way I've found to produce symptoms so far.

1. Cause a fencing failure in a cluster running pacemaker-2.1.0. (Or inject a failure, if there's a way to do so.)
2. Perform a rolling upgrade to pacemaker-2.1.2 on one node: put the node in standby -> stop and disable the cluster on the node -> update pacemaker -> reboot -> start the cluster on the node (see the command sketch after this list).
3. Run `pcs config` and `pcs node unstandby` on the upgraded node.
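
For convenience, step 2 as a shell sketch; this simply mirrors the full demonstration further down (run on the node being upgraded):

```
pcs node standby             # put this node in standby
pcs cluster stop             # stop the cluster on this node
pcs cluster disable          # keep it from starting at boot
yum -y update pacemaker pcs  # update the packages
systemctl reboot

# after the node comes back up:
pcs cluster start
```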

-----

Actual results:

Error: cannot load cluster status, xml does not conform to the schema

-----

Expected results:

No error

-----

Full demonstration:

[root@fastvm-rhel-8-0-23 pacemaker]# date && rpm -q pacemaker
Sat May 14 14:00:08 PDT 2022
pacemaker-2.1.0-8.el8.x86_64

[root@fastvm-rhel-8-0-24 pcs]# date && rpm -q pacemaker
Sat May 14 14:00:21 PDT 2022
pacemaker-2.1.0-8.el8.x86_64

# # Created a stonith device that contains the node in pcmk_host_list but is incapable of rebooting the node.
# # You can probably produce the same issue (with "No such device" instead of the "Fencer reply..." error message)
# # by having no stonith device associated with the node.
[root@fastvm-rhel-8-0-23 pacemaker]# pcs stonith create vmfence fence_vmware_rest pcmk_host_list=node2 <other options>

[root@fastvm-rhel-8-0-23 pacemaker]# pcs stonith fence node2
Error: unable to fence 'node2'

[root@fastvm-rhel-8-0-23 pacemaker]# pcs stonith confirm node2
WARNING: If node node2 is not powered off or it does have access to shared resources, data corruption and/or cluster failure may occur. Are you sure you want to continue? [y/N] y
Node: node2 confirmed fenced

[root@fastvm-rhel-8-0-23 pacemaker]# crm_mon --one-shot --inactive --exclude=all --include=fencing-failed
Failed Fencing Actions:
  * reboot of node2 failed: delegate=node1, client=stonith_admin.41436, origin=node1, last-failed='2022-05-14 14:03:20 -07:00' 

[root@fastvm-rhel-8-0-24 pcs]# pcs cluster start
Starting Cluster...
[root@fastvm-rhel-8-0-24 pcs]# pcs node standby
[root@fastvm-rhel-8-0-24 pcs]# pcs cluster stop
Stopping Cluster (pacemaker)...
Stopping Cluster (corosync)...
[root@fastvm-rhel-8-0-24 pcs]# pcs cluster disable

[root@fastvm-rhel-8-0-24 pcs]# yum -y update pacemaker pcs
...
Upgraded:
  pacemaker-2.1.2-4.el8.x86_64  pacemaker-cli-2.1.2-4.el8.x86_64  pacemaker-cluster-libs-2.1.2-4.el8.x86_64  pacemaker-libs-2.1.2-4.el8.x86_64  pacemaker-schemas-2.1.2-4.el8.noarch  pcs-0.10.12-6.el8.x86_64 

Complete!
[root@fastvm-rhel-8-0-24 pcs]# date && systemctl reboot
Sat May 14 14:07:39 PDT 2022

[root@fastvm-rhel-8-0-24 ~]# date && rpm -q pacemaker
Sat May 14 14:08:04 PDT 2022
pacemaker-2.1.2-4.el8.x86_64

[root@fastvm-rhel-8-0-24 ~]# pcs cluster start
Starting Cluster...

[root@fastvm-rhel-8-0-24 ~]# crm_mon --one-shot --inactive --exclude=all --include=fencing-failed
Failed Fencing Actions:
  * reboot of node2 failed (Fencer reply contained neither a full result nor a legacy return code (bug?)): delegate=node1, client=stonith_admin.41436, origin=node1, last-failed='2022-05-14 14:03:20 -07:00' 
  
[root@fastvm-rhel-8-0-24 ~]# pcs config
Cluster Name: testcluster
Error: cannot load cluster status, xml does not conform to the schema
[root@fastvm-rhel-8-0-24 ~]# pcs node unstandby
Error: cannot load cluster status, xml does not conform to the schema

Comment 1 Reid Wahl 2022-05-14 22:22:16 UTC
A workaround (in testing so far) is to simply clear the fencing history: `pcs stonith history cleanup`.
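
A minimal sketch of applying the workaround on the upgraded node (assuming the old fencing history is safe to discard):

```
pcs stonith history cleanup   # clear the stored fencing history
pcs config                    # should now render without a schema error
pcs node unstandby
```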

I flagged this as urgent because it is a regression. Feel free to lower it, since there seems to be a workaround.

Comment 5 Ken Gaillot 2022-05-19 23:15:17 UTC
Fixed upstream as of commit f4e5f094

Comment 11 Markéta Smazová 2022-06-24 16:18:33 UTC
before fix
-----------

>   [root@virt-554 ~]# rpm -q pacemaker
>   pacemaker-2.1.2-4.el8.x86_64

Set up a cluster:

>   [root@virt-554 ~]# pcs status
>   Cluster name: STSRHTS26116
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-554 (version 2.1.2-4.el8-ada5c3b36e2) - partition with quorum
>     * Last updated: Fri Jun 24 16:54:59 2022
>     * Last change:  Fri Jun 24 16:35:22 2022 by root via cibadmin on virt-554
>     * 2 nodes configured
>     * 2 resource instances configured

>   Node List:
>     * Online: [ virt-554 virt-555 ]

>   Full List of Resources:
>     * fence-virt-554	(stonith:fence_xvm):	 Started virt-554
>     * fence-virt-555	(stonith:fence_xvm):	 Started virt-555

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Cause fencing to fail:

>   [root@virt-554 ~]# rm -f /etc/cluster/fence_xvm.key
>   [root@virt-554 ~]# pcs stonith fence virt-555
>   Error: unable to fence 'virt-555'
>   stonith_admin: Couldn't fence virt-555: Timer expired (Fencing did not complete within a total timeout based on the configured timeout and retries for any devices attempted)

>   [root@virt-554 ~]# pcs status
>   Cluster name: STSRHTS26116
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-554 (version 2.1.2-4.el8-ada5c3b36e2) - partition with quorum
>     * Last updated: Fri Jun 24 17:00:52 2022
>     * Last change:  Fri Jun 24 16:35:22 2022 by root via cibadmin on virt-554
>     * 2 nodes configured
>     * 2 resource instances configured

>   Node List:
>     * Online: [ virt-554 virt-555 ]

>   Full List of Resources:
>     * fence-virt-554	(stonith:fence_xvm):	 Started virt-555
>     * fence-virt-555	(stonith:fence_xvm):	 Started virt-555

>   Failed Resource Actions:
>     * fence-virt-554_start_0 on virt-554 'error' (1): call=20, status='Timed Out', exitreason='Fence agent did not complete in time', last-rc-change='Fri Jun 24 16:55:40 2022', queued=0ms, exec=20007ms
>     * fence-virt-555_start_0 on virt-554 'error' (1): call=24, status='Timed Out', exitreason='Fence agent did not complete in time', last-rc-change='Fri Jun 24 16:56:01 2022', queued=0ms, exec=28970ms

>   Failed Fencing Actions:
>     * reboot of virt-555 failed (Fencing did not complete within a total timeout based on the configured timeout and retries for any devices attempted): delegate=virt-554, client=pacemaker-controld.52140, origin=virt-554, last-failed='2022-06-24 16:57:32 +02:00'

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Run `pcs config`:

>   [root@virt-554 ~]# pcs config
>   Cluster Name: STSRHTS26116
>   Error: cannot load cluster status, xml does not conform to the schema


Result: Error: cannot load cluster status, xml does not conform to the schema



after fix
----------

>   [root@virt-550 ~]# rpm -q pacemaker
>   pacemaker-2.1.3-2.el8.x86_64

Set up a cluster:

>   [root@virt-550 ~]# pcs status
>   Cluster name: STSRHTS18729
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-550 (version 2.1.3-2.el8-da2fd79c89) - partition with quorum
>     * Last updated: Fri Jun 24 18:03:25 2022
>     * Last change:  Fri Jun 24 18:02:21 2022 by root via cibadmin on virt-550
>     * 2 nodes configured
>     * 2 resource instances configured

>   Node List:
>     * Online: [ virt-550 virt-551 ]

>   Full List of Resources:
>     * fence-virt-550	(stonith:fence_xvm):	 Started virt-550
>     * fence-virt-551	(stonith:fence_xvm):	 Started virt-551

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Cause fencing to fail:

>   [root@virt-550 ~]# rm -f /etc/cluster/fence_xvm.key
>   [root@virt-550 ~]# pcs stonith fence virt-551
>   Error: unable to fence 'virt-551'
>   stonith_admin: Couldn't fence virt-551: Timer expired

>   [root@virt-550 ~]# pcs status
>   Cluster name: STSRHTS18729
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-550 (version 2.1.3-2.el8-da2fd79c89) - partition with quorum
>     * Last updated: Fri Jun 24 18:06:35 2022
>     * Last change:  Fri Jun 24 18:02:21 2022 by root via cibadmin on virt-550
>     * 2 nodes configured
>     * 2 resource instances configured

>   Node List:
>     * Online: [ virt-550 virt-551 ]

>   Full List of Resources:
>     * fence-virt-550	(stonith:fence_xvm):	 Started virt-551
>     * fence-virt-551	(stonith:fence_xvm):	 Started virt-551

>   Failed Resource Actions:
>     * fence-virt-551_start_0 on virt-550 'error' (1): call=131, status='Timed Out', exitreason='Fence agent did not complete within 20s', last-rc-change='Fri Jun 24 18:04:14 2022', queued=0ms, exec=48126ms
>     * fence-virt-550_start_0 on virt-550 'error' (1): call=127, status='Timed Out', exitreason='Fence agent did not complete within 20s', last-rc-change='Fri Jun 24 18:03:54 2022', queued=0ms, exec=20006ms

>   Failed Fencing Actions:
>     * reboot of virt-551 failed: delegate=virt-550, client=pacemaker-controld.2891, origin=virt-550, last-failed='2022-06-24 18:06:05 +02:00' 
>     * reboot of virt-551 failed: delegate=virt-550, client=stonith_admin.8067, origin=virt-550, last-failed='2022-06-24 18:06:05 +02:00'

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Run `pcs config`:

>   [root@virt-550 ~]# pcs config
>   Cluster Name: STSRHTS18729
>   Corosync Nodes:
>    virt-550 virt-551
>   Pacemaker Nodes:
>    virt-550 virt-551

>   Resources:

>   Stonith Devices:
>     Resource: fence-virt-550 (class=stonith type=fence_xvm)
>       Attributes: fence-virt-550-instance_attributes
>         delay=5
>         pcmk_host_check=static-list
>         pcmk_host_list=virt-550
>         pcmk_host_map=virt-550:virt-550.cluster-qe.lab.eng.brq.redhat.com
>       Operations:
>         monitor: fence-virt-550-monitor-interval-60s
>           interval=60s
>     Resource: fence-virt-551 (class=stonith type=fence_xvm)
>       Attributes: fence-virt-551-instance_attributes
>         pcmk_host_check=static-list
>         pcmk_host_list=virt-551
>         pcmk_host_map=virt-551:virt-551.cluster-qe.lab.eng.brq.redhat.com
>       Operations:
>         monitor: fence-virt-551-monitor-interval-60s
>           interval=60s
>   Fencing Levels:

>   Location Constraints:
>   Ordering Constraints:
>   Colocation Constraints:
>   Ticket Constraints:

>   Alerts:
>    Alert: forwarder (path=/usr/tests/sts-rhel8.7/pacemaker/alerts/alert_forwarder.py)
>     Recipients:
>      Recipient: forwarder-recipient (value=http://do.not.start.xmlrpc)

>   Resources Defaults:
>     No defaults set
>   Operations Defaults:
>     No defaults set

>   Cluster Properties:
>    cluster-infrastructure: corosync
>    cluster-name: STSRHTS18729
>    dc-version: 2.1.3-2.el8-da2fd79c89
>    have-watchdog: false
>    last-lrm-refresh: 1656086241

>   Tags:
>    No tags defined

>   Quorum:
>     Options:


Result: pcs config is displayed without error


marking verified in pacemaker-2.1.3-2.el8

Comment 17 errata-xmlrpc 2022-11-08 09:42:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7573