Bug 2086230
| Summary: | crm_mon API result does not validate against schema if fence event has exit-reason | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Reid Wahl <nwahl> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 8.6 | CC: | cluster-maint, hbiswas, kgaillot, msmazova, sbradley, slevine |
| Target Milestone: | rc | Keywords: | Regression, Triaged |
| Target Release: | 8.7 | Flags: | pm-rhel: mirror+ |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | pacemaker-2.1.3-1.el8 | Doc Type: | Bug Fix |
| Doc Text: | Cause: Pacemaker's XML schema for command-line tool output did not include the latest changes in possible output for fencing events. Consequence: Some pcs commands could fail with an XML schema error. Fix: The XML schema has been brought up to date. Result: pcs commands do not fail with an XML schema error. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-11-08 09:42:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
A workaround (in testing so far) is to simply clear the fencing history: `pcs stonith history cleanup`. I flagged this as urgent because it is a regression. Feel free to lower it, since there seems to be a workaround.

Fixed upstream as of commit f4e5f094.

before fix
----------

> [root@virt-554 ~]# rpm -q pacemaker
> pacemaker-2.1.2-4.el8.x86_64

Set up a cluster:

> [root@virt-554 ~]# pcs status
> Cluster name: STSRHTS26116
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-554 (version 2.1.2-4.el8-ada5c3b36e2) - partition with quorum
>   * Last updated: Fri Jun 24 16:54:59 2022
>   * Last change: Fri Jun 24 16:35:22 2022 by root via cibadmin on virt-554
>   * 2 nodes configured
>   * 2 resource instances configured
> Node List:
>   * Online: [ virt-554 virt-555 ]
> Full List of Resources:
>   * fence-virt-554 (stonith:fence_xvm): Started virt-554
>   * fence-virt-555 (stonith:fence_xvm): Started virt-555
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled

Cause fencing to fail:

> [root@virt-554 ~]# rm -f /etc/cluster/fence_xvm.key
> [root@virt-554 ~]# pcs stonith fence virt-555
> Error: unable to fence 'virt-555'
> stonith_admin: Couldn't fence virt-555: Timer expired (Fencing did not complete within a total timeout based on the configured timeout and retries for any devices attempted)

> [root@virt-554 ~]# pcs status
> Cluster name: STSRHTS26116
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-554 (version 2.1.2-4.el8-ada5c3b36e2) - partition with quorum
>   * Last updated: Fri Jun 24 17:00:52 2022
>   * Last change: Fri Jun 24 16:35:22 2022 by root via cibadmin on virt-554
>   * 2 nodes configured
>   * 2 resource instances configured
> Node List:
>   * Online: [ virt-554 virt-555 ]
> Full List of Resources:
>   * fence-virt-554 (stonith:fence_xvm): Started virt-555
>   * fence-virt-555 (stonith:fence_xvm): Started virt-555
> Failed Resource Actions:
>   * fence-virt-554_start_0 on virt-554 'error' (1): call=20, status='Timed Out', exitreason='Fence agent did not complete in time', last-rc-change='Fri Jun 24 16:55:40 2022', queued=0ms, exec=20007ms
>   * fence-virt-555_start_0 on virt-554 'error' (1): call=24, status='Timed Out', exitreason='Fence agent did not complete in time', last-rc-change='Fri Jun 24 16:56:01 2022', queued=0ms, exec=28970ms
> Failed Fencing Actions:
>   * reboot of virt-555 failed (Fencing did not complete within a total timeout based on the configured timeout and retries for any devices attempted): delegate=virt-554, client=pacemaker-controld.52140, origin=virt-554, last-failed='2022-06-24 16:57:32 +02:00'
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled

Run `pcs config`:

> [root@virt-554 ~]# pcs config
> Cluster Name: STSRHTS26116
> Error: cannot load cluster status, xml does not conform to the schema

Result: Error: cannot load cluster status, xml does not conform to the schema

after fix
---------

> [root@virt-550 ~]# rpm -q pacemaker
> pacemaker-2.1.3-2.el8.x86_64

Set up a cluster:

> [root@virt-550 ~]# pcs status
> Cluster name: STSRHTS18729
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-550 (version 2.1.3-2.el8-da2fd79c89) - partition with quorum
>   * Last updated: Fri Jun 24 18:03:25 2022
>   * Last change: Fri Jun 24 18:02:21 2022 by root via cibadmin on virt-550
>   * 2 nodes configured
>   * 2 resource instances configured
> Node List:
>   * Online: [ virt-550 virt-551 ]
> Full List of Resources:
>   * fence-virt-550 (stonith:fence_xvm): Started virt-550
>   * fence-virt-551 (stonith:fence_xvm): Started virt-551
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled

Cause fencing to fail:

> [root@virt-550 ~]# rm -f /etc/cluster/fence_xvm.key
> [root@virt-550 ~]# pcs stonith fence virt-551
> Error: unable to fence 'virt-551'
> stonith_admin: Couldn't fence virt-551: Timer expired

> [root@virt-550 ~]# pcs status
> Cluster name: STSRHTS18729
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-550 (version 2.1.3-2.el8-da2fd79c89) - partition with quorum
>   * Last updated: Fri Jun 24 18:06:35 2022
>   * Last change: Fri Jun 24 18:02:21 2022 by root via cibadmin on virt-550
>   * 2 nodes configured
>   * 2 resource instances configured
> Node List:
>   * Online: [ virt-550 virt-551 ]
> Full List of Resources:
>   * fence-virt-550 (stonith:fence_xvm): Started virt-551
>   * fence-virt-551 (stonith:fence_xvm): Started virt-551
> Failed Resource Actions:
>   * fence-virt-551_start_0 on virt-550 'error' (1): call=131, status='Timed Out', exitreason='Fence agent did not complete within 20s', last-rc-change='Fri Jun 24 18:04:14 2022', queued=0ms, exec=48126ms
>   * fence-virt-550_start_0 on virt-550 'error' (1): call=127, status='Timed Out', exitreason='Fence agent did not complete within 20s', last-rc-change='Fri Jun 24 18:03:54 2022', queued=0ms, exec=20006ms
> Failed Fencing Actions:
>   * reboot of virt-551 failed: delegate=virt-550, client=pacemaker-controld.2891, origin=virt-550, last-failed='2022-06-24 18:06:05 +02:00'
>   * reboot of virt-551 failed: delegate=virt-550, client=stonith_admin.8067, origin=virt-550, last-failed='2022-06-24 18:06:05 +02:00'
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled

Run `pcs config`:

> [root@virt-550 ~]# pcs config
> Cluster Name: STSRHTS18729
> Corosync Nodes:
>   virt-550 virt-551
> Pacemaker Nodes:
>   virt-550 virt-551
> Resources:
> Stonith Devices:
>   Resource: fence-virt-550 (class=stonith type=fence_xvm)
>     Attributes: fence-virt-550-instance_attributes
>       delay=5
>       pcmk_host_check=static-list
>       pcmk_host_list=virt-550
>       pcmk_host_map=virt-550:virt-550.cluster-qe.lab.eng.brq.redhat.com
>     Operations:
>       monitor: fence-virt-550-monitor-interval-60s
>         interval=60s
>   Resource: fence-virt-551 (class=stonith type=fence_xvm)
>     Attributes: fence-virt-551-instance_attributes
>       pcmk_host_check=static-list
>       pcmk_host_list=virt-551
>       pcmk_host_map=virt-551:virt-551.cluster-qe.lab.eng.brq.redhat.com
>     Operations:
>       monitor: fence-virt-551-monitor-interval-60s
>         interval=60s
> Fencing Levels:
> Location Constraints:
> Ordering Constraints:
> Colocation Constraints:
> Ticket Constraints:
> Alerts:
>   Alert: forwarder (path=/usr/tests/sts-rhel8.7/pacemaker/alerts/alert_forwarder.py)
>     Recipients:
>       Recipient: forwarder-recipient (value=http://do.not.start.xmlrpc)
> Resources Defaults:
>   No defaults set
> Operations Defaults:
>   No defaults set
> Cluster Properties:
>   cluster-infrastructure: corosync
>   cluster-name: STSRHTS18729
>   dc-version: 2.1.3-2.el8-da2fd79c89
>   have-watchdog: false
>   last-lrm-refresh: 1656086241
> Tags:
>   No tags defined
> Quorum:
>   Options:

Result: pcs config is displayed without error.

Marking verified in pacemaker-2.1.3-2.el8.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7573
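
For a cluster still running an affected build, the workaround path can be applied and checked from the shell. This is a minimal sketch using only the commands quoted above; the ordering and comments are illustrative:

```
# Show the failed fencing history; on an affected build, the entries that
# carry an exit reason (the text in parentheses) are what the schema rejects.
crm_mon --one-shot --inactive --exclude=all --include=fencing-failed

# Workaround: clear the fencing history, then confirm pcs works again.
pcs stonith history cleanup
pcs config
```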
Description of problem:

If a fence_event element contains an exit-reason attribute, the `crm_mon` XML output does not validate. This is because the crm_mon API schema was not updated when the fence-event schema was updated.

```
[reid@laptop api]$ ls fence-event*
fence-event-2.0.rng  fence-event-2.15.rng

[reid@laptop api]$ diff fence-event-2.*
21a22,24
> <attribute name="exit-reason"> <text /> </attribute>
> </optional>
> <optional>

[reid@laptop api]$ ls crm_mon*
crm_mon-2.0.rng  crm_mon-2.12.rng  crm_mon-2.13.rng  crm_mon-2.1.rng  crm_mon-2.2.rng  crm_mon-2.3.rng  crm_mon-2.4.rng  crm_mon-2.7.rng  crm_mon-2.8.rng  crm_mon-2.9.rng

[reid@laptop api]$ grep fence-event crm_mon-2.13.rng
<ref name="fence-event-list" />
<define name="fence-event-list">
<externalRef href="fence-event-2.0.rng" />
```

A customer reported that a rolling upgrade from 2.1.0-8 to 2.1.2-4 caused some pcs commands (e.g., standby/unstandby, config) to fail on the upgraded node with an XML schema error. It turns out the issue is triggered when a fencing failure is present in the history at the time of the rolling upgrade, because that leaves a fencing failure with an exit-reason in the history.

The following gave me a clue on how to reproduce the customer's symptoms, as the customer's fencing history contained the exit_reason below:

```
void
stonith__xe_get_result(xmlNode *xml, pcmk__action_result_t *result)
{
    ...
    /* @COMPAT Peers <=2.1.2 in rolling upgrades provide only a legacy
     * return code, not a full result, so check for that.
     */
    if (crm_element_value_int(xml, F_STONITH_RC, &rc) == 0) {
        if ((rc == pcmk_ok) || (rc == -EINPROGRESS)) {
            exit_status = CRM_EX_OK;
        }
        execution_status = stonith__legacy2status(rc);
        exit_reason = pcmk_strerror(rc);
    } else {
        execution_status = PCMK_EXEC_ERROR;
        exit_reason = "Fencer reply contained neither a full result "
                      "nor a legacy return code (bug?)";
    }
```

-----

Version-Release number of selected component (if applicable):

pacemaker-2.1.2-4.el8

-----

How reproducible:

Always

-----

Steps to Reproduce:

There may be a simpler way; this is the only way I've found to produce the symptoms so far.

1. Cause a fencing failure in a cluster running pacemaker-2.1.0. (Or inject a failure, if there's a way to do so.)
2. Perform a rolling update to pacemaker-2.1.2 on one node: put the node in standby -> stop and disable the cluster on the node -> update pacemaker -> reboot -> start the cluster on the node.
3. Run `pcs config` and `pcs node unstandby` on the upgraded node.

-----

Actual results:

Error: cannot load cluster status, xml does not conform to the schema

-----

Expected results:

No error
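
The schema mismatch described above can also be checked on an installed system rather than a source checkout. A minimal sketch, assuming the API schemas are packaged under /usr/share/pacemaker/api (the directory, and which schema versions are present, depend on the installed pacemaker-schemas build):

```
# List the packaged API schemas; fence-event-2.15.rng is the revision that
# added the optional exit-reason attribute.
ls /usr/share/pacemaker/api/fence-event-*.rng /usr/share/pacemaker/api/crm_mon-*.rng

# The newest crm_mon schema still pulls in fence-event-2.0.rng, which does
# not allow exit-reason, hence the validation failure.
grep -n 'fence-event' /usr/share/pacemaker/api/crm_mon-2.13.rng
```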
-----

Full demonstration:

```
[root@fastvm-rhel-8-0-23 pacemaker]# date && rpm -q pacemaker
Sat May 14 14:00:08 PDT 2022
pacemaker-2.1.0-8.el8.x86_64

[root@fastvm-rhel-8-0-24 pcs]# date && rpm -q pacemaker
Sat May 14 14:00:21 PDT 2022
pacemaker-2.1.0-8.el8.x86_64

# # Created a stonith device that contains the node in pcmk_host_list but is incapable of rebooting the node.
# # You can probably produce the same issue (with "No such device" instead of the "Fencer reply..." error message)
# # by having no stonith device associated with the node.
[root@fastvm-rhel-8-0-23 pacemaker]# pcs stonith create vmfence fence_vmware_rest pcmk_host_list=node2 <other options>

[root@fastvm-rhel-8-0-23 pacemaker]# pcs stonith fence node2
Error: unable to fence 'node2'

[root@fastvm-rhel-8-0-23 pacemaker]# pcs stonith confirm node2
WARNING: If node node2 is not powered off or it does have access to shared resources, data corruption and/or cluster failure may occur.
Are you sure you want to continue? [y/N] y
Node: node2 confirmed fenced

[root@fastvm-rhel-8-0-23 pacemaker]# crm_mon --one-shot --inactive --exclude=all --include=fencing-failed
Failed Fencing Actions:
  * reboot of node2 failed: delegate=node1, client=stonith_admin.41436, origin=node1, last-failed='2022-05-14 14:03:20 -07:00'

[root@fastvm-rhel-8-0-24 pcs]# pcs cluster start
Starting Cluster...
[root@fastvm-rhel-8-0-24 pcs]# pcs node standby
[root@fastvm-rhel-8-0-24 pcs]# pcs cluster stop
Stopping Cluster (pacemaker)...
Stopping Cluster (corosync)...
[root@fastvm-rhel-8-0-24 pcs]# pcs cluster disable
[root@fastvm-rhel-8-0-24 pcs]# yum -y update pacemaker pcs
...
Upgraded:
  pacemaker-2.1.2-4.el8.x86_64
  pacemaker-cli-2.1.2-4.el8.x86_64
  pacemaker-cluster-libs-2.1.2-4.el8.x86_64
  pacemaker-libs-2.1.2-4.el8.x86_64
  pacemaker-schemas-2.1.2-4.el8.noarch
  pcs-0.10.12-6.el8.x86_64
Complete!

[root@fastvm-rhel-8-0-24 pcs]# date && systemctl reboot
Sat May 14 14:07:39 PDT 2022

[root@fastvm-rhel-8-0-24 ~]# date && rpm -q pacemaker
Sat May 14 14:08:04 PDT 2022
pacemaker-2.1.2-4.el8.x86_64

[root@fastvm-rhel-8-0-24 ~]# pcs cluster start
Starting Cluster...

[root@fastvm-rhel-8-0-24 ~]# crm_mon --one-shot --inactive --exclude=all --include=fencing-failed
Failed Fencing Actions:
  * reboot of node2 failed (Fencer reply contained neither a full result nor a legacy return code (bug?)): delegate=node1, client=stonith_admin.41436, origin=node1, last-failed='2022-05-14 14:03:20 -07:00'

[root@fastvm-rhel-8-0-24 ~]# pcs config
Cluster Name: testcluster
Error: cannot load cluster status, xml does not conform to the schema

[root@fastvm-rhel-8-0-24 ~]# pcs node unstandby
Error: cannot load cluster status, xml does not conform to the schema
```
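
pcs only reports the generic "xml does not conform to the schema" error. To see exactly which element trips validation, the XML that pcs consumes can be validated directly against the installed API schema. A minimal sketch, assuming the top-level schema is installed as /usr/share/pacemaker/api/api-result-2.13.rng (both the path and the version number are assumptions and vary by build):

```
# Dump the status XML, making sure the failed fencing history is included.
crm_mon --one-shot --inactive --include=fencing-failed --output-as=xml > /tmp/crm_mon.xml

# Validate against the installed API schema; on an affected build this points
# at the fence_event element whose exit-reason attribute the schema rejects.
xmllint --noout --relaxng /usr/share/pacemaker/api/api-result-2.13.rng /tmp/crm_mon.xml
```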