Bug 1931332
| Summary: | pacemaker generates invalid xml for certain failures | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Michele Baldessari <michele> | |
| Component: | pacemaker | Assignee: | Chris Lumens <clumens> | |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | urgent | |||
| Version: | 8.2 | CC: | cfeist, clumens, cluster-maint, jwboyer, kgaillot, nwahl, pkomarov | |
| Target Milestone: | rc | Keywords: | Regression, Triaged, ZStream | |
| Target Release: | 8.4 | Flags: | pm-rhel:
mirror+
|
|
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | pacemaker-2.0.5-9.el8 | Doc Type: | Bug Fix | |
| Doc Text: |
Cause: Pacemaker can record certain resource operation history entries with a fake call ID of -1, but the status schema restricts the field to a nonnegative integer.
Consequence: When such history is present, configuration changes via pcs fail with a "does not conform to the schema" message.
Fix: The schema now allows any integer.
Result: These history entries do not prevent further configuration commands.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 1939533 1939534 (view as bug list) | Environment: | ||
| Last Closed: | 2021-05-18 15:26:45 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | 2.1.0 | |
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1939533, 1939534 | |||
I forgot to mention. This is with pacemaker-2.0.5-8 but I don't think this is a regression. The reason we saw it now and not really before is likely because lately we're trying to move all pcs commands on the host (as opposed to be run from ephemeral containers) The failure matches this history item.
<lrm_rsc_op id="stonith-fence_compute-fence-nova_last_failure_0" operation_key="stonith-fence_compute-fence-nova_monitor_60000" operation="monitor" crm-debug-origin="controld_record_action_timeout" crm_feature_set="3.7.1" transition-key="369:6:0:e29c9949-abfc-4abb-8c56-c9e1be79db9f" transition-magic="2:1;369:6:0:e29c9949-abfc-4abb-8c56-c9e1be79db9f" exit-reason="" on_node="database-2" call-id="-1" rc-code="1" op-status="2" interval="60000" last-rc-change="1613961876" last-run="1613961591" exec-time="0" queue-time="0" op-digest="86d61bd5ef02bd9d140302b716cd006c"/>
In controld_record_action_timeout, we set op->call_id to -1 and then call pcmk__create_history_xml to build a history item out of it.
~~~
void
controld_record_action_timeout(crm_action_t *action)
{
...
/* If the executor gets a timeout while waiting for the action to complete,
* that will be reported via the usual callback. This timeout means that we
* didn't hear from the executor or the controller that relayed the action
* to the executor.
*
* @TODO Using PCMK_OCF_UNKNOWN_ERROR instead of PCMK_OCF_TIMEOUT is one way
* to distinguish those situations, but perhaps PCMK_OCF_TIMEOUT would be
* preferable anyway.
*/
op = convert_graph_action(NULL, action, PCMK_LRM_OP_TIMEOUT,
PCMK_OCF_UNKNOWN_ERROR);
op->call_id = -1;
op->user_data = pcmk__transition_key(transition_graph->id, action->id,
target_rc, te_uuid);
xml_op = pcmk__create_history_xml(rsc, op, CRM_FEATURE_SET, target_rc,
target, __func__, LOG_INFO);
~~~
~~~
/*!
* \internal
* \brief Create XML for resource operation history update
...
*/
xmlNode *
pcmk__create_history_xml(xmlNode *parent, lrmd_event_data_t *op,
const char *caller_version, int target_rc,
const char *node, const char *origin, int level)
{
...
crm_xml_add_int(xml_op, XML_LRM_ATTR_CALLID, op->call_id);
~~~
----------
So it sounds like one of two things needs to happen:
- The schema needs to accept an integer here rather than only a nonNegativeInteger, OR
- We need to use something else instead of -1 for the call ID (example logic in get_fake_call_id in controld_execd.c comes to mind).
It looks like the schema has only existed for 4 months (https://github.com/ClusterLabs/pacemaker/commit/c718c879), while we've been setting op->call_id to -1 for 10 years (https://github.com/ClusterLabs/pacemaker/blame/master/daemons/controld/controld_te_actions.c#L247), so I suspect we just need to change a line of the crm_mon schema.
(BTW, I don't know if this is intentional or not -- crm_mon.rng references api/crm_mon-2.3.rng rather than api/crm_mon-2.4.rng).
@clumens, can you take a look at this?
(In reply to Reid Wahl from comment #2) > It looks like the schema has only existed for 4 months > (https://github.com/ClusterLabs/pacemaker/commit/c718c879) More like a year and a half. - https://github.com/ClusterLabs/pacemaker/commit/51da3460 I was looking at the wrong schema version. I got confused because both the crm_resource-2.x.rng and crm_mon-2.x.rng files in xml/api contain call as a nonNegativeInteger. It looks like this may have been a regression in 2.0.3? > It looks like the schema has only existed for 4 months > (https://github.com/ClusterLabs/pacemaker/commit/c718c879), while we've been > setting op->call_id to -1 for 10 years > (https://github.com/ClusterLabs/pacemaker/blame/master/daemons/controld/ > controld_te_actions.c#L247), so I suspect we just need to change a line of > the crm_mon schema. Yeah, we should just change that to not be a nonNegativeInteger. I just didn't know a negative value was allowed there. If it can be negative in other places, might as well change all those at the same time. I don't know if this requires making new schema files or if we can just fix the ones that already exist. These files went into public releases. > (BTW, I don't know if this is intentional or not -- crm_mon.rng references > api/crm_mon-2.3.rng rather than api/crm_mon-2.4.rng). Pretty sure that should always reference the latest version of the schema. api/crm_mon.rng is there to validate the output of "crm_mon --as-xml", which has different outermost XML tags than "crm_mon --output-as xml" does. (In reply to Reid Wahl from comment #2) > So it sounds like one of two things needs to happen: > - The schema needs to accept an integer here rather than only a > nonNegativeInteger, OR > - We need to use something else instead of -1 for the call ID (example > logic in get_fake_call_id in controld_execd.c comes to mind). Or both. The schema change is the most important, and sufficient for this bz's purposes. We use -1 for the call ID for pending operations as well (in daemons/controld/controld_execd.c:record_pending_op()), and that's a reasonable value there. I think it would actually make more sense to use unsigned call IDs, with 0 for pending and positive for completed, but the current design is reasonable and probably not worth changing for compatibility reasons. However I'm not sure using either -1 in controld_record_action_timeout() or an incremented ID in get_fake_call_id() is ideal. I would think controld_record_action_timeout() is susceptible to the issue described in get_fake_call_id(), but I'm not sure why it's never been reported as an issue in practice. On the other hand, I would think get_fake_call_id() could possibly result in a duplicate call ID with the next successful call from the same executor connection; that's likely not a problem because (IIRC) if call IDs are identical then timestamps are used for comparing two history entries. In any case, we can't use the logic from get_fake_call_id() in controld_record_action_timeout() because the executor connection may not be available at that point. Bottom line, I'm not sure what should be done about the -1 itself, but the schema change is definitely worthwhile and sufficient for this bz. (In reply to Chris Lumens from comment #4) > > (BTW, I don't know if this is intentional or not -- crm_mon.rng references > > api/crm_mon-2.3.rng rather than api/crm_mon-2.4.rng). > > Pretty sure that should always reference the latest version of the schema. > api/crm_mon.rng is there to validate the output of "crm_mon --as-xml", which > has different outermost XML tags than "crm_mon --output-as xml" does. The best fix for that would be to dynamically generate crm_mon.rng rather than have it static in the repo. Eventually we want to remove it, but that's a ways in the future ... Potential fix: https://github.com/ClusterLabs/pacemaker/pull/2316 Tested and verified Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:1782 |
Description of problem: Sometimes we have seen CI failures when running pcs resource disable --wait haproxy-bundle. The error is: 2021-02-22 04:38:20.472281 | 525400f6-9caf-5594-4294-000000009626 | TASK | Run pacemaker restart if the config file for the service changed 2021-02-22 04:38:23.852339 | 525400f6-9caf-5594-4294-000000009626 | FATAL | Run pacemaker restart if the config file for the service changed | controller-0 | error={"changed": false, "error": "Failed running command", "msg": "Error running /var/lib/container-config-scripts/pacemaker_restart_bundle.sh haproxy haproxy-bundle haproxy-bundle Started. rc: 1, stdout: Mon Feb 22 04:38:21 UTC 2021: Restarting haproxy-bundle globally. Stopping:\nMon Feb 22 04:38:22 UTC 2021: Restarting haproxy-bundle globally. Starting:\n, stderr: Error: cannot load cluster status, xml does not conform to the schema\nError: cannot load cluster status, xml does not conform to the schema\n"} 2021-02-22 04:38:23.853052 | 525400f6-9caf-5594-4294-000000009626 | TIMING | tripleo_ha_wrapper : Run pacemaker restart if the config file for the service changed | controller-0 | 0:21:30.242154 | 3.38s Damien and I investigated this a bit more in depth and we can see that pcs calls both /usr/sbin/cibadmin --local --query and /usr/sbin/crm_mon --one-shot --as-xml --inactive successfully. The problem comes when pcs tries to validate the output of the former crm_mon command. In fact when we do the following we can reproduce: /usr/sbin/crm_mon --one-shot --as-xml --inactive 2>&1 | tee /tmp/crm xmllint /tmp/crm --relaxng /usr/share/pacemaker/crm_mon.rng - /tmp/crm:702: element failure: Relax-NG validity error : Type nonNegativeInteger doesn't allow value '-1' Relax-NG validity error : Element failure failed to validate attributes /tmp/crm:702: element failure: Relax-NG validity error : Invalid attribute call for element failure /tmp/crm fails to validate The issue seems to be the following bit of XML: <failures> <failure op_key="stonith-fence_compute-fence-nova_monitor_60000" node="database-2" exitstatus="error" exitreason="" exitcode="1" call="-1" status="Timed Out" last-rc-change="2021-02-22 02:44:36Z" queued="0" exec="0" interval="60000" task="monitor"/> </failures> Specifically it is the 'call="-1"'. If we replace that to some other positive value the xml validates without issues. Full logs, crm output, pcs debug output + pcs straced can be found here: http://file.rdu.redhat.com/~mbaldess/pcmk-xml-validation-error/