1931332 – pacemaker generates invalid xml for certain failures

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1931332 - pacemaker generates invalid xml for certain failures

Summary: pacemaker generates invalid xml for certain failures

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	8.2
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	8.4
Assignee:	Chris Lumens
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1939533 1939534
TreeView+	depends on / blocked

Reported:	2021-02-22 08:11 UTC by Michele Baldessari
Modified:	2021-05-18 15:28 UTC (History)
CC List:	7 users (show)
Fixed In Version:	pacemaker-2.0.5-9.el8
Doc Type:	Bug Fix
Doc Text:	Cause: Pacemaker can record certain resource operation history entries with a fake call ID of -1, but the status schema restricts the field to a nonnegative integer. Consequence: When such history is present, configuration changes via pcs fail with a "does not conform to the schema" message. Fix: The schema now allows any integer. Result: These history entries do not prevent further configuration commands.
Clone Of:
Clones:	1939533 1939534 (view as bug list)
Environment:
Last Closed:	2021-05-18 15:26:45 UTC
Type:	Bug
Target Upstream Version:	2.1.0
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Michele Baldessari 2021-02-22 08:11:32 UTC

Description of problem:
Sometimes we have seen CI failures when running pcs resource disable --wait haproxy-bundle. The error is:
2021-02-22 04:38:20.472281 | 525400f6-9caf-5594-4294-000000009626 | TASK | Run pacemaker restart if the config file for the service changed
2021-02-22 04:38:23.852339 | 525400f6-9caf-5594-4294-000000009626 | FATAL | Run pacemaker restart if the config file for the service changed | controller-0 | error={"changed": false, "error": "Failed running command", "msg": "Error running /var/lib/container-config-scripts/pacemaker_restart_bundle.sh haproxy haproxy-bundle haproxy-bundle Started. rc: 1, stdout: Mon Feb 22 04:38:21 UTC 2021: Restarting haproxy-bundle globally. Stopping:\nMon Feb 22 04:38:22 UTC 2021: Restarting haproxy-bundle globally. Starting:\n, stderr: Error: cannot load cluster status, xml does not conform to the schema\nError: cannot load cluster status, xml does not conform to the schema\n"}
2021-02-22 04:38:23.853052 | 525400f6-9caf-5594-4294-000000009626 | TIMING | tripleo_ha_wrapper : Run pacemaker restart if the config file for the service changed | controller-0 | 0:21:30.242154 | 3.38s 


Damien and I investigated this a bit more in depth and we can see that pcs calls both /usr/sbin/cibadmin --local --query and /usr/sbin/crm_mon --one-shot --as-xml --inactive successfully.

The problem comes when pcs tries to validate the output of the former crm_mon command. In fact when we do the following we can reproduce:
/usr/sbin/crm_mon --one-shot --as-xml --inactive 2>&1 | tee /tmp/crm
xmllint /tmp/crm --relaxng /usr/share/pacemaker/crm_mon.rng -
/tmp/crm:702: element failure: Relax-NG validity error : Type nonNegativeInteger doesn't allow value '-1'
Relax-NG validity error : Element failure failed to validate attributes
/tmp/crm:702: element failure: Relax-NG validity error : Invalid attribute call for element failure
/tmp/crm fails to validate

The issue seems to be the following bit of XML:
  <failures>
    <failure op_key="stonith-fence_compute-fence-nova_monitor_60000" node="database-2" exitstatus="error" exitreason="" exitcode="1" call="-1" status="Timed Out" last-rc-change="2021-02-22 02:44:36Z" queued="0" exec="0" interval="60000" task="monitor"/>
  </failures>

Specifically it is the 'call="-1"'. If we replace that to some other positive value the xml validates without issues.

Full logs, crm output, pcs debug output + pcs straced can be found here:
http://file.rdu.redhat.com/~mbaldess/pcmk-xml-validation-error/

Comment 1 Michele Baldessari 2021-02-22 08:13:20 UTC

I forgot to mention. This is with pacemaker-2.0.5-8 but I don't think this is a regression. The reason we saw it now and not really before is likely because lately we're trying to move all pcs commands on the host (as opposed to be run from ephemeral containers)

Comment 2 Reid Wahl 2021-02-22 22:07:20 UTC

The failure matches this history item.

            <lrm_rsc_op id="stonith-fence_compute-fence-nova_last_failure_0" operation_key="stonith-fence_compute-fence-nova_monitor_60000" operation="monitor" crm-debug-origin="controld_record_action_timeout" crm_feature_set="3.7.1" transition-key="369:6:0:e29c9949-abfc-4abb-8c56-c9e1be79db9f" transition-magic="2:1;369:6:0:e29c9949-abfc-4abb-8c56-c9e1be79db9f" exit-reason="" on_node="database-2" call-id="-1" rc-code="1" op-status="2" interval="60000" last-rc-change="1613961876" last-run="1613961591" exec-time="0" queue-time="0" op-digest="86d61bd5ef02bd9d140302b716cd006c"/>


In controld_record_action_timeout, we set op->call_id to -1 and then call pcmk__create_history_xml to build a history item out of it.

~~~
void
controld_record_action_timeout(crm_action_t *action)
{
...
    /* If the executor gets a timeout while waiting for the action to complete,
     * that will be reported via the usual callback. This timeout means that we
     * didn't hear from the executor or the controller that relayed the action
     * to the executor.
     *
     * @TODO Using PCMK_OCF_UNKNOWN_ERROR instead of PCMK_OCF_TIMEOUT is one way
     * to distinguish those situations, but perhaps PCMK_OCF_TIMEOUT would be
     * preferable anyway.
     */
    op = convert_graph_action(NULL, action, PCMK_LRM_OP_TIMEOUT,
                              PCMK_OCF_UNKNOWN_ERROR);
    op->call_id = -1;
    op->user_data = pcmk__transition_key(transition_graph->id, action->id,
                                         target_rc, te_uuid);

    xml_op = pcmk__create_history_xml(rsc, op, CRM_FEATURE_SET, target_rc,
                                      target, __func__, LOG_INFO);
~~~

~~~
/*!
 * \internal
 * \brief Create XML for resource operation history update
...
 */
xmlNode *
pcmk__create_history_xml(xmlNode *parent, lrmd_event_data_t *op,
                         const char *caller_version, int target_rc,
                         const char *node, const char *origin, int level)
{
...
    crm_xml_add_int(xml_op, XML_LRM_ATTR_CALLID, op->call_id);
~~~

----------

So it sounds like one of two things needs to happen:
  - The schema needs to accept an integer here rather than only a nonNegativeInteger, OR
  - We need to use something else instead of -1 for the call ID (example logic in get_fake_call_id in controld_execd.c comes to mind).


It looks like the schema has only existed for 4 months (https://github.com/ClusterLabs/pacemaker/commit/c718c879), while we've been setting op->call_id to -1 for 10 years (https://github.com/ClusterLabs/pacemaker/blame/master/daemons/controld/controld_te_actions.c#L247), so I suspect we just need to change a line of the crm_mon schema.

(BTW, I don't know if this is intentional or not -- crm_mon.rng references api/crm_mon-2.3.rng rather than api/crm_mon-2.4.rng).

@clumens, can you take a look at this?

Comment 3 Reid Wahl 2021-02-22 22:18:49 UTC

(In reply to Reid Wahl from comment #2)
> It looks like the schema has only existed for 4 months
> (https://github.com/ClusterLabs/pacemaker/commit/c718c879)

More like a year and a half.
  - https://github.com/ClusterLabs/pacemaker/commit/51da3460

I was looking at the wrong schema version. I got confused because both the crm_resource-2.x.rng and crm_mon-2.x.rng files in xml/api contain call as a nonNegativeInteger.

It looks like this may have been a regression in 2.0.3?

Comment 4 Chris Lumens 2021-02-22 22:22:45 UTC

> It looks like the schema has only existed for 4 months
> (https://github.com/ClusterLabs/pacemaker/commit/c718c879), while we've been
> setting op->call_id to -1 for 10 years
> (https://github.com/ClusterLabs/pacemaker/blame/master/daemons/controld/
> controld_te_actions.c#L247), so I suspect we just need to change a line of
> the crm_mon schema.

Yeah, we should just change that to not be a nonNegativeInteger.  I just didn't know a negative value was allowed there.  If it can be negative in other places, might as well change all those at the same time.  I don't know if this requires making new schema files or if we can just fix the ones that already exist.  These files went into public releases.

> (BTW, I don't know if this is intentional or not -- crm_mon.rng references
> api/crm_mon-2.3.rng rather than api/crm_mon-2.4.rng).

Pretty sure that should always reference the latest version of the schema.  api/crm_mon.rng is there to validate the output of "crm_mon --as-xml", which has different outermost XML tags than "crm_mon --output-as xml" does.

Comment 8 Ken Gaillot 2021-03-01 17:04:51 UTC

(In reply to Reid Wahl from comment #2)
> So it sounds like one of two things needs to happen:
>   - The schema needs to accept an integer here rather than only a
> nonNegativeInteger, OR
>   - We need to use something else instead of -1 for the call ID (example
> logic in get_fake_call_id in controld_execd.c comes to mind).

Or both.

The schema change is the most important, and sufficient for this bz's purposes.

We use -1 for the call ID for pending operations as well (in daemons/controld/controld_execd.c:record_pending_op()), and that's a reasonable value there. I think it would actually make more sense to use unsigned call IDs, with 0 for pending and positive for completed, but the current design is reasonable and probably not worth changing for compatibility reasons.

However I'm not sure using either -1 in controld_record_action_timeout() or an incremented ID in get_fake_call_id() is ideal. I would think controld_record_action_timeout() is susceptible to the issue described in get_fake_call_id(), but I'm not sure why it's never been reported as an issue in practice. On the other hand, I would think get_fake_call_id() could possibly result in a duplicate call ID with the next successful call from the same executor connection; that's likely not a problem because (IIRC) if call IDs are identical then timestamps are used for comparing two history entries. In any case, we can't use the logic from get_fake_call_id() in controld_record_action_timeout() because the executor connection may not be available at that point.

Bottom line, I'm not sure what should be done about the -1 itself, but the schema change is definitely worthwhile and sufficient for this bz.

Comment 9 Ken Gaillot 2021-03-01 17:06:37 UTC

(In reply to Chris Lumens from comment #4)
> > (BTW, I don't know if this is intentional or not -- crm_mon.rng references
> > api/crm_mon-2.3.rng rather than api/crm_mon-2.4.rng).
> 
> Pretty sure that should always reference the latest version of the schema. 
> api/crm_mon.rng is there to validate the output of "crm_mon --as-xml", which
> has different outermost XML tags than "crm_mon --output-as xml" does.

The best fix for that would be to dynamically generate crm_mon.rng rather than have it static in the repo. Eventually we want to remove it, but that's a ways in the future ...

Comment 11 Chris Lumens 2021-03-02 16:59:58 UTC

Potential fix:  https://github.com/ClusterLabs/pacemaker/pull/2316

Comment 15 pkomarov 2021-03-09 11:26:19 UTC

Tested and verified

Comment 19 errata-xmlrpc 2021-05-18 15:26:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:1782

Note You need to log in before you can comment on or make changes to this bug.