Bug 1793653 - crm_mon legacy XML mode should print to stderr, and XML should indicate if fence history was not obtainable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.3
Assignee: Chris Lumens
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-21 18:35 UTC by Nina Hostakova
Modified: 2020-11-04 04:01 UTC (History)
CC List: 11 users

Fixed In Version: pacemaker-2.0.4-2.el8
Doc Type: No Doc Update
Doc Text:
This will be invisible to most users.
Clone Of:
Clones: 1794062 (view as bug list)
Environment:
Last Closed: 2020-11-04 04:00:53 UTC
Type: Enhancement
Target Upstream Version:
Embargoed:



Description Nina Hostakova 2020-01-21 18:35:38 UTC
Description of problem:
When trying to start the cluster with --wait on all nodes at the same time, the cluster reports 'Unable to get node status'. The frequency and the number of nodes returning the error vary.

Version-Release number of selected component (if applicable):
pcs-0.10.4-3.el8.x86_64

How reproducible:
often (not always)

Steps to Reproduce:
1. stop the cluster
[root@virt-038 ~]# pcs cluster stop --all
virt-036: Stopping Cluster (pacemaker)...
virt-037: Stopping Cluster (pacemaker)...
virt-038: Stopping Cluster (pacemaker)...
virt-037: Stopping Cluster (corosync)...
virt-038: Stopping Cluster (corosync)...
virt-036: Stopping Cluster (corosync)...

2. run 'pcs cluster start --wait' on all nodes in parallel

> There are two types of errors occurring (sometimes combined):

A)
[root@virt-038 ~]# pcs cluster start --wait
Starting Cluster...
Waiting for node(s) to start...
Error: Unable to get node status: unable to get local node name from pacemaker: error: Could not connect to cluster (is it running?)

B)
[root@virt-037 ~]# pcs cluster start --wait
Starting Cluster...
Waiting for node(s) to start...
Error: Unable to get node status: cannot load cluster status, xml does not conform to the schema
 
3. check the nodes status afterwards
[root@virt-038 ~]# pcs status nodes
Pacemaker Nodes:
 Online: virt-036 virt-037 virt-038
 Standby:
 Standby with resource(s) running:
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Standby with resource(s) running:
 Maintenance:
 Offline:

Actual results:
Some (or even all) of the nodes sometimes return one of the errors mentioned above. The cluster nodes start anyway.

Expected results:
The cluster starts without errors on all nodes.

Additional info:
This might suggest a problem in pacemaker:

A)
[root@virt-036 ~]# pcs cluster start --wait --debug
...
Finished running: /usr/sbin/crm_node --name
Return value: 102
--Debug Stdout Start--

--Debug Stdout End--
--Debug Stderr Start--
error: Could not connect to cluster (is it running?)

--Debug Stderr End--

Error: Unable to get node status: unable to get local node name from pacemaker: error: Could not connect to cluster (is it running?)

B)
[root@virt-037 ~]# pcs cluster start --wait --debug
...
--Debug Stdout Start--
Connection to the cluster-daemons terminated
Reading stonith-history failed<crm_mon version="2.0.3"/>

--Debug Stdout End--
--Debug Stderr Start--
Critical: Unable to get stonith-history

--Debug Stderr End--

Error: Unable to get node status: cannot load cluster status, xml does not conform to the schema

This seems to be also related to bz1793574

Comment 1 Tomas Jelinek 2020-01-22 09:22:47 UTC
This seems like a pacemaker issue to me. Both error messages come from pacemaker and there is not much we can do about them in pcs.


(In reply to Nina Hostakova from comment #0)
> A)
> [root@virt-036 ~]# pcs cluster start --wait --debug
> ...
> Finished running: /usr/sbin/crm_node --name
> Return value: 102
> --Debug Stdout Start--
> 
> --Debug Stdout End--
> --Debug Stderr Start--
> error: Could not connect to cluster (is it running?)
> 
> --Debug Stderr End--
> 
> Error: Unable to get node status: unable to get local node name from
> pacemaker: error: Could not connect to cluster (is it running?)

This is coming from pacemaker; for some reason it is unable to figure out the local node's name. Not sure if this issue is new or if it has been waiting there for some time to be discovered. Nina can for sure provide the pacemaker versions where this is happening.


> B)
> [root@virt-037 ~]# pcs cluster start --wait --debug
> ...
> --Debug Stdout Start--
> Connection to the cluster-daemons terminated
> Reading stonith-history failed<crm_mon version="2.0.3"/>
> 
> --Debug Stdout End--
> --Debug Stderr Start--
> Critical: Unable to get stonith-history
> 
> --Debug Stderr End--
> 
> Error: Unable to get node status: cannot load cluster status, xml does not
> conform to the schema

Unfortunately, QE snipped the output, so we cannot see the command that was run or what its return code was. Nina, can you provide the missing info?

Anyway, it looks like crm_mon --as-xml mixes error / warning messages with the xml output on stdout. It may even exit with return code 0 in this case, which causes pcs to try to parse the stdout as xml instead of exiting with an error. (This is a guess based on the pcs code, since QE did not provide the command or its return code.)
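
To illustrate the guess above: a minimal Python sketch (not pcs code) showing why plain text in front of the XML root element makes the captured stdout unusable. The sample string is the stdout from case B) in the description; pcs's actual parsing and schema validation may report the failure differently.

# Minimal illustration (not pcs code): stray plain text before the XML
# root element makes the captured stdout unusable as an XML document.
import xml.etree.ElementTree as ET

# stdout captured in case B) of the description
captured_stdout = (
    "Connection to the cluster-daemons terminated\n"
    'Reading stonith-history failed<crm_mon version="2.0.3"/>\n'
)

try:
    ET.fromstring(captured_stdout)
except ET.ParseError as err:
    # The text before <crm_mon .../> is rejected by the parser; pcs surfaces
    # a similar condition as "xml does not conform to the schema".
    print("parse failed:", err)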


> This seems to be also related to bz1793574
How?

Comment 2 Ken Gaillot 2020-01-22 15:06:12 UTC
(In reply to Tomas Jelinek from comment #1)
> This seems like a pacemaker issue to me. Both error messages come from
> pacemaker and there is not much we can do about them in pcs.
> 
> 
> (In reply to Nina Hostakova from comment #0)
> > A)
> > [root@virt-036 ~]# pcs cluster start --wait --debug
> > ...
> > Finished running: /usr/sbin/crm_node --name
> > Return value: 102
> > --Debug Stdout Start--
> > 
> > --Debug Stdout End--
> > --Debug Stderr Start--
> > error: Could not connect to cluster (is it running?)
> > 
> > --Debug Stderr End--
> > 
> > Error: Unable to get node status: unable to get local node name from
> > pacemaker: error: Could not connect to cluster (is it running?)
> 
> This is coming from pacemaker, for some reason it is unable to figure out
> the local node's name. Not sure if this issue is new or it was waiting there
> for some time to be discovered. Nina can for sure provide pacemaker versions
> where this is happening.

crm_node contacts the local cluster to get the local node name (which may be different than the hostname), so if the cluster isn't running, it will give that error.

There was a change a while back, where it now contacts the pacemaker controller rather than corosync directly. That allows it to work reliably on remote nodes. But that does mean pacemaker as well as corosync must be running.

There's no reliable way to get the node name if the cluster isn't running, because node names don't have to match host names. If it's a full cluster node, you could probably figure it out from corosync.conf, but remote nodes don't have any way of knowing their own name (if it's different from their hostname) without the cluster connecting to them first.

> > B)
> > [root@virt-037 ~]# pcs cluster start --wait --debug
> > ...
> > --Debug Stdout Start--
> > Connection to the cluster-daemons terminated
> > Reading stonith-history failed<crm_mon version="2.0.3"/>
> > 
> > --Debug Stdout End--
> > --Debug Stderr Start--
> > Critical: Unable to get stonith-history
> > 
> > --Debug Stderr End--
> > 
> > Error: Unable to get node status: cannot load cluster status, xml does not
> > conform to the schema
> 
> Unfortunately, QE snipped the output, so we cannot see the command that has
> been run and what was its return code. Nina, can you provide the missing
> info?
> 
> Anyway, it looks like crm_mon --as-xml mixes error / warning messages with
> xml output on stdout. And it may even exit with return code 0 in this case,
> which causes pcs to try parse the stdout as xml instead of exiting with an
> error. (This is a guess based on pcs code, since QE have not provided the
> command nor its return code.)

The goal of crm_mon XML is to put everything, including error messages, into the XML. However, that's not possible for certain early errors (e.g. argument processing) that occur before the output format has been selected, and there are still a handful of later messages (including this one) that go straight to stderr when they really shouldn't.

I'm not sure this should be labeled a "Critical" error, either. It just means crm_mon can't show the fencing section, so I'm not sure whether it should result in an error exit status or not. It does highlight that the current XML output has no way of indicating partial errors; maybe the <fence_history> tag should have an attribute like available="true"/"false" to tell the difference between "there's no fence history" and "we couldn't get the fence history".

> > This seems to be also related to bz1793574
> How?

I believe this is unrelated.

Comment 3 Ken Gaillot 2020-01-22 15:12:10 UTC
To summarize action items:

1. I think the first error is something pcs needs to address. crm_node is correctly giving an error when the cluster is not running.

2. We can use this BZ to redirect as many error messages as possible to the XML output, and add something like the "available" suggestion to indicate partial failures. pcs would likely need to be modified to look for the new information.

I'll clone this BZ for the pcs end.

Comment 4 Ken Gaillot 2020-01-22 15:15:38 UTC
(In reply to Tomas Jelinek from comment #1)
> (In reply to Nina Hostakova from comment #0)
> > Error: Unable to get node status: cannot load cluster status, xml does not
> > conform to the schema
> 
> Unfortunately, QE snipped the output, so we cannot see the command that has
> been run and what was its return code. Nina, can you provide the missing
> info?

We do still need this info to check whether there's anything missing in the schema. The package version is important, too.

Comment 6 Chris Lumens 2020-02-19 15:50:04 UTC
A patch is in the works.  See https://github.com/ClusterLabs/pacemaker/pull/1990.

Comment 7 Ken Gaillot 2020-02-27 21:11:14 UTC
Tomas, do you want the "available" suggestion from the end of Comment 2 for pcs use? I.e. fence_history in the crm_mon XML output would get an available=yes/no attribute to be able to distinguish "couldn't get fence history" from "no fence history". If that distinction isn't important to you, we won't bother with that part.

Comment 8 Ken Gaillot 2020-02-27 21:20:41 UTC
Tomas, one point worth mentioning:

With the fix, using either --as-xml or --output-as=xml when an error occurs will get you a nonzero exit status, and nothing will be printed to stderr. However, there is a difference: --output-as=xml will give you a "status" element at the end with an error message, whereas --as-xml won't give you anything further.

Comment 9 Tomas Jelinek 2020-02-28 11:48:43 UTC
(In reply to Ken Gaillot from comment #7)
> Tomas, do you want the "available" suggestion from the end of Comment 2 for
> pcs use? I.e. fence_history in the crm_mon XML output would get an
> available=yes/no attribute to be able to distinguish "couldn't get fence
> history" from "no fence history". If that distinction isn't important to
> you, we won't bother with that part.

It's not important for pcs right now. However, I think it's the right way to deal with this issue, and the attribute may come in very handy in the future. I vote for implementing it even if it won't be done right now.



(In reply to Ken Gaillot from comment #8)
> Tomas, one point worth mentioning:
> 
> With the fix, using either --as-xml or --output-as=xml when an error occurs
> will get you a nonzero exit status, and nothing will be printed to stderr.
> However there will be a difference, --output-as=xml will give you a "status"
> element at the end with an error message, whereas --as-xml won't give you
> anything further.

Actually, printing to stderr is not an issue for pcs; it's actually welcome. Printing plain text to stdout and returning 0 is the issue. Currently, pcs works like this:
* Run 'crm_mon --as-xml'.
* If return code == 1, exit with an error "unable to get cluster status" including crm_mon's stderr and stdout. This gives a comprehensible message and includes detailed information coming from pacemaker.
* If return code == 0, parse crm_mon's stdout as xml to get the cluster status.

If the error only goes to the xml, we must change the logic in pcs to be able to get the error and stay compatible with both before-fix and after-fix pacemaker. Something like this (see the sketch at the end of this comment):
* If crm_mon doesn't support --output-as=xml, use the logic described above.
* Else:
  * Run 'crm_mon --output-as=xml'.
  * If return code == 0, parse crm_mon's stdout as xml to get the cluster status.
  * If return code == 1, and crm_mon's stdout is xml, parse it, get an error message from there and exit with the error.
  * If return code == 1, and crm_mon's stdout is not xml, get an error message from stderr (even if it's empty) and exit with the error.

I'm trying to figure out what the pacemaker change means for pcs. Let me know if I got it right. Thanks.
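
Not actual pcs code, just a rough Python sketch of the branching described above (using the zero/nonzero check as corrected later in this thread); helper and variable names are made up for illustration.

# Rough sketch of the logic above; names and structure are hypothetical and
# do not come from the pcs code base.
import subprocess
import xml.etree.ElementTree as ET


def get_cluster_status_xml(new_output_supported: bool) -> ET.Element:
    if not new_output_supported:
        # Legacy path: 'crm_mon --as-xml', decide purely on the exit code.
        proc = subprocess.run(["crm_mon", "--as-xml"],
                              capture_output=True, text=True)
        if proc.returncode != 0:
            raise RuntimeError("unable to get cluster status: "
                               + proc.stderr + proc.stdout)
        return ET.fromstring(proc.stdout)

    # New path: 'crm_mon --output-as=xml'.
    proc = subprocess.run(["crm_mon", "--output-as=xml"],
                          capture_output=True, text=True)
    if proc.returncode == 0:
        return ET.fromstring(proc.stdout)

    try:
        # On error the output should still be well-formed XML carrying a
        # <status> element with the error message.
        result = ET.fromstring(proc.stdout)
    except ET.ParseError:
        # Very early errors (e.g. bad arguments) may not produce XML at all.
        raise RuntimeError(proc.stderr or "unable to get cluster status")
    status = result.find("status")
    message = status.get("message") if status is not None else None
    raise RuntimeError(message or "unable to get cluster status")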

Comment 10 Ken Gaillot 2020-02-28 16:15:25 UTC
(In reply to Tomas Jelinek from comment #9)
> (In reply to Ken Gaillot from comment #7)
> > Tomas, do you want the "available" suggestion from the end of Comment 2 for
> > pcs use? I.e. fence_history in the crm_mon XML output would get an
> > available=yes/no attribute to be able to distinguish "couldn't get fence
> > history" from "no fence history". If that distinction isn't important to
> > you, we won't bother with that part.
> 
> It's not important for pcs right now. However, I think it's the right way to
> deal with this issue. And the attribute may come very handy in the future. I
> vote for implementing it even if it won't be done right now.

Sounds good, we'll do it with this bz


> (In reply to Ken Gaillot from comment #8)
> > Tomas, one point worth mentioning:
> > 
> > With the fix, using either --as-xml or --output-as=xml when an error occurs
> > will get you a nonzero exit status, and nothing will be printed to stderr.
> > However there will be a difference, --output-as=xml will give you a "status"
> > element at the end with an error message, whereas --as-xml won't give you
> > anything further.
> 
> Actually, printing to stderr is not an issue for pcs and it's actually
> welcome. Printing plaintext to stdout and returning 0 is an issue.

I believe for XML everything will go to stdout, but it will return nonzero on error.

> Currently, pcs works like this:
> * Run 'crm_mon --as-xml'.
> * If return code == 1, exit with an error "unable to get cluster status"
> including crm_mon's stderr and stdout. This gives a comprehensible message
> and includes detail information coming from pacemaker.

I hope you mean nonzero; crm_mon can return a variety of nonzero exit statuses (in RHEL 8, guaranteed to follow what's given in "crm_error --list --exit").

With the new code, the only thing "crm_mon --as-xml" will print in an error situation will be "<crm_mon version="2.0.3"/>" (to stdout). It might not be too difficult to have it print a separate message to stderr; we'll look into that.

> * If return code == 0, parse crm_mon's stdout as xml to get the cluster
> status.
> 
> If the error only goes to the xml, we must change the logic in pcs to be
> able to get the error and be compatible with before-fix and after-fix
> pacemaker. Something like:
> * If crm_mon doesn't support --output-as=xml, use the logic described above.
> * Else:
> * Run 'crm_mon --output-as=xml'.
> * If return code == 0, parse crm_mon's stdout as xml to get the cluster
> status.
> * If return code == 1, and crm_mon's stdout is xml, parse it and get an
> error message from there and exit with the error.

That's correct (though again any nonzero is error).

> * If return code == 1, and crm_mon's stdout is not xml, get an error message
> from stderr (even if it's empty) and exit with the error.

Sounds good, though FYI in almost all cases the stdout will be xml -- the only exception would be if e.g. there was an invalid argument on the command line (at that point crm_mon doesn't yet know the desired output format and defaults to readable text to stderr).

> I'm trying to figure out what the pacemaker change means for pcs. Let me
> know if I got it right. Thanks.

Yes, that all sounds good, except that any nonzero exit is an error.

Comment 11 Chris Lumens 2020-02-28 16:38:07 UTC
> With the new code, the only thing "crm_mon --as-xml" will print in an error
> situation will be "<crm_mon version="2.0.3"/>" (to stdout). It might not be
> too difficult to have it print a separate message to stderr, we'll look into
> that.

It looks pretty easy to print errors to stderr for the legacy xml output.  The only question is whether to print them before, after, or in place of the rest of the XML.

Comment 12 Ken Gaillot 2020-02-28 16:50:39 UTC
(In reply to Chris Lumens from comment #11)
> > With the new code, the only thing "crm_mon --as-xml" will print in an error
> > situation will be "<crm_mon version="2.0.3"/>" (to stdout). It might not be
> > too difficult to have it print a separate message to stderr, we'll look into
> > that.
> 
> It looks pretty easy to print errors to stderr for the legacy xml output. 
> The only question is whether to print them before, after, or in place of the
> rest of the XML.

It doesn't really matter since they're different channels. stderr is usually unbuffered while stdout is usually buffered, so the order they're printed isn't necessarily the order they're seen, and in the case of pcs and most other scripts, they'll be saved separately anyway. Whatever's easiest :)

Comment 13 Tomas Jelinek 2020-03-02 08:33:35 UTC
Oh yes, I meant nonzero. All 'return code == 1' should have been 'return code != 0'. The actual code is checking for zero / nonzero.

Comment 14 Ken Gaillot 2020-03-09 23:35:55 UTC
QA:

This bug relates to crm_mon's XML output. There are two flavors: crm_mon --as-xml generates the "legacy" XML and crm_mon --output-as=xml generates the "new" XML. The output is identical except for the outermost tag (<crm_mon> for legacy, <pacemaker-result> for new), plus the new XML also has a <status> element at the end with the command's exit status code.

Previously, crm_mon could intermix error messages with legacy XML on stdout. An example error is trying to run crm_mon when the cluster is not running. After the fix, error messages should go to stderr while the XML goes to stdout. (With the new XML, the error messages are encoded in the XML itself, with the exception of very early errors such as incorrect argument usage.)

Separately, part of what crm_mon checks is the cluster's fencing history. It is possible for the fencing history query to fail while the rest of the cluster status is obtained correctly. Previously, this would result in an error message, but looking at just the XML output, there was no way to distinguish a failed fencing history query from an empty fencing history. After the fix, the <fence_history> element will have a "status" attribute with an exit code if there is an error in the fencing history query. This should be reproducible by killing pacemaker-fenced and then immediately running crm_mon (before the cluster can respawn the fencer).
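
A minimal sketch in Python of what the QA check could look like, assuming the names described in this comment (the <pacemaker-result>/<status> structure of the new XML and a "status" attribute on <fence_history>). It's an illustration under those assumptions, not a ready-made test, and assumes crm_mon is run right after killing the fencer as described above.

# Sketch of an automated check for the behaviour described above; element and
# attribute names follow this comment and may need adjusting against the
# actual crm_mon output.
import subprocess
import xml.etree.ElementTree as ET

proc = subprocess.run(["crm_mon", "--output-as=xml"],
                      capture_output=True, text=True)

# After the fix, stdout should contain only XML (errors go to stderr or into
# the XML itself).
root = ET.fromstring(proc.stdout)
assert root.tag == "pacemaker-result"  # legacy --as-xml uses <crm_mon> instead

# The trailing <status> element carries the exit code and, on failure, the
# error messages (e.g. the stonith-history ones).
status = root.find("status")
assert status is not None
print("exit code:", status.get("code"), "message:", status.get("message"))
for err in status.findall(".//error"):
    print("error:", err.text)

# A failed fencing-history query should be flagged on <fence_history> itself,
# distinguishing it from a merely empty history.
fence_history = root.find(".//fence_history")
if fence_history is not None and fence_history.get("status") not in (None, "0"):
    print("fence history not obtainable, status:", fence_history.get("status"))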

Comment 15 Ken Gaillot 2020-03-17 14:19:56 UTC
Fixed in upstream master branch. Commit 3986112 handles separating regular output and errors (and encoding more errors in XML), commit 78c6edd handles the fence history status in XML.

Comment 16 Patrik Hagara 2020-03-23 09:45:23 UTC
qa_ack+, reproducer in description and comment#14

Comment 19 Simon Foucek 2020-09-18 13:43:58 UTC
a) First part, which checks whether error messages are mixed with the legacy XML on stdout.
Before fix:

First test:

>[root@virt-166 ~]# rpm -q pacemaker
>pacemaker-2.0.3-5.el8_2.1.x86_64
>[root@virt-166 ~]# pcs cluster stop --all
>virt-161: Stopping Cluster (pacemaker)...
>virt-166: Stopping Cluster (pacemaker)...
>virt-167: Stopping Cluster (pacemaker)...
>virt-161: Stopping Cluster (corosync)...
>virt-166: Stopping Cluster (corosync)...
>virt-167: Stopping Cluster (corosync)...
>[root@virt-166 ~]# crm_mon --as-xml> >(sed 's/^/OUT: /') 2> >(sed 's/^/ERR: /' >&2) | cat
>OUT: <crm_mon version="2.0.3"/>

Second test:

>[root@virt-166 ~]# pkill -KILL pacemaker-fence && crm_mon --as-xml> >(sed 's/^/OUT: /') 2> >(sed 's/^/ERR: /' >&2) | cat
>ERR: Critical: Unable to get stonith-history
>OUT: Connection to the cluster-daemons terminated
>OUT: Reading stonith-history failed<crm_mon version="2.0.3"/>

Result: Some errors go to stdout rather than stderr, and there isn't only xml on stdout.


After fix:

First test:
>[root@virt-242 ~]# rpm -q pacemaker
>pacemaker-2.0.4-6.el8.x86_64
>[root@virt-242 ~]# pcs cluster stop --all
>virt-242: Stopping Cluster (pacemaker)...
>virt-243: Stopping Cluster (pacemaker)...
>virt-242: Stopping Cluster (corosync)...
>virt-243: Stopping Cluster (corosync)...
>[root@virt-242 ~]# crm_mon --as-xml> >(sed 's/^/OUT: /') 2> >(sed 's/^/ERR: /' >&2) | cat
>ERR: Not connected
>ERR: Could not connect to the CIB: Transport endpoint is not connected
>ERR: crm_mon: Error: cluster is not available on this node
>OUT: <crm_mon version="2.0.4"/>

Second test:

>[root@virt-242 ~]# pkill -KILL pacemaker-fence && crm_mon --as-xml> >(sed 's/^/OUT: /') 2> >(sed 's/^/ERR: /' >&2) | cat
>ERR: Critical: Unable to get stonith-history
>ERR: Connection to the cluster-daemons terminated
>ERR: Reading stonith-history failed
>OUT: <crm_mon version="2.0.4"/>

Result: All errors are on stderr and there is only xml on stdout.


b) Second part, which checks whether there are <error> elements about the fence history.
Before fix:

>[root@virt-166 ~]# rpm -q pacemaker
>pacemaker-2.0.3-5.el8_2.1.x86_64
>[root@virt-166 ~]# pkill -KILL pacemaker-fence && crm_mon --output-as=xml> >(sed 's/^/OUT: /') 2> >(sed 's/^/ERR: /' >&2) | cat
>ERR: Critical: Unable to get stonith-history
>OUT: Connection to the cluster-daemons terminated
>OUT: Reading stonith-history failed<pacemaker-result api-version="2.0" request="crm_mon --output-as=xml">
>OUT:   <status code="0" message="OK"/>
>OUT: </pacemaker-result>

Result: There aren't any <error> elements about the fence history inside <status>.

After fix:

>[root@virt-242 ~]# rpm -q pacemaker
>pacemaker-2.0.4-6.el8.x86_64
>[root@virt-242 ~]# pkill -KILL pacemaker-fence && crm_mon --output-as=xml> >(sed 's/^/OUT: /') 2> >(sed 's/^/ERR: /' >&2) | cat
>OUT: <pacemaker-result api-version="2.2" request="crm_mon --output-as=xml">
>OUT:   <status code="0" message="OK">
>OUT:     <errors>
>OUT:       <error>Critical: Unable to get stonith-history</error>
>OUT:       <error>Connection to the cluster-daemons terminated</error>
>OUT:       <error>Reading stonith-history failed</error>
>OUT:     </errors>
>OUT:   </status>
>OUT: </pacemaker-result>

Result: There are <error> elements about the fence history inside the <status> element.

Comment 22 errata-xmlrpc 2020-11-04 04:00:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4804

