Bug 1581150

Summary: 'pcs config' fails on 'crm_mon --one-shot --as-xml --inactive' outputting anything on stderr
Product: Red Hat Enterprise Linux 7 Reporter: Tomas Jelinek <tojeline>
Component: pcs Assignee: Tomas Jelinek <tojeline>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: unspecified Docs Contact:
Priority: urgent    
Version: 7.5 CC: cfeist, cluster-maint, cluster-qe, idevat, jpokorny, kgaillot, kwenning, omular, rsteiger, tojeline
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: pcs-0.9.165-1.el7 Doc Type: Bug Fix
Doc Text:
Cause: Pcs runs crm_mon to get status of a cluster in XML format. Crm_mon prints XML to standard output and warnings to standard error output.
Consequence: Pcs mixes XML and warnings into one stream and is then unable to parse it as XML.
Fix: Keep standard and error outputs separated in pcs.
Result: Reading XML status of a cluster works.
Story Points: ---
Clone Of: 1578955 Environment:
Last Closed: 2018-10-30 08:06:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
proposed fix none

Description Tomas Jelinek 2018-05-22 08:54:04 UTC
+++ This bug was initially created as a clone of Bug #1578955 +++

Description of problem:
With certain configurations, 'pcs config' fails completely,
claiming that the XML does not conform to the schema.

Version-Release number of selected component (if applicable):
pcs-0.9.162-5.el7

How reproducible:
100%

Steps to Reproduce:

This example was walked through on rhel-8.0 with pcs-0.9.164-1.el7
but should work with anything that is mangled via upgrade as
described in /usr/share/pacemaker/upgrade-*.xsl.

1. Set up a cluster with a stonith resource named Fencing (e.g. the one the test-config cti would set up)

2. Add an attribute that would be altered via a cib upgrade

cibadmin --create --xml-text '<configuration> <resources> <primitive id="Fencing"> <instance_attributes id="Fencing-params"> <nvpair id="Fencing-pcmk_arg_map" name="pcmk_arg_map" value="domain:uname"/> </instance_attributes> </primitive> </resources> </configuration>'

3. Issue 'pcs config'

Actual results:
  [root@node2 ~]# pcs config
  Cluster Name: sbd-rhel8
  Error: cannot load cluster status, xml does not conform to the schema

Expected results:
output of cluster-config


Additional info:

The misbehaviour is due to pcs capturing both stderr and stdout from crm_mon in one stream; the resulting mixture does not conform to even the basic structure of XML.
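
To illustrate the point, here is a minimal Python sketch (not pcs code) showing that a single warning line mixed into otherwise valid XML makes the whole text unparseable. The XML snippet is a made-up stand-in for real crm_mon output, and `parses` is a hypothetical helper used only for this illustration. Note that this only demonstrates the basic parse failure; the exact error message pcs prints comes from pcs itself.

```python
import xml.etree.ElementTree as ET

# Stand-in for what crm_mon writes to stdout (assumed, heavily shortened).
xml_out = '<crm_mon version="1.1.18"><summary/></crm_mon>'

# The warning crm_mon wrote to stderr in this report.
warning = (
    "Resource instance_attributes: Fencing-pcmk_arg_map: "
    "dropping pcmk_arg_map\n"
)


def parses(text):
    """Return True if text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print("stdout only:     ", parses(xml_out))
print("stderr + stdout: ", parses(warning + xml_out))
```

The same failure occurs wherever the warning lands in the merged stream, since text outside the root element is not allowed in an XML document.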

Suggested fix:
diff --git a/pcs/utils.py b/pcs/utils.py
index 1648a62..b928629 100644
--- a/pcs/utils.py
+++ b/pcs/utils.py
@@ -1947,7 +1947,8 @@ def getClusterState():
 
 # DEPRECATED, please use lib.pacemaker.live.get_cluster_status_xml in new code
 def getClusterStateXml():
-    xml, returncode = run(["crm_mon", "--one-shot", "--as-xml", "--inactive"])
+    xml, returncode = run(["crm_mon", "--one-shot", "--as-xml", "--inactive"],
+                          ignore_stderr=True)
     if returncode != 0:
         err("error running crm_mon, is pacemaker running?")
     return xml

Command to revert the addition of the attribute:
cibadmin --delete --xml-text '<nvpair id="Fencing-pcmk_arg_map"   name="pcmk_arg_map" />'

Check what crm_mon writes to stderr in addition to the XML that pcs parses:
  [root@node2 ~]# crm_mon --one-shot --as-xml --inactive >/dev/null
  Resource instance_attributes: Fencing-pcmk_arg_map: dropping pcmk_arg_map

Be aware that pacemaker-2.0.x bumps the major version, and with that come substantial changes in the cluster configuration.
To cope with that, a lot more of the configuration is mangled via /usr/share/pacemaker/upgrade-*.xsl,
so be prepared to meet configs that produce this kind of stderr output a lot more often.
When the mangling cannot be done automatically, stderr gives hints on how to adapt the CIB manually.
So it might be worth not just ignoring the output on stderr but also showing it in one way or another.
AFAIK there is currently no easy string to match on that tells whether manual intervention is necessary.
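
As a rough sketch of that idea (an illustration only, not the actual pcs implementation), the two streams can be captured separately so that stdout is parsed as XML while stderr is kept as a list of warnings to show to the user. The helper names `run_separated` and `status_with_warnings` are hypothetical, and `fake_crm_mon` is an assumed stand-in command that imitates crm_mon by writing XML to stdout and a warning to stderr, so the example runs without a cluster.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET


def run_separated(cmd):
    """Run cmd and return (stdout, stderr, returncode), streams kept apart."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.stdout, proc.stderr, proc.returncode


def status_with_warnings(cmd):
    """Parse stdout as XML; return (root element, list of stderr lines)."""
    stdout, stderr, returncode = run_separated(cmd)
    if returncode != 0:
        raise RuntimeError("error running status command: " + stderr.strip())
    warnings = [line for line in stderr.splitlines() if line.strip()]
    return ET.fromstring(stdout), warnings


# Hypothetical stand-in for crm_mon: XML on stdout, a warning on stderr.
fake_crm_mon = [
    sys.executable,
    "-c",
    "import sys; "
    "sys.stderr.write('This is a test warning\\n'); "
    "sys.stdout.write('<crm_mon><summary/></crm_mon>')",
]

root, warnings = status_with_warnings(fake_crm_mon)
if warnings:
    print("WARNINGS:")
    for line in warnings:
        print(line)
print("root element:", root.tag)
```

On a real cluster the command would be the crm_mon invocation from the diff above; whether the warnings are printed, logged or dropped is a design choice left to the caller.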

Comment 1 Tomas Jelinek 2018-05-22 09:46:39 UTC
RHEL7 reproducer:

1. Set up a crm_mon wrapper:
# mv /usr/sbin/crm_mon /usr/sbin/crm_mon.original
# cat<<2EOF > /usr/sbin/crm_mon
#!/bin/sh
echo "This is a test warning" 1>&2
/usr/sbin/crm_mon.original \$*
2EOF
# chmod a+x /usr/sbin/crm_mon

2. Run 'pcs config'

Actual results:
# pcs config
Cluster Name: rhel75
Error: cannot load cluster status, xml does not conform to the schema

Expected results:
output of cluster-config

Comment 2 Tomas Jelinek 2018-05-22 11:11:34 UTC
Created attachment 1439975 [details]
proposed fix

After fix:
* 'pcs config' works
* 'pcs status' displays crm_mon's warnings:
# pcs status
Cluster name: rhel75

WARNINGS:
This is a test warning

Stack: corosync
Current DC: rh75-node2 (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
...

Comment 4 Ivan Devat 2018-06-22 12:19:01 UTC
After Fix:

[ant ~] $ rpm -q pcs pcs-snmp
pcs-0.9.165-1.el7.x86_64
pcs-snmp-0.9.165-1.el7.x86_64

[ant ~] $ mv /usr/sbin/crm_mon /usr/sbin/crm_mon.original
[ant ~] $ cat<<2EOF > /usr/sbin/crm_mon
#!/bin/sh
echo "This is a test warning" 1>&2
/usr/sbin/crm_mon.original \$*
2EOF
[ant ~] $ chmod a+x /usr/sbin/crm_mon
[ant ~] $ pcs config
Cluster Name: zoo
Corosync Nodes:
Pacemaker Nodes:
 ant bee

Resources:

Stonith Devices:
 Resource: xvm-fencing (class=stonith type=fence_xvm)
  Attributes: pcmk_host_list="ant bee"
  Operations: monitor interval=60s (xvm-fencing-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: zoo
 dc-version: 1.1.16-12.el7-94ff4df
 have-watchdog: false

Quorum:
  Options:

Comment 8 errata-xmlrpc 2018-10-30 08:06:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3066