Bug 216780

Summary: Cryptic XML messages from system-config-cluster
Product: [Retired] Red Hat Cluster Suite
Component: redhat-config-cluster
Version: 4
Hardware: All
OS: Linux
Severity: medium
Priority: medium
Status: CLOSED NOTABUG
Reporter: Robert Peterson <rpeterso>
Assignee: Jim Parsons <jparsons>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint
Doc Type: Bug Fix
Last Closed: 2006-11-27 16:02:21 UTC

Description Robert Peterson 2006-11-21 23:22:59 UTC
Description of problem:
I have a cluster.conf I've been using for a long time on my RHEL4U4
cluster.  It may not be perfect, but the cluster runs.  However,
if I go into the system-config-cluster gui, I get a bunch of
error messages that are so cryptic I can't tell what's wrong
with the cluster.conf.  By the way, it's not so different from
the cluster.conf documented in the usage.txt file.

Version-Release number of selected component (if applicable):
RHEL4U4

How reproducible:
Always

Steps to Reproduce:
Use this cluster.conf:

<?xml version="1.0"?>
<cluster config_version="1" name="bobs_roth">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="1800"/>
        <clusternodes>
                <clusternode name="roth-01" votes="1">
                        <fence>
                                <method name="single">
                                        <device name="roth-apc" switch="1"
port="1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="roth-02" votes="1">
                        <fence>
                                <method name="single">
                                        <device name="roth-apc" switch="1"
port="2"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="roth-03" votes="1">
                        <fence>
                                <method name="single">
                                        <device name="roth-apc" switch="1"
port="3"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="10.15.87.34" login="apc"
name="roth-apc" passwd="apc"/>
                <fencedevice agent="fence_baytech" ipaddr="10.15.87.33"
login="admin" name="baytech" passwd=""/>
                <fencedevice agent="fence_brocade" ipaddr="10.1.1.2"
login="user" name="brocade1" passwd="pw"/>
                <fencedevice agent="fence_brocade" ipaddr="10.1.1.3"
login="user" name="brocade2" passwd="pw"/>
                <fencedevice agent="fence_manual" name="human"/>
                <fencedevice agent="fence_cwigrid" name="iGrid"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="igridnodes1" ordered="1"
restricted="1">
                                <failoverdomainnode name="roth-01" priority="1"/>
                                <failoverdomainnode name="roth-02" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="igridnodes2" ordered="1"
restricted="1">
                                <failoverdomainnode name="roth-02" priority="1"/>
                                <failoverdomainnode name="roth-01" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="10.15.84.250" monitor_link="1"/>
                        <ip address="10.15.84.251" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="igridnodes1" name="10.15.84.251">
                        <ip ref="10.15.84.251"/>
                </service>
                <service autostart="1" domain="igridnodes2" name="10.15.84.250">
                        <ip ref="10.15.84.250"/>
                </service>
        </rm>
</cluster>

Go into the system-config-cluster gui.
  
Actual results:
/etc/cluster/cluster.conf:2: element cluster: Relax-NG validity error :
Expecting an element gulm, got nothing
/etc/cluster/cluster.conf:2: element cluster: Relax-NG validity error : Invalid
sequence in interleave
/etc/cluster/cluster.conf:2: element cluster: Relax-NG validity error : Element
cluster failed to validate content
/etc/cluster/cluster.conf:8: element device: validity error : IDREF attribute
name references an unknown ID "roth-apc"
/etc/cluster/cluster.conf:15: element device: validity error : IDREF attribute
name references an unknown ID "roth-apc"
/etc/cluster/cluster.conf:22: element device: validity error : IDREF attribute
name references an unknown ID "roth-apc"
/etc/cluster/cluster.conf fails to validate

Expected results:
How about telling me what I did wrong?  Or what it doesn't like?

Additional info:
It's easy to blow me off and say "He doesn't know what he's doing," so
let's examine each of these messages one by one from a cluster admin's
point of view:

1. /etc/cluster/cluster.conf:2: element cluster: Relax-NG validity error :
Expecting an element gulm, got nothing

This is a dlm cluster.  It has nothing to do with gulm.  It doesn't say
what it really expected.  Is it expecting "lock_proto=lock_gulm"?
As far as I know, cluster.conf shouldn't need to know the lock proto
because GFS stores this info in the superblock, and the init scripts
take care of loading the necessary kernel modules.

2. /etc/cluster/cluster.conf:2: element cluster: Relax-NG validity error :
Invalid sequence in interleave

What's an "invalid sequence" to your typical sysadmin?
What does Relax-NG mean?  Nothing to me, but I'm not an xml expert.
Interleave doesn't appear anywhere in the cluster.conf file.  
So again, what does it not like?  What "sequence" is wrong?

3. /etc/cluster/cluster.conf:2: element cluster: Relax-NG validity error :
Element cluster failed to validate content

Validity error tells me absolutely nothing.  I get that it means
line 2, and the "cluster" tag.  But what doesn't it like about line 2?

4. /etc/cluster/cluster.conf:8: element device: validity error : IDREF attribute
name references an unknown ID "roth-apc"

Well, IDREF doesn't appear in cluster.conf.  "Name" appears on almost
every line.  Line 8's "name" specifies roth-apc, all right, but I don't
see anything wrong with it; then again, I'm not an xml parser.
It'd be nice if it told me what it didn't like, such as "fence device
roth-apc not found."  But the device looks to be there to my untrained eyes.

5. /etc/cluster/cluster.conf:15: element device: validity error : IDREF
attribute name references an unknown ID "roth-apc"

Same as 4.

6. /etc/cluster/cluster.conf:22: element device: validity error : IDREF
attribute name references an unknown ID "roth-apc"
/etc/cluster/cluster.conf fails to validate

Same as 4.

I fully realize that we're just using an xml parser here, but to
users this is about as cryptic as humanly possible.  If it were in
Japanese, Hungarian, or Thai it would have made as much sense to me.

I also realize that we're heading away from s-c-c and onto conga,
and we definitely have higher priorities, but still...

Perhaps we need an xml-parser-to-sysadmin translation layer.

Comment 1 Jim Parsons 2006-11-27 16:02:21 UTC
There is really no reason to be condescending here (which I feel your bug
description reeks of) about an error in a configuration file that you built by
hand and then expected s-c-cluster to diagnose for you.
1) If you had built the original conf file in s-c-cluster, this error would not
exist.
2) Before reading a configuration file, s-c-cluster uses a RelaxNG schema file
to validate that the incoming file does not contain crud. The command used is
'xmllint --relaxng cluster.ng cluster.conf', where cluster.ng is a file included
in the distribution in the src/misc directory. In fact, you can run this command
on your own (a sample invocation is sketched after this list).
3) RelaxNG is a much more lightweight mechanism for checking xml validity than
the only other option, XML Schema. That is a nightmare to maintain, and the
error messages it produces are as cryptic as the ones relaxng mode in xmllint
produces. One advantage to using RelaxNG is that the guy who wrote and
maintains it works for Red Hat, and his error messages do improve with every
release...so it is not the GUI tool producing the cryptic messages, it is the
external validation checker.
4) When a conf file fails, the relaxng error messages should be treated like
compile or link errors, where the FIRST error message is the one that needs to
be explored first. In the case of this file, the 'expecting gulm...' message is
the clue. Since you are not using GuLM, there should be a cman tag, even if it
is an empty one: <cman/>. This is a requirement in the schema for
cluster.conf...if a file without either a gulm or a cman tag will work as a
cman cluster on rhel4, that is a lucky accident - it violates the xml schema
for cluster.conf.

If you add the <cman/> tag under <cluster> your errors will go away.
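
For illustration, the top of the posted file would then start out roughly like
this (only the <cman/> line is new; if the cluster is already running, bump
config_version as usual when pushing the change):

<?xml version="1.0"?>
<cluster config_version="1" name="bobs_roth">
        <cman/>
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="1800"/>
        <clusternodes>
        ... rest of the file unchanged ...
</cluster>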

If you have a nasty validation failure, one quick way to troubleshoot it is to
run xmllint directly and comment out big chunks of the file to quickly localize
which section is actually giving you the issue - such as everything under the
<rm> tag...then keep refining what is commented out until you find the true
culprit --- or just write the conf file in s-c-cluster from the get-go.
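
As a sketch of that approach (schema path illustrative as before), wrap a whole
section in an XML comment, re-run xmllint, and see whether the errors change:

        <!-- temporarily disabled while narrowing down the validation failure
        <rm>
                ... contents of the rm section ...
        </rm>
        -->

  xmllint --noout --relaxng /path/to/cluster.ng /etc/cluster/cluster.conf

If the errors go away, the culprit is inside the commented block; keep shrinking
the commented region until you hit the exact element, then restore the file.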