Red Hat Bugzilla – Bug 987252
INVALID_CHARACTER_ERR is not shown when editing XML, only in the docbuilder error
Last modified: 2014-07-13 17:10:53 EDT
Description of problem:
When an invalid character (&) is in the XML text, the validation message is still "The XML is well formed." but the rendered page does not display. It is hard to locate the problem right away. I only found out what was causing the problem when docbuilder displayed the error message:
Topic ID 15007
INFO: Topic URL
ERROR: This topic doesn't have well-formed xml. INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified. The processed XML is
An error message can be displayed during XML editing, so that users don't have to wait for a build to find the error.
An error message to tell users to change '&' to '&amp;'
Hey Julie, can you provide the topic id this occurred for (and preferably a revision if possible)? As I've tried most variations I can think of and I can't replicate it. When I add an ampersand to the XML I got an error every time, so if I could see how it was used that would help.
For what it's worth these are the errors I got when trying it standalone and at the start, end and middle of a string:
1. parser error : EntityRef: expecting ';'
2. parser error : xmlParseEntityRef: no name
3. parser error : Entity 'production' not defined
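For anyone wanting to reproduce the three cases above outside the editor, here is a minimal sketch using Python's standard library parser. It is not the editor's validator, the helper name is made up for illustration, and the wording of the messages will differ from the ones quoted above; the point is only that each variant of a bare ampersand is a well-formedness error.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return None if xml_text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as exc:
        return str(exc)


# Hypothetical sample strings: escaped, standalone, prefix and undefined entity.
samples = [
    "<para>Fish &amp; chips</para>",
    "<para>Fish & chips</para>",
    "<para>Fish &chips</para>",
    "<para>&production;</para>",
]

for sample in samples:
    print(sample, "->", well_formedness_error(sample) or "well formed")
```

Only the first sample, where the ampersand is written as &amp;, parses cleanly.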
(In reply to Lee Newson from comment #1)
> Hey Julie, can you provide the topic id this occurred for (and preferably a
> revision if possible)? As I've tried most variations I can think of and I
> can't replicate it. When I add an ampersand to the XML I got an error every
> time, so if I could see how it was used that would help.
> For what it's worth these are the errors I got when trying it standalone and
> at the start, end and middle of a string:
> 1. parser error : EntityRef: expecting ';'
> 2. parser error : xmlParseEntityRef: no name
> 3. parser error : Entity 'production' not defined
I just checked again and can't seem to replicate the error.
Topic ID 15007
You're right; I see all those error messages when I type a &.
Okay, I was able to find out why you saw what you did. The problem isn't the ampersand at all; it comes from using an HTML entity in XML content (in this case, one that renders as ’). This is generally discouraged, which is why it isn't implemented in the Java XML parser (Xerces). I'm going to leave this open to see if we can get this error to show in the UI as well.
I looked into this more. The XML spec (http://www.w3.org/TR/REC-xml/#sec-references) refers to the &#...; notation as a character reference, so when the XML is converted to a DOM object the reference should be converted into a character. This is unlike an entity reference (&...;), which refers to some internal or external entity. So possibly what we should be doing is getting the server/csprocessor to convert the character references to their actual characters instead of trying to keep them as entity references.
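To illustrate the distinction, here is a small sketch using Python's standard library parser rather than Xerces or the server code. The sample strings and the helper name are assumptions for illustration. A numeric character reference is resolved to a character during parsing, while a named HTML entity that the document never declares is rejected.

```python
import xml.etree.ElementTree as ET


def parsed_text(xml_text):
    """Parse a single-element document and return its text content."""
    return ET.fromstring(xml_text).text


# Decimal and hexadecimal character references both resolve to U+2019.
decimal_form = parsed_text("<para>It&#8217;s</para>")
hex_form = parsed_text("<para>It&#x2019;s</para>")
print(decimal_form == hex_form == "It\u2019s")

# A named HTML entity is an entity reference with no declaration in XML,
# so the parser reports an error instead of producing a character.
try:
    parsed_text("<para>It&rsquo;s</para>")
except ET.ParseError as exc:
    print("rejected:", exc)
```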
Fixed in 1.8-SNAPSHOT build 201407020954
Character references (or HTML entities, as they are sometimes known) are now excluded from the entity escaping function, meaning that they will be resolved into an actual character by the XML parser.
To go with this, our custom convertNodeToString function has also been updated to escape the reserved characters (this is also done by most XML serializers).
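A rough sketch of the idea in Python, not the actual server code: the placeholder scheme, regexes and function names below are all hypothetical. Named entity references are protected before parsing so they survive as written, character references are left for the parser to resolve, and reserved characters are re-escaped when the node is turned back into a string.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Matches named references such as &product; but not &#8217; or &#x2019;.
NAMED_ENTITY = re.compile(r"&([A-Za-z_][\w.-]*);")
PLACEHOLDER = re.compile(r"\[\[ENTITY:([\w.-]+)\]\]")
# The five entities every XML parser already understands.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def protect_named_entities(xml_text):
    """Swap custom named entities for placeholders; skip character references."""
    def repl(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        return f"[[ENTITY:{name}]]"
    return NAMED_ENTITY.sub(repl, xml_text)


def text_to_xml(text):
    """Escape &, < and > in text, then restore the protected entities."""
    return PLACEHOLDER.sub(r"&\1;", escape(text))


def round_trip(xml_text):
    """Parse a single text-only element and serialize it again."""
    root = ET.fromstring(protect_named_entities(xml_text))
    return f"<{root.tag}>{text_to_xml(root.text or '')}</{root.tag}>"


print(round_trip("<para>&product; isn&#8217;t &lt;b&gt; R&amp;D</para>"))
```

In this sketch the custom entity comes back unchanged, the character reference comes back as the character itself, and the reserved characters stay escaped.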
Confirmed that character references added to a topic are rendered correctly in the live preview, and are converted to their corresponding characters correctly when saved through the web UI and WebDAV.
Confirmed that topics with extended characters build and preview ok with Publican and csprocessor.
Character references build ok when they are added to the content spec directly (like with a section title or the spec product), but the references are not replaced with their associated character.
If we are replacing these entities in topics, it probably makes sense to extend this behaviour to content specs too.
The bigger question here is do we want users to use them in Content Specs, as it is not meant to be XML syntax and is supposed to be clear text?
(In reply to Lee Newson from comment #13)
> The bigger question here is do we want users to use them in Content Specs,
> as it is not meant to be XML syntax and is supposed to be clear text?
With regard to this, I know we have allowed some XML content into the specs (i.e. entities); however, I'd like to keep as much out as possible.
I was thinking about this more over the weekend and given it won't be visible to most users (with the exception of invalid specs), we might as well implement this.
Fixed in 1.8-SNAPSHOT build 201407070919
The content spec parser has been updated to resolve XML Character references when parsing.
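As a rough illustration of what resolving character references in plain text can look like, here is a minimal Python sketch. It is not the content spec parser's actual code; the function name and the sample spec line are assumptions. It handles only numeric references and leaves named entities untouched.

```python
import re

# Decimal (&#8217;) or hexadecimal (&#x2019;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def resolve_char_refs(text):
    """Replace numeric character references with the characters they name."""
    def repl(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the reference as written.
            return match.group(0)
    return CHAR_REF.sub(repl, text)


# Hypothetical content spec line.
print(resolve_char_refs("Title = Developer&#8217;s Guide"))
```

A regex is used here instead of a general HTML unescape so that named entities such as &product; pass through unchanged.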
Verified that character references in a spec edited through the UI or created using csprocessor were replaced as expected.