Red Hat Bugzilla – Bug 825005
Some global entities still break topics
Last modified: 2014-08-04 18:27:13 EDT
I've cloned an existing bug that was marked CLOSED CURRENT RELEASE, as I think it did indeed fix the cases that were listed in it. Today, however, while using the csprocessor to build output containing a number of newly imported topics, I received the error "ERROR: Topic doesn't have well-formed xml" in the compiler output.
I couldn't see anything obviously wrong with the topic content, so I exported it and used xmllint to validate it against the DocBook 4.5 DTD. Sure enough, it passed as valid.
After a bit of investigation I found that the lines causing issues used &#39;, which is the single quote. This in itself isn't a problem; after all, I can just replace each one with a literal single quote. What I would like to see, however, is:
1) A better error message. The topic was well-formed as far as the DocBook XML 4.5 DTD goes.
2) A review of how entities are handled. It seems that under the previous bug the most common cases were picked up, but we still have entities that are valid in DocBook XML 4.5 but don't work via skynet.
The topic # was 7526 and the revision # exhibiting the issue is 97932.
Will look into it, since I changed the entity handling recently due to another bug that wasn't logged, where the text running from the ampersand to the semicolon (& will fail";) was picked up as an entity in:
private final String "This is an example that contains an ampersand & will fail";
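To illustrate how a line like the one above can be mistaken for an entity, here is a minimal sketch. It assumes the entity scan was regex-based, which the thread does not confirm; the class name EntityScan and both patterns are hypothetical and not taken from the csprocessor source. The loose pattern treats anything between an ampersand and the next semicolon as an entity, while the stricter one requires an XML-style name directly after the ampersand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the csprocessor implementation.
public class EntityScan {
    // Loose: anything from '&' up to the next ';' counts as an entity.
    static final Pattern LOOSE = Pattern.compile("&[^;]*;");
    // Stricter: '&' must be followed directly by a name, then ';'.
    static final Pattern STRICT = Pattern.compile("&[A-Za-z_][A-Za-z0-9_.-]*;");

    // Collects every substring of text that the given pattern matches.
    static List<String> find(Pattern p, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String line = "private final String \"This is an example that contains an ampersand & will fail\";";
        System.out.println(find(LOOSE, line));   // prints [& will fail";]
        System.out.println(find(STRICT, line));  // prints []
    }
}
```

Note that the stricter pattern deliberately ignores numeric references such as those starting with &#, so a real fix would need a separate rule for them.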
As for the better error messages, I'll have to take another look, as the last time I looked the library we used didn't report any errors. So that means I'll have to find another one that fits our requirements.
Also, what version were you using? I'm asking because I originally missed values that used a # and fixed that in 0.24.2. That said, there is another underlying issue even if you are using 0.24.2.
On the note of the better error message, I found a way with our current library to return error messages, so the next version will contain something like:
ERROR: Topic doesn't have well-formed xml. The content of elements must consist of well-formed character data or markup.
or for missing tags:
ERROR: Topic doesn't have well-formed xml. The element type "para" must be terminated by the matching end-tag "</para>".
While the libraries do return line numbers, they are sometimes very inaccurate depending on the issue, so for now I'll leave them out.
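As a rough sketch of how messages like these can be surfaced, the standard JAXP parser (Xerces-based in the JDK) reports well-formedness failures as a SAXParseException whose message can be appended to the existing error text. The class and method names below (WellFormedCheck, check) are hypothetical, and this is not necessarily how the csprocessor does it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the csprocessor implementation.
public class WellFormedCheck {
    static final String PREFIX = "ERROR: Topic doesn't have well-formed xml.";

    // Returns null when the XML parses, otherwise an error line that
    // includes the parser's own message.
    static String check(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // e.getLineNumber() is also available, but is left out here
            // because reported line numbers can be misleading.
            return PREFIX + " " + e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O problems.
            return PREFIX + " " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(check("<section><para>text</section>"));
        System.out.println(check("<para>a & b</para>"));
    }
}
```

The exact wording after the prefix comes from the parser, so it can differ between parser versions and locales.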
Will have to talk to Matt on Monday about this, since Xerces doesn't permit HTML entities when parsing. I need to check whether we should throw errors about them or just convert the HTML entities to XML entities. Since xmllint and other tools we use allow them, I would say that the second option is the better one.
Talked to Matt about using HTML entities in XML, and he said that we should throw an error and encourage the correct XML entities to be used.
Marking this as ON_QA since we are counting HTML decimal notation as invalid. You should use the string-based version, as it is more readable and is the only format that works with Xerces at this point in time.
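For anyone cleaning up topics ahead of a build, the following is a minimal sketch of a check that flags decimal-notation entities and suggests the string-based form where XML predefines one. The class name, method name, and message wording are hypothetical and do not reflect the csprocessor's actual output. Only the five entities predefined by XML have a named equivalent here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the csprocessor implementation.
public class DecimalEntityCheck {
    // The five entities predefined by XML, keyed by code point.
    static final Map<Integer, String> NAMED = Map.of(
        34, "&quot;", 38, "&amp;", 39, "&apos;", 60, "&lt;", 62, "&gt;");

    // Decimal character references such as &#39; (at most 7 digits).
    static final Pattern DECIMAL = Pattern.compile("&#([0-9]{1,7});");

    // Returns one message per decimal reference found in the topic XML.
    static List<String> problems(String topicXml) {
        List<String> out = new ArrayList<>();
        Matcher m = DECIMAL.matcher(topicXml);
        while (m.find()) {
            String named = NAMED.get(Integer.parseInt(m.group(1)));
            if (named != null) {
                out.add("Found " + m.group() + "; use " + named + " instead");
            } else {
                out.add("Found " + m.group() + "; no predefined XML entity exists for it");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String msg : problems("<para>It&#39;s a test</para>")) {
            System.out.println(msg);
        }
    }
}
```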