Bug 734986 - Handling of entities in topics.
Summary: Handling of entities in topics.
Alias: None
Product: Topic Tool
Classification: Other
Component: cli-Topic_Tool
Version: 0.0.x
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Stephen Gordon
QA Contact:
Depends On:
TreeView+ depends on / blocked
Reported: 2011-09-01 04:47 UTC by Stephen Gordon
Modified: 2012-01-20 21:06 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2012-01-20 21:06:01 UTC

Attachments (Terms of Use)

Description Stephen Gordon 2011-09-01 04:47:07 UTC
Description of problem:

The topic tool, quite intentionally, doesn't handle entities. The reason being that you can't guarantee that a given entity will be defined wherever a topic happens to be used.

That said currently even common entities like   and &emdash;, which are supported by the DocBook DTD, result in an error from the validation process too.

The most likely solutions are to:

a) Inform the validator of the location of the DTD without physically adding it to the file.

b) Detecting the root element (which we do anyway) and building the proper doctype tag based on that entity.

I've verified that the second works, and in fact is the workaround if you need to do it manually now, but I think the first is a better solution if I can make it work as well.

Version-Release number of selected component (if applicable):


Additional info:

[sgordon@zugzug tmp]$ topic import Migration_Prerequisites.xml References/RHEV/2.3/Migration/
Is this topic intended for use more than once within a single document or set? If not then choose (L)ink Targetable topic. Otherwise choose (R)eusable topic. Access (H)elp for further information. (L/R/H): 
[Fatal Error] Migration_Prerequisites.xml:41:88: The entity "nbsp" was referenced, but not declared.
ERROR: Unable to parse 'Migration_Prerequisites.xml' (The entity "nbsp" was referenced, but not declared.).

Comment 1 Stephen Gordon 2011-09-22 01:30:57 UTC
Working on this. I have been fudging about with the Xerces library for a while. The constraints on any solution appear to be:

- Regardless of any settings regarding validation or entity expansion, Xerces still appears to need to be able to resolve entities.
- We don't want to set a doctype in the topics themselves, allowing for instance quick switching between RocBook and DocBook DTDs, for example.

The easiest way to handle this that I can see is to inject the DTD into the XML files on export. The DTD should default to DocBook 4.5 but be configurable via topic.conf. I had already partially done this for the xmllint check on input but we need more granular specification of the DTD values.

I did also try using Xerces' ability to validate using a Schema but it appears the DocBook 4.5 Schema is not functionally equivalent to the DTD.

Comment 2 Stephen Gordon 2011-09-22 04:26:25 UTC
As an additional spanner in the works the getDoctype() function of the Document can't be used to determine if this processing is needed, the exceptions relating to entities occur as soon as the document is parsed.

Any manipulation of the DOCTYPE will have to be done manually using normal file I/O.

Comment 3 Matthew Casperson 2011-10-11 01:02:02 UTC
This might be of some help: https://bugzilla.redhat.com/show_bug.cgi?id=739466#c2

Comment 4 Stephen Gordon 2011-12-06 21:48:10 UTC
Nothing like coming back to something after a long while to have another look... After some consideration what I have elected to do is update the `topic import` logic to attempt to ensure some standardization of the doctype declarations we use in topics - based on what I have seen this is inline with what comes out of skynet (though I still need to set encoding properly yet).

Before attempting to load the topics DOM for 'proper' parsing the topic tool now ensures that the XML at least has:

1) The XML preamble, '<?xml version="1.0"?>' (ultimately this should also include an encoding value, not in the initial commit though).
2) A valid doctype declaration, by default for DocBook 4.5 though this remains configurable as a result of other changes made previously.

For existing topics this has no impact* as to be created in the first place it must have been parseable without knowing to much about the doctype, but for new topics this means that we can now use entities as defined in the DocBook 4.5 DTD.

The only side-effect of all this is that the xmllint check had to be moved back a bit, the initial DTD we set in the code is invalid in that it does not specify the correct root node of the topic. Rather than manually parse this as well we get this information automatically because this change gives the automated parser enough information on the DTD that it can fill in the blanks. 

The output of this (and the other XML manipulations we perform) once it comes back out is what is now run through xmllint. The upside of this is that the validation check is now against *exactly* the same XML as what we intend to check in. 

* Ultimately I would like to go through and clean these up, but I won't be doing this until we've actually pushed a version of the tool that contains this change.

Comment 5 Stephen Gordon 2011-12-06 21:56:12 UTC
Committed initial change, I still want to:

1) Test that there is no negative impact on other flows now that entities are appearing in topics - particular topic export.
2) Add checks to ensure that not only the default preamble being added contains the encoding attribute but that existing preambles that don't have it get it added to them.
3) Re-factor the import logic completely - this change was way harder than it should have been as a result of the residual cruft we have in here.

Committed revision 75993.

Comment 6 Stephen Gordon 2011-12-07 19:17:27 UTC
The import logic now ensures that the encoding attribute in the preamble is set. 

If a preamble is provided which sets the encoding to UTF-8 or UTF-16 then it is left as is. If no preamble is specified or the preamble specified is for a different type of encoding then it is overridden and UTF-8 is used.

Note that this only impacts the logic used for parsing the file being imported. Once it's parsed we serialize it as UTF-8 always and the preamble of the file once stored in the topic repository reflects this.

Committed revision 76104.

Comment 7 Stephen Gordon 2011-12-07 19:19:14 UTC
Some rudimentary testing of the topic export logic has been done, so moving to modified. More wide ranging testing required before release.

Comment 8 Stephen Gordon 2012-01-20 21:06:01 UTC

Note You need to log in before you can comment on or make changes to this bug.