Red Hat Bugzilla – Bug 703653
Publican uses several GB when compiling docbook with a large number of invalid xrefs
Last modified: 2014-08-04 18:25:31 EDT
The attached file is an example of a docbook book that includes a large number of invalid xrefs. This was the result of a tool that attempted to link topics together. Admittedly there are "a lot" of invalid xrefs, but trying to compile the book consumes at least 4 GB of memory (if not closer to 6 or 7 GB), which will kill a system that doesn't have that memory free.
Created attachment 498180 [details]
Run
publican build --langs=en-US --formats=html
on the attached book. Publican sucks up 4 or 5 GB of RAM.
By "kill" I mean "make so unresponsive that you are forced to reset".
Or, if you don't have a swapfile, the kernel will kill the publican job. Running two builds of that book in parallel results in:
Beginning work on en-US
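As a stopgap for the lock-up described above, the build can be launched under an address-space limit so that a runaway process fails on its own instead of taking the machine down. This is only a sketch, not anything Publican provides: `run_with_memory_cap` is a hypothetical helper, the 2 GiB figure is arbitrary, and it assumes a Linux system where `RLIMIT_AS` is enforced.

```python
import resource
import subprocess


def run_with_memory_cap(cmd, max_bytes):
    """Run cmd with its virtual address space capped at max_bytes.

    Hypothetical helper for illustration; assumes Linux, where
    RLIMIT_AS is enforced by the kernel.
    """

    def limit_memory():
        # Runs in the child just before exec. Only the soft limit is
        # lowered, and it is never set above the existing hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = max_bytes if hard == resource.RLIM_INFINITY else min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    return subprocess.run(cmd, preexec_fn=limit_memory)


# Example invocation (the command is the one from this report; the
# 2 GiB cap is an arbitrary choice):
# result = run_with_memory_cap(
#     ["publican", "build", "--langs=en-US", "--formats=html"],
#     2 * 1024**3,
# )
# print("exit status:", result.returncode)
```

With the cap in place, an allocation beyond the limit fails inside the child process, so the build should exit with an error rather than push the system into swap.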
Even one invalid xref invalidates the DocBook XML, ensuring Publican won’t build the book.
Lots and lots of invalid xrefs just make the XML that much more invalid.
Remove the invalid xrefs and the book will build; the memory consumption won’t occur and the system killing won’t happen.
Then it seems to me that reasonable behavior would be for Publican to fail at the first invalid xref and not to continue using system resources.
How about stopping at the first instance of invalidity, rather than killing my system?
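Until something like that exists in Publican, the "stop at the first invalid xref" behavior can be approximated outside it with a cheap pre-check run before the build. The following is a minimal sketch, not Publican code: `first_dangling_xref` is a hypothetical helper, and it assumes the book is available as a single XML string with any xi:includes already resolved and no undeclared entities.

```python
import xml.etree.ElementTree as ET

# ElementTree's expanded name for the xml:id attribute (DocBook 5).
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def first_dangling_xref(xml_text):
    """Return the linkend of the first xref with no matching id, else None.

    Hypothetical pre-check for illustration; it stops at the first
    problem instead of collecting every error.
    """
    root = ET.fromstring(xml_text)

    # Collect every id (DocBook 4) or xml:id (DocBook 5) in the document.
    ids = set()
    for el in root.iter():
        for attr in ("id", XML_ID):
            if attr in el.attrib:
                ids.add(el.attrib[attr])

    # Walk the xrefs in document order and bail out at the first bad one.
    for el in root.iter():
        if isinstance(el.tag, str) and local_name(el.tag) == "xref":
            target = el.get("linkend")
            # xrefs without a linkend are skipped; this sketch only
            # looks for targets that do not exist.
            if target is not None and target not in ids:
                return target
    return None


book = (
    '<book><chapter id="intro"><para>'
    '<xref linkend="intro"/><xref linkend="missing"/>'
    "</para></chapter></book>"
)
print(first_dangling_xref(book))
```

Because the check only builds a set of ids and scans the xrefs once, its memory use should stay proportional to the document size rather than to the number of errors.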
It's either xmllint or xsltproc doing it. If a document has invalid links, xsltproc probably doesn't get called, because xmllint will fail it; so probably xmllint.
lol at the "Doing that hurts? Don't do that then" prescription.
FWIW this is probably XML::LibXML::Error spamming error nodes. Finding out where it's doing that and limiting it would take quite a lot of effort, so it's not likely this would get done any time soon.
AFAICT there is no option in LibXML to stop at the first error found. It might be catchable in XML::LibXML::Error, but again, finding out where and how to do that is a significant development effort, so it would not happen in any reasonable time frame.
If this is having a significant impact, then it'd be worth it for the people it's affecting to raise these changes with the upstream XML::LibXML maintainer at CPAN.