Bug 703653 - Publican uses several GB when compiling docbook with a large number of invalid xrefs
Product: Publican
Classification: Community
Component: publican
Hardware: x86_64 Linux
Priority: unspecified, Severity: high
Assigned To: Brian Forte
Ruediger Landmann
Depends On:
Reported: 2011-05-10 20:04 EDT by Matthew Casperson
Modified: 2014-08-04 18:25 EDT (History)
Doc Type: Bug Fix
Last Closed: 2011-05-11 00:41:46 EDT

Attachments
Sample docbook (728.05 KB, application/zip)
2011-05-10 20:05 EDT, Matthew Casperson
Description Matthew Casperson 2011-05-10 20:04:36 EDT
The attached file is an example of a docbook book that includes a large number of invalid xrefs. This was the result of a tool that attempted to link topics together. Admittedly there are "a lot" of invalid xrefs, but trying to compile the book consumes at least 4 GB of memory (if not closer to 6 or 7 GB), which will kill a system that doesn't have that memory free.
Comment 1 Matthew Casperson 2011-05-10 20:05:25 EDT
Created attachment 498180
Sample docbook
Comment 2 Joshua Wulf 2011-05-10 20:21:17 EDT
To reproduce:

publican build --langs=en-US --formats=html

on the attached book. Publican sucks up 4 or 5 GB of RAM.
Comment 3 Matthew Casperson 2011-05-10 20:31:19 EDT
By "kill" I mean "make so unresponsive that you are forced to reset".
Comment 4 Joshua Wulf 2011-05-10 20:53:41 EDT
Or, if you don't have a swapfile, the kernel will kill the publican job. Running two builds of that book in parallel results in:

Beginning work on en-US
Validation failed: 
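
One way to keep a runaway build from taking the whole machine down is to run it under an address-space limit, so the build fails instead of exhausting RAM and swap. The following is a minimal sketch, not something Publican provides: `run_with_memory_cap` is a hypothetical helper, it relies on the Unix-only `resource` module, and it assumes the build aborts on a failed allocation rather than handling it some other way.

```python
import resource
import subprocess
import sys


def run_with_memory_cap(cmd, max_bytes):
    """Run cmd with its address space limited to max_bytes; return its exit code."""

    def limit_address_space():
        # Runs in the child just before exec. The limit is inherited by
        # any processes the command itself spawns.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    return subprocess.run(cmd, preexec_fn=limit_address_space).returncode
```

For example, `run_with_memory_cap(["publican", "build", "--langs=en-US", "--formats=html"], 2 * 1024**3)` would run the build from Comment 2 with a 2 GiB cap. The cap value is an arbitrary choice for illustration.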
Comment 5 Brian Forte 2011-05-11 00:41:46 EDT
Even one invalid xref invalidates the DocBook XML, ensuring Publican won’t build the book.

Lots and lots of invalid xrefs just make the XML that much more invalid.

Remove the invalid xrefs and the book will build; the memory consumption and the system killing won’t happen.
Comment 6 Misty Stanley-Jones 2011-05-11 00:56:49 EDT
Then it seems to me that reasonable behavior would be for Publican to fail at the first invalid xref and not to continue using system resources.
Comment 7 Joshua Wulf 2011-05-11 01:00:58 EDT

How about stopping at the first instance of invalidity, rather than killing my system?
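
Until Publican itself stops early, a pre-check run before the build could give the fail-fast behavior requested here. The sketch below is an assumption-laden illustration, not Publican's validation: `collect_ids` and `first_invalid_xref` are hypothetical names, the check only compares `xref` `linkend` values against `id` / `xml:id` attributes, and it assumes each XML file parses on its own (files that use entities defined in an external `.ent` file would raise a parse error with this parser).

```python
import xml.etree.ElementTree as ET

# Expanded name of the xml:id attribute used by DocBook 5.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local_name(tag):
    """Strip a namespace such as '{http://docbook.org/ns/docbook}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def collect_ids(xml_files):
    """Gather every 'id' (DocBook 4) or 'xml:id' (DocBook 5) value in the files."""
    ids = set()
    for path in xml_files:
        for elem in ET.parse(path).iter():
            for attr in ("id", XML_ID):
                if attr in elem.attrib:
                    ids.add(elem.attrib[attr])
    return ids


def first_invalid_xref(xml_files):
    """Return (file, linkend) for the first xref with no matching id, else None."""
    xml_files = [str(p) for p in xml_files]
    ids = collect_ids(xml_files)
    for path in xml_files:
        for elem in ET.parse(path).iter():
            if local_name(elem.tag) == "xref" and elem.get("linkend") not in ids:
                # Stop at the first problem instead of accumulating errors.
                return path, elem.get("linkend")
    return None
```

Calling `first_invalid_xref` on the book's XML files and skipping the build whenever it returns something other than `None` avoids starting the expensive validation at all.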
Comment 8 Joshua Wulf 2011-05-11 01:08:17 EDT
It's either xmllint or xsltproc doing it. If a document has invalid links, xsltproc probably doesn't get called, because xmllint will fail it; so probably xmllint.
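
To test a guess like this, each tool could be run on its own and its peak memory compared. A rough sketch, assuming a Linux system (where `ru_maxrss` is reported in kilobytes); `peak_child_rss_kb` is a hypothetical helper, and the figure it returns covers all child processes this Python process has waited for, so it is best called once per fresh interpreter.

```python
import resource
import subprocess
import sys


def peak_child_rss_kb(cmd):
    """Run cmd to completion; return (exit_code, peak resident set size in KB)."""
    proc = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    # RUSAGE_CHILDREN reports the largest peak among waited-for children.
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return proc.returncode, usage.ru_maxrss
```

For instance, `peak_child_rss_kb(["xmllint", "--noout", "--xinclude", "--postvalid", "Book.xml"])` would measure a standalone validation run, where `Book.xml` is a placeholder for the book's top-level file. This measures the command-line tool only, which may not behave the same way as validation performed inside Publican.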
Comment 9 Joshua Wulf 2011-05-11 01:08:54 EDT
lol at the "Doing that hurts? Don't do that then" prescription.
Comment 10 Jeff Fearn 2011-05-11 03:16:49 EDT
FWIW this is probably XML::LibXML::Error spamming error nodes. Finding out where it's doing that and limiting it would take quite a lot of effort, so it's not likely this would get done any time soon.

AFAICT there is no option in LibXML to stop at the first error found. It's possible it might be catchable in XML::LibXML::Error, but again, finding out where and how to do that is a significant development effort, so it would not happen in any reasonable time frame.

If this is having a significant impact then it'd be worth it for the people it's affecting to ask about these changes with the upstream XML::LibXML maintainer at CPAN.

Cheers, Jeff.
