Description of problem:
lxml.etree.XMLParser has a keyword argument recover=True, documented as "try hard to parse through broken XML". One behaviour of that option, which our code was relying on, is that parsing an empty file should succeed and result in a tree with no root element. However on RHEL7 this is raising an exception instead.

Version-Release number of selected component (if applicable):
python-lxml-3.2.1-4.el7.x86_64
libxml2-2.9.1-6.el7_2.2.x86_64

How reproducible:
easily

Steps to Reproduce:
>>> import lxml.etree
>>> parser = lxml.etree.XMLParser(recover=True)
>>> lxml.etree.parse('/dev/null', parser)

Actual results:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 3197, in lxml.etree.parse (src/lxml/lxml.etree.c:64816)
  File "parser.pxi", line 1571, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:92729)
  File "parser.pxi", line 1600, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:93013)
  File "parser.pxi", line 1500, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:92076)
  File "parser.pxi", line 1047, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:88976)
  File "parser.pxi", line 577, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:84385)
  File "parser.pxi", line 676, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:85488)
  File "parser.pxi", line 616, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:84811)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1

Expected results:
>>> import lxml.etree
>>> parser = lxml.etree.XMLParser(recover=True)
>>> tree = lxml.etree.parse('/dev/null', parser)
>>> print tree
<lxml.etree._ElementTree object at 0x7f0b5e658ea8>
>>> print tree.getroot()
None

Additional info:
It seems this regressed from libxml2-2.9.1-5.el7_1.2.x86_64.rpm to libxml2-2.9.1-6.el7_2.2.x86_64, which was a bunch of CVE fixes in libxml2.
I'm not sure whether lxml or libxml2 is at fault here. I've assigned this bug to python-lxml initially because that is where we are seeing the wrong behaviour from our side.
(In reply to Dan Callaghan from comment #0)
> It seems this regressed from libxml2-2.9.1-5.el7_1.2.x86_64.rpm to
> libxml2-2.9.1-6.el7_2.2.x86_64, which was a bunch of CVE fixes in libxml2.

Does it ring a bell, Daniel? (To me, none of the CVEs' names look related.)

We haven't updated python-lxml in RHEL-7 yet. Moving to libxml2 as python-lxml is a low-profile component and is unlikely to see any update.
Hi, this is Daniel Veillard, author of libxml2.

Seems you are parsing XML, not HTML. Using recover is an abuse of the spec, and I threatened to remove it if people were using it casually instead of just for data recovery in cases where one accepts data loss or invalid data (which the XML spec goes to great lengths to avoid, as this was a design goal).

https://www.w3.org/TR/REC-xml/#NT-document

[1] document ::= prolog element Misc*

defines what an XML document is; you can't derive an empty string from it (exercise left to the reader). So libxml2 *MUST* raise a fatal error, lxml does accordingly, and the error even seems absolutely proper.

1/ you are abusing a corner case of the libxml2 API
2/ your document MUST raise a fatal error
3/ libxml2/lxml does so

=> This is not a bug, it's basic compliance with the XML spec, so closing accordingly. Fix your software to not use recover by default, and second, handle that exception, as the parser is mandated by the spec to raise it :-)

The real bug is that somehow that error wasn't raised before. Just detect the empty string and never invoke the parser on it.

Daniel Veillard
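
For anyone hitting the same thing, the suggestion above (detect the empty input first and never hand it to the parser) might look roughly like the sketch below. The helper name parse_or_none is hypothetical, not part of lxml, and the standard-library fallback import is only an assumption so that the sketch also runs where lxml is not installed.

```python
import os

try:
    from lxml import etree  # the library this report is about
except ImportError:
    # Assumption: fall back to the standard library so the sketch
    # still runs on a machine without lxml.
    import xml.etree.ElementTree as etree


def parse_or_none(path):
    """Return a parsed tree, or None if the file is empty.

    Hypothetical helper, not part of lxml.
    """
    # An empty file can never be a well-formed XML document, so the
    # parser is not invoked on it at all.
    if os.path.getsize(path) == 0:
        return None
    # No recover=True here: malformed input raises etree.ParseError
    # (lxml's XMLSyntaxError is a subclass of it), which the caller
    # is expected to handle.
    return etree.parse(path)
```

With this, parse_or_none('/dev/null') returns None instead of raising. A file containing only whitespace is not empty by size, so it would still raise the parse error; callers that want to tolerate that case would need to catch etree.ParseError themselves.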