Bug 1298018

Summary:	parsing empty file with XMLParser(recover=True) raises lxml.etree.XMLSyntaxError: Document is empty
Product:	Red Hat Enterprise Linux 7	Reporter:	Dan Callaghan <dcallagh>
Component:	libxml2	Assignee:	Daniel Veillard <veillard>
Status:	CLOSED NOTABUG	QA Contact:	qe-baseos-tools-bugs
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	7.2	CC:	jpopelka, lmiksik, ohudlick, tlavigne, veillard
Target Milestone:	rc	Keywords:	Regression
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-09-13 20:07:06 UTC	Type:	Bug

Description Dan Callaghan 2016-01-13 02:03:45 UTC

Description of problem:
lxml.etree.XMLParser has a keyword argument recover=True, documented as "try hard to parse through broken XML". One behaviour of that option, which our code was relying on, is that parsing an empty file succeeds and results in a tree with no root element. However, on RHEL 7 this now raises an exception instead.

Version-Release number of selected component (if applicable):
python-lxml-3.2.1-4.el7.x86_64
libxml2-2.9.1-6.el7_2.2.x86_64

How reproducible:
easily

Steps to Reproduce:
>>> import lxml.etree
>>> parser = lxml.etree.XMLParser(recover=True)
>>> lxml.etree.parse('/dev/null', parser)

Actual results:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 3197, in lxml.etree.parse (src/lxml/lxml.etree.c:64816)
  File "parser.pxi", line 1571, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:92729)
  File "parser.pxi", line 1600, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:93013)
  File "parser.pxi", line 1500, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:92076)
  File "parser.pxi", line 1047, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:88976)
  File "parser.pxi", line 577, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:84385)
  File "parser.pxi", line 676, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:85488)
  File "parser.pxi", line 616, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:84811)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1

Expected results:
>>> import lxml.etree
>>> parser = lxml.etree.XMLParser(recover=True)
>>> tree = lxml.etree.parse('/dev/null', parser)
>>> print tree
<lxml.etree._ElementTree object at 0x7f0b5e658ea8>
>>> print tree.getroot()
None

Additional info:
It seems this regressed from libxml2-2.9.1-5.el7_1.2.x86_64.rpm to libxml2-2.9.1-6.el7_2.2.x86_64, which was a bunch of CVE fixes in libxml2. I'm not sure whether lxml or libxml2 is at fault here. I've assigned this bug to python-lxml initially because that is where we are seeing the wrong behaviour from our side.

Comment 3 Jiri Popelka 2016-01-13 13:32:01 UTC

(In reply to Dan Callaghan from comment #0)
> It seems this regressed from libxml2-2.9.1-5.el7_1.2.x86_64.rpm to
> libxml2-2.9.1-6.el7_2.2.x86_64, which was a bunch of CVE fixes in libxml2.

Does this ring a bell, Daniel? (To me, none of the CVE names look related.)
We haven't updated python-lxml in RHEL-7 yet.
Moving to libxml2, as python-lxml is a low-profile component and is unlikely to see any update.

Comment 5 Daniel Veillard 2016-09-13 20:07:06 UTC

  Hi, this is Daniel Veillard, author of libxml2.

Seems you are parsing XML, not HTML. Using recover is an abuse of the
spec, and I threatened to remove it if people were using it casually
instead of just for data recovery in the event one accepts data loss
or invalid data (which the XML spec goes to great lengths to avoid, as
this was a design goal).

https://www.w3.org/TR/REC-xml/#NT-document

[1]   	document	   ::=   	prolog element Misc*

defines what an XML document is; you can't derive an empty string
from it (exercise left to the reader). So libxml2 *MUST* raise
a fatal error, and lxml does so accordingly; the error even seems
absolutely proper.
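
As a rough illustration of the production quoted above, the following sketch feeds a few inputs to a parser and records which ones are accepted. It is only a sketch under an assumption: it uses Python's standard-library xml.etree.ElementTree (backed by expat) as a stand-in for libxml2/lxml, and the sample strings and the names `candidates` and `accepted` are made up for this example. The point is simply that input without a root element does not match `prolog element Misc*`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample inputs; only the last one contains a root element.
candidates = {
    "empty string": "",
    "prolog only": '<?xml version="1.0"?>',
    "comment only": "<!-- no root element -->",
    "prolog + element": '<?xml version="1.0"?><root/>',
}

accepted = {}
for label, text in candidates.items():
    try:
        ET.fromstring(text)
        accepted[label] = True
    except ET.ParseError:
        # The parser reports a fatal error for input with no element.
        accepted[label] = False

for label, ok in accepted.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```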

  You:
    1/ you are abusing a corner case of the libxml2 API
    2/ your document MUST raise a fatal error
    3/ libxml2/lxml does so

  => This is not a bug; it's basic compliance with the XML spec.

 So closing accordingly. First, fix your software to not use recover by
default, and second, handle that exception, as the parser is mandated by
the spec to raise it :-)
  The real bug is that somehow that error wasn't raised before.
Just detect the empty string and never invoke the parser on it.


Daniel Veillard
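
The advice in comment 5 (detect the empty input, never invoke the parser on it, and handle the exception otherwise) could be sketched roughly as below. This is an illustration, not the reporter's actual fix: the helper name `parse_if_not_empty`, the temporary file names, and the choice to return None for empty input are all assumptions made for the example. It imports lxml when available and otherwise falls back to the standard-library ElementTree as a stand-in; both expose `parse()` and a `ParseError` class (in lxml, XMLSyntaxError is a subclass of ParseError).

```python
import io
import os
import tempfile

try:
    from lxml import etree  # the library this report is about
except ImportError:
    # Stand-in so the sketch still runs where lxml is not installed.
    import xml.etree.ElementTree as etree


def parse_if_not_empty(path):
    """Hypothetical helper: return a parsed tree, or None for empty input."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.strip():
        # Empty (or whitespace-only) input is not an XML document,
        # so the parser is never invoked on it.
        return None
    # No recover=True here: malformed input raises a parse error.
    return etree.parse(io.BytesIO(data))


with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty.xml")
    good = os.path.join(tmp, "good.xml")
    bad = os.path.join(tmp, "bad.xml")
    for name, content in ((empty, b""), (good, b"<a/>"), (bad, b"<a>")):
        with open(name, "wb") as f:
            f.write(content)

    empty_result = parse_if_not_empty(empty)
    good_root_tag = parse_if_not_empty(good).getroot().tag

    # Callers handle the exception rather than asking the parser to recover.
    try:
        parse_if_not_empty(bad)
        malformed_rejected = False
    except etree.ParseError:
        malformed_rejected = True
```

Returning None for the empty case keeps the caller's old check (`tree.getroot() is None`) easy to adapt, but raising a dedicated exception would be an equally reasonable choice.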