1298018 – parsing empty file with XMLParser(recover=True) raises lxml.etree.XMLSyntaxError: Document is empty

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	libxml2
Sub Component:
Version:	7.2
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Daniel Veillard
QA Contact:	qe-baseos-tools-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-01-13 02:03 UTC by Dan Callaghan
Modified:	2019-03-06 02:45 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-09-13 20:07:06 UTC
Target Upstream Version:
Embargoed:

Description Dan Callaghan 2016-01-13 02:03:45 UTC

Description of problem:
lxml.etree.XMLParser has a keyword argument recover=True, documented as "try hard to parse through broken XML". One behaviour of that option, which our code was relying on, is that parsing an empty file succeeds and results in a tree with no root element. However, on RHEL 7 this now raises an exception instead.

Version-Release number of selected component (if applicable):
python-lxml-3.2.1-4.el7.x86_64
libxml2-2.9.1-6.el7_2.2.x86_64

How reproducible:
easily

Steps to Reproduce:
>>> import lxml.etree
>>> parser = lxml.etree.XMLParser(recover=True)
>>> lxml.etree.parse('/dev/null', parser)

Actual results:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 3197, in lxml.etree.parse (src/lxml/lxml.etree.c:64816)
  File "parser.pxi", line 1571, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:92729)
  File "parser.pxi", line 1600, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:93013)
  File "parser.pxi", line 1500, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:92076)
  File "parser.pxi", line 1047, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:88976)
  File "parser.pxi", line 577, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:84385)
  File "parser.pxi", line 676, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:85488)
  File "parser.pxi", line 616, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:84811)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1

Expected results:
>>> import lxml.etree
>>> parser = lxml.etree.XMLParser(recover=True)
>>> tree = lxml.etree.parse('/dev/null', parser)
>>> print tree
<lxml.etree._ElementTree object at 0x7f0b5e658ea8>
>>> print tree.getroot()
None

Additional info:
It seems this regressed from libxml2-2.9.1-5.el7_1.2.x86_64.rpm to libxml2-2.9.1-6.el7_2.2.x86_64, which was a bunch of CVE fixes in libxml2. I'm not sure whether lxml or libxml2 is at fault here. I've assigned this bug to python-lxml initially because that is where we are seeing the wrong behaviour from our side.

Comment 3 Jiri Popelka 2016-01-13 13:32:01 UTC

(In reply to Dan Callaghan from comment #0)
> It seems this regressed from libxml2-2.9.1-5.el7_1.2.x86_64.rpm to
> libxml2-2.9.1-6.el7_2.2.x86_64, which was a bunch of CVE fixes in libxml2.

Does it ring a bell, Daniel? (To me, none of the CVE names look related.)
We haven't updated python-lxml in RHEL-7 yet.
Moving to libxml2, as python-lxml is a low-profile component and is unlikely to see any update.

Comment 5 Daniel Veillard 2016-09-13 20:07:06 UTC

  Hi, this is Daniel Veillard, author of libxml2.

Seems you are parsing XML, not HTML. Using recover is an abuse of the
spec, and I threatened to remove it if people were using it casually
instead of just for data recovery in the event one accepts data loss
or invalid data (which the XML spec goes to great lengths to avoid, as
this was a design goal).

https://www.w3.org/TR/REC-xml/#NT-document

[1]   	document	   ::=   	prolog element Misc*

defines what an XML document is; you can't derive an empty string
from it (exercise left to the reader). So libxml2 *MUST* raise
a fatal error, and lxml does so accordingly; the error even seems
absolutely proper.
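
The derivation argument can be checked against any conforming parser. As a minimal sketch, assuming Python's standard-library xml.etree.ElementTree (expat-based, not libxml2) as a stand-in and a hypothetical helper named is_well_formed, the empty string is rejected while a single empty element, with an empty prolog and no trailing Misc, is accepted:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as an XML document, False otherwise.

    Hypothetical helper for illustration; uses the stdlib expat-based
    parser rather than libxml2/lxml.
    """
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(""))      # empty input: no root element can be derived
print(is_well_formed("<a/>"))  # the prolog may be empty; one element suffices
```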

  You:
    1/ you are abusing a corner case of the libxml2 API
    2/ your document MUST raise a fatal error
    3/ libxml2/lxml does so

  => This is not a bug, it's basic compliance with the XML spec

 So closing accordingly. Fix your software, first to not use recover by
default and second to handle that exception, as the parser is mandated
by the spec to raise it :-)
  The real bug is that somehow that error wasn't raised before;
just detect the empty string and never invoke the parser on it.


Daniel Veillard
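
The advice in comment 5 (never invoke the parser on empty input, and let the exception surface otherwise) might look like the following on the caller's side. This is only a sketch: parse_or_none and its parse_func parameter are hypothetical names, not part of lxml, and treating whitespace-only input as empty is an assumption. lxml is imported lazily so that the empty-input path does not depend on it.

```python
import io


def parse_or_none(path, parse_func=None):
    """Parse the XML file at `path`; return None if it holds no document.

    Hypothetical helper: `parse_func` takes a binary file-like object and
    returns a tree. It defaults to lxml.etree.parse.
    """
    with open(path, "rb") as f:
        data = f.read()
    # Detect the empty case up front and never invoke the parser on it.
    # Assumption: whitespace-only input is treated as empty as well.
    if not data.strip():
        return None
    if parse_func is None:
        # Default parser, without recover=True, so malformed input still
        # raises lxml.etree.XMLSyntaxError for the caller to handle.
        import lxml.etree
        parse_func = lxml.etree.parse
    return parse_func(io.BytesIO(data))
```

A caller would then check for None where it previously checked tree.getroot() for None, and wrap the call in try/except lxml.etree.XMLSyntaxError for input that is non-empty but malformed.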
