Bug 863166 - XML::LibXML doesn't load entities DTD [NEEDINFO]
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libxml2
Version: 6.5
Hardware/OS: All / All
Priority: medium, Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Daniel Veillard
QA Contact: Miloš Prchlík
Keywords: Regression
Blocks: 835616
Reported: 2012-10-04 11:07 EDT by Tim Brody
Modified: 2014-01-07 07:51 EST
Fixed In Version: libxml2-2.7.6-14.el6
Doc Type: Bug Fix
Last Closed: 2013-11-21 19:23:27 EST
Type: Bug
p.jeyaprabhu: needinfo?


Attachments:
patch the parser to restore external DTD parsing capability (382 bytes, patch), 2013-05-28 22:57 EDT, R.Scussat
patch the parser to restore external DTD parsing capability (384 bytes, patch), 2013-05-31 06:47 EDT, R.Scussat
Backport patch for rhel-6.5 branch (1.49 KB, patch), 2013-06-05 10:17 EDT, Daniel Veillard
modified patch (1.46 KB, patch), 2013-06-05 11:34 EDT, R.Scussat

Description Tim Brody 2012-10-04 11:07:36 EDT
Description of problem:

When XML::LibXML parses an XML file whose entities are defined in an external DTD, the DTD is not loaded and the parse fails with a missing-entity error.

Version-Release number of selected component (if applicable):

libxml2.x86_64 0:2.7.6-8.el6_3.3
perl-XML-LibXML-1.70-5.el6.x86_64

How reproducible:

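#!/usr/bin/perl

use strict;
use warnings;
use XML::LibXML;

print XML::LibXML->VERSION, "\n";

# Open the phrase file and stop early if it cannot be read.
open(my $fh, "<", "lib/lang/en/phrases/system.xml")
    or die "cannot open system.xml: $!";

warn "parser";
my $parser = XML::LibXML->new;
warn "expand_entities";
$parser->expand_entities( 1 );
warn "parse_fh";
# The second argument is the base URI used to resolve relative references.
$parser->parse_fh( $fh, "lib/" );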

Plus the following input files:
http://trac.eprints.org/eprints/export/7899/branches/3.3/system/lib/lang/en/phrases/system.xml
http://trac.eprints.org/eprints/export/7899/branches/3.3/system/lib/entities.dtd

Actual results:

Missing entity 'nbsp'.

Expected results:

(no output)

Additional info:

This is a reduced test-case from EPrints, which has been broken by the current release of libxml2.

Downgrading to the previous version of libxml2 or installing the vanilla 2.7.8 (with RHEL patch for library compat) resolves the issue.

There is no problem apparent with "xmllint" (which is even stranger, because it looks like XML::LibXML does the same thing as xmllint).
Comment 2 Brian Gregg 2013-03-07 10:24:41 EST
Would it be possible to raise the Severity of this to "Urgent"? If we ran yum update on our servers, it would break every one of our web sites that uses this.
Comment 3 Daniel Veillard 2013-03-07 20:55:34 EST
libxml2 does not fetch external entities by default; this is a
security-conscious default processing model. If you trust the parsed XML you
can activate loading of entities. At the C level this is one of the
parser options:

http://xmlsoft.org/html/libxml-parser.html#xmlParserOption
the option is
XML_PARSE_DTDLOAD = 4 : load the external subset

Note that missing entities should not be fatal errors at the parser level
when the external subset exists but is not loaded. They are warnings only;
the parser will continue to the end and deliver as much data as it can.

If you activate DTDLOAD, I strongly suggest also adding NONET if all the
resources are supposed to be on the local machine.

I don't know how to pass those options to the Perl bindings, but
I think this is possible.

   hope this helps,

Daniel
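
For reference, xmllint exposes the two options discussed in this comment as command-line flags: --loaddtd sets XML_PARSE_DTDLOAD and --nonet sets XML_PARSE_NONET. A minimal sketch, assuming xmllint is on the PATH; the function name check_local_xml is hypothetical and only for illustration:

```shell
#!/bin/sh
# Parse a trusted local document with the external subset loaded (--loaddtd)
# while refusing any network fetch (--nonet). --noout suppresses the
# serialised tree, so only parser diagnostics are printed.
# check_local_xml is an illustrative name, not part of any tool.
check_local_xml() {
    file=$1
    if ! command -v xmllint >/dev/null 2>&1; then
        echo "xmllint not installed" >&2
        return 127
    fi
    xmllint --loaddtd --nonet --noout "$file"
}

# Example: check_local_xml system.xml && echo "parsed cleanly"
```

The function returns xmllint's own exit status, so it can be used directly in a conditional.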
Comment 4 Tim Brody 2013-03-08 08:22:59 EST
XML::LibXML sets DTDLOAD by default.

This is a regression rooted somewhere in the RHEL backport patches against libxml2 2.7.6.

I've added a unit test in XML::LibXML that demonstrates the bug:
https://bitbucket.org/timbrody/perl-xml-libxml/commits/9189140762edc64474d8153fe18dd8198e0295b3

The bug occurs with parse_fh() (xmlCreatePushParserCtxt) and not parse_string (xmlIOParseDTD). 

Compiling XML::LibXML against libxml 2.7.8 on RHEL 6 passes all unit tests:

hg clone https://timbrody@bitbucket.org/timbrody/perl-xml-libxml
cd perl-xml-libxml
wget ftp://xmlsoft.org/libxml2/libxml2-2.7.8.tar.gz
tar xzf libxml2-2.7.8.tar.gz
export ROOT=`pwd`
pushd libxml2-2.7.8
./configure --prefix=$ROOT/usr
make && make install
popd
XMLPREFIX=`pwd`/usr perl Makefile.PL; LD_LIBRARY_PATH=`pwd`/usr/lib make test

vs.

perl Makefile.PL; make test
...
Entity: line 2: parser error : Entity 'foo' not defined
<X>&foo;</X>
Comment 5 Daniel Veillard 2013-03-08 10:03:23 EST
> XML::LibXML sets DTDLOAD by default.

 urgh ... horribly dangerous

I still don't understand which entry point XML::LibXML is using,
where you create a parser with XML_PARSE_DTDLOAD and that parser
then does not load the external subset. If I test with all parser
methods (push, tree, reader), the behaviour is consistent:

thinkpad:~/XML -> xmllint --push --noout system.xml
system.xml:396: parser error : Entity 'nbsp' not defined
age or title page. If there are more than four authors, click on the [More&nbsp;
                                                                               ^
system.xml:396: parser error : Entity 'nbsp' not defined
e page. If there are more than four authors, click on the [More&nbsp;input&nbsp;
                                                                               ^
system.xml:3006: parser error : Entity 'nbsp' not defined
g" border="0" class="ep_required" alt="Required" style="display: inline"/>&nbsp;
                                                                               ^
thinkpad:~/XML -> xmllint --loaddtd --push --noout system.xml
thinkpad:~/XML -> xmllint --loaddtd --noout system.xml
thinkpad:~/XML -> xmllint --loaddtd --stream system.xml
thinkpad:~/XML ->

xmllint --push does use xmlCreatePushParserCtxt(), which doesn't
take options, and then calls xmlCtxtUseOptions(ctxt, options)
to set the options on the newly created context.
If XML::LibXML calls xmlCreatePushParserCtxt() but doesn't call
xmlCtxtUseOptions(), then I doubt the XML_PARSE_DTDLOAD option will
be set.

Daniel
Comment 6 Tim Brody 2013-03-08 11:35:48 EST
Here's the relevant XS code:
https://timbrody@bitbucket.org/timbrody/perl-xml-libxml/src/9189140762edc64474d8153fe18dd8198e0295b3/LibXML.xs?at=default#cl-1749

That calls LibXML_init_parser, which invokes xmlCtxtUseOptions with parserOptions. parserOptions comes from Perl space and defaults to:
( XML_PARSE_NODICT | XML_PARSE_HUGE | XML_PARSE_DTDLOAD | XML_PARSE_NOENT )

Calling xmlCtxtUseOptions(ctxt, XML_PARSE_DTDLOAD ); just before the parse occurs (as expected) makes no difference.


But this is just repeating what I opened the bug with - I already know the same process works in xmllint. XML::LibXML worked before the 0:2.7.6-8.el6_3.3 release and works with 2.7.8. There's something wrong with the RHEL patches.
Comment 7 Tim Brody 2013-03-27 13:08:43 EDT
I've created a C unit test here:
https://github.com/timbrody/libxml_entity_bug

But my xmllint is now also failing with push-parser:

$ rpm -q libxml2
libxml2-2.7.6-12.el6_4.1.x86_64
$ xmllint --version
xmllint: using libxml version 20706
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib
$ xmllint --loaddtd --noout test.xml 
$ xmllint --push --loaddtd --noout test.xml 
test.xml:4: parser error : Entity 'nbsp' not defined
&nbsp;
      ^

vs. a compiled 2.7.8 on the same system
./xmllint --push --loaddtd --noout /home/user/libxml/test.xml
./xmllint --version
/root/libxml/libxml2-2.7.8/.libs/lt-xmllint: using libxml version 20708
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib


I've done some limited debugging of libxml2 and it looks like it is loading the entities but then isn't finding them in ->sax->getEntity(). So perhaps something about ctxt->userData ?
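
The comparison in this comment can be repeated with a throwaway fixture. The sketch below assumes xmllint is on the PATH; the file contents and the function names make_fixture, run_case and compare_parsers are illustrative and are not the files from the linked repository:

```shell
#!/bin/sh
# Write a document whose external DTD subset defines &nbsp;.
# Contents are a hypothetical minimal fixture.
make_fixture() {
    dir=$1
    mkdir -p "$dir" || return 1
    printf '%s\n' '<!ENTITY nbsp "&#160;">' > "$dir/entities.dtd"
    cat > "$dir/test.xml" <<'EOF'
<?xml version="1.0"?>
<!DOCTYPE phrases SYSTEM "entities.dtd">
<phrases>
&nbsp;
</phrases>
EOF
}

# Run xmllint on test.xml with the given flags and report success or failure.
run_case() {
    label=$1
    shift
    if xmllint "$@" --noout test.xml; then
        echo "$label: ok"
    else
        echo "$label: failed"
    fi
}

# Parse the same file through the default (tree) parser and the push parser.
compare_parsers() {
    if ! command -v xmllint >/dev/null 2>&1; then
        echo "xmllint not found; skipping" >&2
        return 0
    fi
    ( cd "$1" || exit 1
      run_case "tree parser" --loaddtd
      run_case "push parser" --push --loaddtd )
}

workdir=$(mktemp -d) && make_fixture "$workdir" && compare_parsers "$workdir"
```

On an affected build the push-parser case would be expected to report that entity 'nbsp' is not defined, as in the transcript above, while a build without the regression should report both cases as ok.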
Comment 9 R.Scussat 2013-05-28 21:59:52 EDT
My 2c: I have this problem too, and in my case it appeared after an upgrade.
Since the original libxml 2.7.6 compiled from source works fine on the same server, this is definitely an issue with the RH patches applied to the plain vanilla library.
I rebuilt from the SRPM and reverted the RH patches one by one, and found that the patch named "libxml2-More-fixups-on-the-push-parser-behaviour.patch" makes the difference.
Without that patch the library works fine again, but I have no idea what exactly it does; that part of the code has always been a real mess to me (sorry, no offense intended).
Maybe you will find this information useful...
Comment 10 R.Scussat 2013-05-28 22:57:13 EDT
Created attachment 754119 [details]
patch the parser to restore external DTD parsing capability

Well, I think I found what's wrong.
There is a small typo in parser.c; the patch is attached and should be applied after the RH patches.
Comment 12 Tim Brody 2013-05-31 06:04:04 EDT
I can confirm that R.Scussat's patch appears to resolve the issue, although I needed to fix some whitespace to get it to apply.
Comment 13 R.Scussat 2013-05-31 06:47:02 EDT
Created attachment 755228 [details]
patch the parser to restore external DTD parsing capability

Well, Tim Brody is right. My bad. The refined version is attached.
I forgot to mention it should be applied to the source package libxml2-2.7.6-12.el6_4.1.src.rpm.
Comment 15 Daniel Veillard 2013-06-04 17:12:32 EDT
Okay, thanks for the research of a proper patch, I will look to see
if this matches an upstream commit (my recollection is yes but more
places were changed). I will try to do this this week,

  thanks !

Daniel
Comment 16 Daniel Veillard 2013-06-05 10:15:39 EDT
Okay, I was able to find the problem by reproducing it. It is something which
was actually fixed in git upstream in October last year:

https://git.gnome.org/browse/libxml2/commit/?id=6c91aa384f48ff6d406553a6dd47fd556c1ef2e6

corresponding to upstream bug 
https://bugzilla.gnome.org/show_bug.cgi?id=684774

Daniel
Comment 17 Daniel Veillard 2013-06-05 10:17:52 EDT
Created attachment 757221 [details]
Backport patch for rhel-6.5 branch

This is the patch backported from upstream. There was just a small conflict
in the backport, and it solves the problem exposed by
   xmllint --push --loaddtd --noout test.xml

Daniel
Comment 18 R.Scussat 2013-06-05 11:34:50 EDT
Created attachment 757268 [details]
modified patch

Your patch is obviously OK, but it doesn't apply cleanly (Hunk #1 FAILED at 11473).
Some inconsistent whitespace is involved (some lines of the original code are aligned with plain spaces, others with TABs, others with a mix).
The attached reworked patch fixes this issue and applies cleanly to the source package libxml2-2.7.6-12.el6_4.1.src.rpm.
Comment 19 Daniel Veillard 2013-06-05 17:04:53 EDT
I generated rpms for testing using my patch (working for me!). They are not official and
didn't go through QE, but could be useful for testing and working around the issue. They are at:
ftp://xmlsoft.org/libxml2/test/

-rw-r--r-- 1 veillard www 4890057 Jun  5 22:52 libxml2-2.7.6-14.el6.src.rpm
-rw-rw-r-- 1 veillard www 1818796 Jun  5 23:02 libxml2-2.7.6-14.el6.x86_64.rpm
-rw-rw-r-- 1 veillard www 1417513 Jun  5 23:02 libxml2-devel-2.7.6-14.el6.x86_64.rpm
-rw-rw-r-- 1 veillard www  491354 Jun  5 23:02 libxml2-python-2.7.6-14.el6.x86_64.rpm
-rw-rw-r-- 1 veillard www  702171 Jun  5 23:02 libxml2-static-2.7.6-14.el6.x86_64.rpm

Daniel
Comment 26 Daniel Veillard 2013-10-10 10:02:06 EDT
libxml2-2.7.6-14.el6 has been built with the fix.

Daniel
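
To see whether a given libxml2 build is at or past the fixed version named here, the version-release strings can be compared. This is a rough sketch: it assumes GNU sort with -V, which only approximates RPM's own comparison rules, and the function name libxml2_status is hypothetical. Builds older than the regressing update also predate the fix without being affected.

```shell
#!/bin/sh
# Fixed build named in this report.
FIXED="2.7.6-14.el6"

# Print "has-fix" if version-release $1 sorts at or after $FIXED,
# otherwise "predates-fix". libxml2_status is an illustrative name.
libxml2_status() {
    installed=$1
    lowest=$(printf '%s\n%s\n' "$installed" "$FIXED" | sort -V | head -n 1)
    if [ "$installed" = "$FIXED" ] || [ "$lowest" = "$FIXED" ]; then
        echo has-fix
    else
        echo predates-fix
    fi
}

# On an RPM-based host (assumption: rpm and a single libxml2 package):
#   libxml2_status "$(rpm -q --qf '%{VERSION}-%{RELEASE}' libxml2)"
libxml2_status "2.7.6-12.el6_4.1"   # prints predates-fix
libxml2_status "2.7.6-14.el6"       # prints has-fix
```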
Comment 29 errata-xmlrpc 2013-11-21 19:23:27 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1737.html
Comment 30 Jeyaprabhu Palanichamy 2014-01-07 07:51:56 EST
On a Linux box with the following packages installed:

libxml2-devel-2.7.6-14.el6.x86_64
libxml2-2.7.6-14.el6.x86_64
libxml2-python-2.7.6-14.el6.x86_64


During XML parsing the DTD entities are being removed. My example program and its output are below. Please help urgently with this.



XML file name: test.xml

<?xml version="1.0" encoding="utf-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>
&lt;div style="background-color:#ff0000"&gt;&amp;nbsp;Jum&amp;nbsp;&lt;/div&gt;
</body>
</note>

File name: readxml.php

<?php
// Create the parser from PHP's xml extension.
$parser = xml_parser_create();

// Character data handler: echo each chunk of text after a line break.
function char($parser, $data) {
    echo "<br/>" . $data;
}

xml_set_character_data_handler($parser, "char");

$fp = fopen("test.xml", "rt");

// Feed the file to the parser in 4096-byte chunks.
while ($data = fread($fp, 4096)) {
    xml_parse($parser, $data, feof($fp)) or
        die(sprintf("XML Error: %s at line %d",
            xml_error_string(xml_get_error_code($parser)),
            xml_get_current_line_number($parser)));
}

xml_parser_free($parser);
?>


Current output:

    div style="background-color:#ff0000"nbsp;Jumnbsp;/div 

Expected output:

   <div style="background-color:#ff0000">&nbsp;Jum&nbsp;</div>
