Description of problem:
An XML file whose entities are defined in an external DTD fails to parse in XML::LibXML: the DTD is not loaded, which causes a missing-entity error.

Version-Release number of selected component (if applicable):
libxml2.x86_64 0:2.7.6-8.el6_3.3
perl-XML-LibXML-1.70-5.el6.x86_64

How reproducible:

#!/usr/bin/perl

use XML::LibXML;

print XML::LibXML->VERSION, "\n";

open(my $fh, "<", "lib/lang/en/phrases/system.xml");
warn "parser";
my $parser = XML::LibXML->new;
warn "expand_entities";
$parser->expand_entities( 1 );
warn "parse_fh";
$parser->parse_fh( $fh, "lib/" );

Plus:
http://trac.eprints.org/eprints/export/7899/branches/3.3/system/lib/lang/en/phrases/system.xml
http://trac.eprints.org/eprints/export/7899/branches/3.3/system/lib/entities.dtd

Actual results:
Missing entity 'nbsp'.

Expected results:
(no output)

Additional info:
This is a reduced test case from EPrints, which has been broken by the current release of libxml2. Downgrading to the previous version of libxml2, or installing the vanilla 2.7.8 (with the RHEL patch for library compatibility), resolves the issue.

There is no apparent problem with "xmllint", which is even stranger, because it looks like XML::LibXML does the same thing as xmllint.
Would it be possible to raise the Severity of this to "Urgent"? If we ran yum update on our servers, it would break at least all of our web sites that use this.
libxml2 does not fetch external entities by default; this is a security-conscious default processing model. If you trust the parsed XML then you can activate loading of entities. At the C level this is given as one of the parser options:

http://xmlsoft.org/html/libxml-parser.html#xmlParserOption

The option is XML_PARSE_DTDLOAD = 4 : load the external subset.

Note that missing entities should not be fatal errors at the parser level when the external subset exists and is not loaded. They are warnings only; the parser will continue to the end and deliver as much data as it can.

If you activate DTDLOAD, I strongly suggest also adding NONET if all the resources are supposed to be on the local machine.

I don't know how to pass those options to the Perl bindings, but I think this is possible.

Hope this helps,

Daniel
XML::LibXML sets DTDLOAD by default.

This is a regression rooted somewhere in the RHEL backport patches against libxml2 2.7.6.

I've added a unit test in XML::LibXML that demonstrates the bug:
https://bitbucket.org/timbrody/perl-xml-libxml/commits/9189140762edc64474d8153fe18dd8198e0295b3

The bug occurs with parse_fh() (xmlCreatePushParserCtxt) and not parse_string (xmlIOParseDTD).

Compiling XML::LibXML against libxml 2.7.8 on RHEL 6 passes all unit tests:

hg clone https://timbrody@bitbucket.org/timbrody/perl-xml-libxml
cd perl-xml-libxml
wget ftp://xmlsoft.org/libxml2/libxml2-2.7.8.tar.gz
tar xzf libxml2-2.7.8.tar.gz
export ROOT=`pwd`
pushd libxml2-2.7.8
./configure --prefix=$ROOT/usr
make && make install
popd
XMLPREFIX=`pwd`/usr perl Makefile.PL; LD_LIBRARY_PATH=`pwd`/usr/lib make test

vs.

perl Makefile.PL; make test
...
Entity: line 2: parser error : Entity 'foo' not defined
<X>&foo;</X>
> XML::LibXML sets DTDLOAD by default.

urgh ... horribly dangerous

I still don't understand which entry point XML::LibXML is using where you create a parser with XML_PARSE_DTDLOAD and that parser then does not load the external subset, because when I test with all parser methods (push, tree, reader) the behaviour is consistent:

thinkpad:~/XML -> xmllint --push --noout system.xml
system.xml:396: parser error : Entity 'nbsp' not defined
age or title page. If there are more than four authors, click on the [More
^
system.xml:396: parser error : Entity 'nbsp' not defined
e page. If there are more than four authors, click on the [More input
^
system.xml:3006: parser error : Entity 'nbsp' not defined
g" border="0" class="ep_required" alt="Required" style="display: inline"/>
^
thinkpad:~/XML -> xmllint --loaddtd --push --noout system.xml
thinkpad:~/XML -> xmllint --loaddtd --noout system.xml
thinkpad:~/XML -> xmllint --loaddtd --stream system.xml
thinkpad:~/XML ->

xmllint --push uses xmlCreatePushParserCtxt(), which doesn't take options, and then calls xmlCtxtUseOptions(ctxt, options) to set the options on the newly created context. If XML::LibXML calls xmlCreatePushParserCtxt() but doesn't call xmlCtxtUseOptions(), then I doubt the XML_PARSE_DTDLOAD option will be set.

Daniel
Here's the relevant XS code:
https://timbrody@bitbucket.org/timbrody/perl-xml-libxml/src/9189140762edc64474d8153fe18dd8198e0295b3/LibXML.xs?at=default#cl-1749

That calls LibXML_init_parser, which invokes xmlCtxtUseOptions with parserOptions. parserOptions comes from Perl space and defaults to:

( XML_PARSE_NODICT | XML_PARSE_HUGE | XML_PARSE_DTDLOAD | XML_PARSE_NOENT )

Calling xmlCtxtUseOptions(ctxt, XML_PARSE_DTDLOAD ); just before the parse occurs makes no difference (as expected).

But this is just repeating what I opened the bug with. I already know the same process works in xmllint. XML::LibXML worked before the 0:2.7.6-8.el6_3.3 release and works with 2.7.8. There's something wrong with the RHEL patches.
I've created a C unit test here:
https://github.com/timbrody/libxml_entity_bug

But my xmllint is now also failing with the push parser:

$ rpm -q libxml2
libxml2-2.7.6-12.el6_4.1.x86_64
$ xmllint --version
xmllint: using libxml version 20706
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib
$ xmllint --loaddtd --noout test.xml
$ xmllint --push --loaddtd --noout test.xml
test.xml:4: parser error : Entity 'nbsp' not defined
^

vs. a compiled 2.7.8 on the same system:

./xmllint --push --loaddtd --noout /home/user/libxml/test.xml
./xmllint --version
/root/libxml/libxml2-2.7.8/.libs/lt-xmllint: using libxml version 20708
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib

I've done some limited debugging of libxml2, and it looks like it is loading the entities but then isn't finding them in ->sax->getEntity(). So perhaps something about ctxt->userData?
My 2c: I have this problem too, and in my case it appeared after an upgrade. Since the original libxml 2.7.6 compiled from source works fine on the same server, this is definitely an issue with the RH patches applied to the plain vanilla lib.

I recompiled from the SRPMS and reverted the RH patches one by one, and discovered that the patch named "libxml2-More-fixups-on-the-push-parser-behaviour.patch" makes the difference. Without that patch the lib works fine again, but I have no idea what exactly it does; that part of the code has always been a real mess to me (sorry, no offense intended).

Maybe you will find this information useful...
Created attachment 754119 [details] patch the parser to restore external DTD parsing capability

Well, I think I found what's wrong. There is a small typo in parser.c; a patch is attached. It should be applied after the RH patches.
I can confirm that R.Scussat's patch appears to resolve the issue, although I needed to fix some whitespace to get it to apply.
Created attachment 755228 [details] patch the parser to restore external DTD parsing capability Well, Tim Brody is right. My bad. The refined version is attached. I forgot to mention it should be applied to the source package libxml2-2.7.6-12.el6_4.1.src.rpm.
Okay, thanks for researching a proper patch. I will check whether this matches an upstream commit (my recollection is yes, but more places were changed). I will try to do this this week, thanks!

Daniel
Okay, I was able to find the problem by reproducing it. It is something that was actually fixed upstream in git in October last year:

https://git.gnome.org/browse/libxml2/commit/?id=6c91aa384f48ff6d406553a6dd47fd556c1ef2e6

corresponding to upstream bug

https://bugzilla.gnome.org/show_bug.cgi?id=684774

Daniel
Created attachment 757221 [details] Backport patch for rhel-6.5 branch This is the backported patch from upstream. Just a small conflict in the backport, and solves the problem exposed by xmllint --push --loaddtd --noout test.xml Daniel
Created attachment 757268 [details] modified patch

Your patch is obviously OK, but it doesn't apply cleanly (Hunk #1 FAILED at 11473). There is some stray whitespace involved: some lines of the original code are aligned with plain spaces, others with TABs, and others are mixed. The attached reworked patch fixes this issue and applies cleanly to the source package libxml2-2.7.6-12.el6_4.1.src.rpm.
I generated rpms for testing using my patch (working for me!). They are not official and didn't go through QE, but could be useful for testing and for working around the issue. They are at:

ftp://xmlsoft.org/libxml2/test/

-rw-r--r-- 1 veillard www 4890057 Jun 5 22:52 libxml2-2.7.6-14.el6.src.rpm
-rw-rw-r-- 1 veillard www 1818796 Jun 5 23:02 libxml2-2.7.6-14.el6.x86_64.rpm
-rw-rw-r-- 1 veillard www 1417513 Jun 5 23:02 libxml2-devel-2.7.6-14.el6.x86_64.rpm
-rw-rw-r-- 1 veillard www 491354 Jun 5 23:02 libxml2-python-2.7.6-14.el6.x86_64.rpm
-rw-rw-r-- 1 veillard www 702171 Jun 5 23:02 libxml2-static-2.7.6-14.el6.x86_64.rpm

Daniel
libxml2-2.7.6-14.el6 has been made with the fix, Daniel
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1737.html
On a Linux box with the following packages installed:

libxml2-devel-2.7.6-14.el6.x86_64
libxml2-2.7.6-14.el6.x86_64
libxml2-python-2.7.6-14.el6.x86_64

the DTD entities are being removed during XML parsing. My example program and its output are below. Please help urgently with this basic issue.

XML file name: test.xml

<?xml version="1.0" encoding="utf-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>
<div style="background-color:#ff0000">&nbsp;Jum&nbsp;</div>
</body>
</note>

File name: readxml.php

<?php
$parser = xml_parser_create();

function char($parser, $data) {
    echo "<br/>" . $data;
}

xml_set_character_data_handler($parser, "char");
$fp = fopen("test.xml", "rt");
while ($data = fread($fp, 4096)) {
    xml_parse($parser, $data, feof($fp)) or
        die(sprintf("XML Error: %s at line %d",
            xml_error_string(xml_get_error_code($parser)),
            xml_get_current_line_number($parser)));
}
xml_parser_free($parser);
?>

Current output:
div style="background-color:#ff0000"nbsp;Jumnbsp;/div

Expected output:
<div style="background-color:#ff0000"> Jum </div>