Bug 475250 - perl-XML-SAX does not properly decode UTF-8 characters
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: perl-XML-SAX
Hardware: All Linux
Priority: low, Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Marcela Mašláňová
Keywords: Regression
Depends On:
Reported: 2008-12-08 12:23 EST by Trevin Beattie
Modified: 2013-04-12 15:59 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2010-01-06 03:57:09 EST

Attachments
Perl script which demonstrates the bug (546 bytes, application/x-perl)
2008-12-08 12:24 EST, Trevin Beattie
Input file for the test script (160 bytes, text/xml)
2008-12-08 12:25 EST, Trevin Beattie

Description Trevin Beattie 2008-12-08 12:23:40 EST
Description of problem:
I have a Perl module which reads a language translation file in XML with UTF-8 encoding and performs string substitution on a template.  The module uses XML::Simple, which in turn utilizes XML::SAX.  Our current production systems are running RHEL 3 with perl-XML-Simple-2.14-2 and perl-XML-SAX-0.12-1 (the latter built in-house).  In testing our web site on RHEL 5.3 beta with perl-XML-Simple-2.14-4.fc6 and perl-XML-SAX-0.14-5, I see that all non-ASCII characters on the translated pages come out as garbage.

If I simply replace the Red Hat perl-XML-SAX-0.14 package with our older 0.12 package, the page then renders correctly.

However, I also have a system running Fedora Core 6 which is using the same perl-XML-Simple-2.14-4.fc6 package and perl-XML-SAX-0.14-2.  On this system the test code I wrote writes out correct UTF-8 characters, so there must be some subtle difference between FC6 and EL5 which would account for the error.

Version-Release number of selected component (if applicable):

How reproducible:
Every time.

Steps to Reproduce:
1. Run the attached Perl script as follows:
   test-XML-SAX-0.13.pl test-XML-SAX-0.13.xml test-XML-SAX-0.13.out
2. Examine the output file:
   cat test-XML-SAX-0.13.out
3. Check the encoding by passing the output file to hexdump:
   hexdump -C test-XML-SAX-0.13.out
Actual results (hexdump on left, output on right):
00000000  c3 a9 c2 83 c2 bd c3 a5  c2 b8 c2 82 0a           |都市.|

Expected results:
00000000  e9 83 bd e5 b8 82 0a                              |都市.|
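
The actual bytes are exactly what results when the expected UTF-8 bytes are misread as ISO-8859-1 characters and then encoded to UTF-8 a second time. The following is a minimal sketch of that check, assuming `iconv` and `od` are available; the helper name `hex_bytes` is made up for illustration, and the sketch says nothing about which component performs the second encoding.

```shell
#!/bin/sh
# Print stdin as space-separated hex bytes on a single line.
hex_bytes() {
    od -An -tx1 -v | tr -d '\n' | sed 's/^ *//; s/  */ /g'
}

# Correct UTF-8 for the two characters plus a newline (octal escapes
# are used because \x is not portable across printf implementations).
expected=$(printf '\351\203\275\345\270\202\n' | hex_bytes)

# Misread the same bytes as ISO-8859-1 and encode them as UTF-8 again.
double=$(printf '\351\203\275\345\270\202\n' \
    | iconv -f ISO-8859-1 -t UTF-8 | hex_bytes)

echo "expected: $expected"
echo "double:   $double"
```

The `double:` line matches the hexdump under "Actual results", which points to a decoded/undecoded mix-up (double encoding) rather than data corruption.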

Additional info:
This seems to have been broken at version 0.13 (introduced by update request in bug #176161):
Comment 1 Trevin Beattie 2008-12-08 12:24:44 EST
Created attachment 326150
Perl script which demonstrates the bug
Comment 2 Trevin Beattie 2008-12-08 12:25:14 EST
Created attachment 326151
Input file for the test script
Comment 3 Trevin Beattie 2008-12-10 16:38:34 EST
I upgraded perl-XML-SAX on my Fedora Core 6 system to version 0.14-5 from the Red Hat EL 5.3b distribution, and the test script still runs correctly.

I then upgraded perl itself to version 5.8.8-18.el5, and it still runs correctly.

Given that, I can't say whether the perl-XML-SAX package is the source of the bug.  As both systems now have the exact same perl, perl-XML-Simple, and perl-XML-SAX packages, and the latter packages are pure Perl code, what could possibly be different between FC6 and EL5 that would cause the script output to differ?
Comment 4 Marcela Mašláňová 2008-12-11 09:26:40 EST
There were problems with the scriptlets that install ParserDetails.ini. Those scriptlets are not in RHEL-5, but they are in FC-6; that is the only difference between the XML::SAX modules. The scriptlets are quite problematic and are probably the cause of the troublesome updates from RHEL-4 to RHEL-5. I'll be working on a fix.
Comment 5 Trevin Beattie 2008-12-11 11:24:42 EST
That must be a ghost file, because even though rpm says it belongs to perl-XML-SAX, it isn't part of the package's file listing.

I see this morning that the contents of the file are different between my two systems, but only in the order in which the sections are defined.  Here's the file from FC6:

http://xml.org/sax/features/namespaces = 1

http://xml.org/sax/features/namespaces = 1

http://xml.org/sax/features/namespaces = 1

And here is the file from EL5:

http://xml.org/sax/features/namespaces = 1

http://xml.org/sax/features/namespaces = 1

http://xml.org/sax/features/namespaces = 1

I did notice that after I downgraded perl-XML-SAX to version 0.12 and then upgraded back to 0.14 again, ParserDetails.ini had disappeared.  I noticed because my program would not run at all -- I got the error "could not find ParserDetails.ini in /usr/lib/perl5/vendor_perl/5.8.8/XML/SAX".  I had to completely remove and then re-install the package to fix that little problem.

I'm able to confirm that the order of entries in ParserDetails.ini *does* make a difference!  When I swapped this file between the two systems, the test script broke on FC6 and worked properly on EL5.
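
Since the ordering matters, comparing only the section headers of the file on each system is a quick way to spot the difference. As far as can be told from the XML::SAX::ParserFactory code, when no parser is requested explicitly it falls back to the last entry in ParserDetails.ini, so the last header is the one to watch; treat that as an assumption to verify against the installed version. A minimal sketch, using the path from the error message above as the default; the function names `parser_order` and `last_parser` are hypothetical:

```shell
#!/bin/sh
# Default path taken from the error message quoted above; adjust per system.
INI=${1:-/usr/lib/perl5/vendor_perl/5.8.8/XML/SAX/ParserDetails.ini}

# List the [Section] headers in file order, one per line, without brackets.
parser_order() {
    sed -n 's/^\[\(.*\)][[:space:]]*$/\1/p' "$1"
}

# The last header, i.e. the entry assumed to act as the default parser.
last_parser() {
    parser_order "$1" | tail -n 1
}

if [ -r "$INI" ]; then
    parser_order "$INI"
    echo "last entry: $(last_parser "$INI")"
else
    echo "ParserDetails.ini not readable at $INI" >&2
fi
```

Running this on both machines and comparing the "last entry" lines shows whether they would pick different default parsers.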
Comment 6 Marcela Mašláňová 2008-12-12 06:42:00 EST
The missing ParserDetails.ini file is a separate bug, filed in rhbz as #289061.

The UTF-8 problem was fixed in the latest version of XML::SAX; see upstream bug http://rt.cpan.org/Public/Bug/Display.html?id=26588. It is a regression relative to 0.12.
Comment 14 errata-xmlrpc 2010-01-06 03:57:09 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

Comment 15 Trevin Beattie 2010-01-12 18:38:27 EST
After upgrading to perl-XML-LibXML-1.58-6.i386.rpm and perl-XML-SAX-0.14-8.noarch.rpm:

[tbeattie@admin tmp]$ ./test-XML-SAX-0.13.pl test-XML-SAX-0.13.xml test-XML-SAX-0.13.out
could not find ParserDetails.ini in /usr/lib/perl5/vendor_perl/5.8.8/XML/SAX

but the output file was correctly encoded.

After removing and cleanly re-installing the packages, the test script ran without any errors.
