Bug 1107936 - iconv and uconv gives different results when converting GB18030 encoded files
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: icu
Version: 29
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Eike Rathke
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-06-11 04:52 UTC by Peng Wu
Modified: 2018-11-28 08:44 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-19 11:47:15 UTC


Attachments
The original file (34 bytes, text/plain)
2014-06-11 04:55 UTC, Peng Wu
The iconv converted file (41 bytes, text/plain)
2014-06-11 04:56 UTC, Peng Wu
The uconv converted file (41 bytes, text/plain)
2014-06-11 04:57 UTC, Peng Wu


Links
System ID Priority Status Summary Last Updated
Sourceware 19575 None None None 2019-01-10 22:20:32 UTC

Description Peng Wu 2014-06-11 04:52:01 UTC
We recently found that iconv and uconv produce different output when converting GB18030-encoded files.

The attachments contain the original file, the iconv-converted file, and the uconv-converted file.

I used the following commands to convert the original file:
iconv -f GB18030 -t UTF-8 < origin.txt > iconv.txt
uconv -f GB18030 -t UTF-8 < origin.txt > uconv.txt

Why do iconv and uconv give different results?
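
To see exactly which bytes differ between the two converted files, a byte-level comparison can help. The following is a minimal sketch, not part of the original report: the function names hexdump_bytes and compare_outputs are hypothetical, and it assumes a POSIX shell with od, tr, sed, diff and mktemp available.

```shell
#!/bin/sh
# Print a file as one hex byte per line so that diff can align the outputs.
hexdump_bytes() {
    od -An -v -tx1 "$1" | tr -s ' ' '\n' | sed '/^$/d'
}

# Show the bytes that differ between two converted files.
# Returns diff's exit status: 0 if identical, non-zero otherwise.
compare_outputs() {
    tmp_a=$(mktemp)
    tmp_b=$(mktemp)
    hexdump_bytes "$1" > "$tmp_a"
    hexdump_bytes "$2" > "$tmp_b"
    diff "$tmp_a" "$tmp_b"
    status=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$status"
}
```

With the files produced by the commands above, it could be called as `compare_outputs iconv.txt uconv.txt`; lines starting with `<` come from the first file and lines starting with `>` from the second.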

Comment 1 Peng Wu 2014-06-11 04:55:37 UTC
Created attachment 907479
The original file

Comment 2 Peng Wu 2014-06-11 04:56:34 UTC
Created attachment 907480
The iconv converted file

Comment 3 Peng Wu 2014-06-11 04:57:14 UTC
Created attachment 907481
The uconv converted file

Comment 4 Fedora End Of Life 2015-05-29 12:05:05 UTC
This message is a reminder that Fedora 20 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 20. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '20'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 20 reaches end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 5 Mike FABIAN 2015-06-01 11:18:23 UTC
The problem still exists unchanged in Fedora 22.

Comment 6 Florian Weimer 2016-02-05 19:52:52 UTC
Which one is correct?  Emacs agrees with uconv.

The glibc mapping has these mappings in localedata/charmaps/GB18030:

% <UE78D>     /xa6/xd9         <Private Use>
% <UE78E>     /xa6/xda         <Private Use>
% <UE78F>     /xa6/xdb         <Private Use>
% <UE790>     /xa6/xdc         <Private Use>
% <UE791>     /xa6/xdd         <Private Use>
% <UE792>     /xa6/xde         <Private Use>
% <UE793>     /xa6/xdf         <Private Use>
% <UE794>     /xa6/xec         <Private Use>
% <UE795>     /xa6/xed         <Private Use>
% <UE796>     /xa6/xf3         <Private Use>
…
<UFE10>     /xa6/xd9         PRESENTATION FORM FOR VERTICAL COMMA
<UFE11>     /xa6/xdb         PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
<UFE12>     /xa6/xda         PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
<UFE13>     /xa6/xdc         PRESENTATION FORM FOR VERTICAL COLON
<UFE14>     /xa6/xdd         PRESENTATION FORM FOR VERTICAL SEMICOLON
<UFE15>     /xa6/xde         PRESENTATION FORM FOR VERTICAL EXCLAMATION MARK
<UFE16>     /xa6/xdf         PRESENTATION FORM FOR VERTICAL QUESTION MARK
<UFE17>     /xa6/xec         PRESENTATION FORM FOR VERTICAL LEFT WHITE LENTICULAR BRACKET
<UFE18>     /xa6/xed         PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET
<UFE19>     /xa6/xf3         PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS

Wikipedia links to this XML file, which obviously agrees with the uconv output:

http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
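
The disagreement can be narrowed down to a single two-byte sequence from the charmap excerpt above. This is a rough sketch under stated assumptions: probe_bytes is a hypothetical helper name, and the converter is assumed to accept the same -f/-t options as the commands in the description.

```shell
#!/bin/sh
# Feed a GB18030 byte sequence to a converter and print the UTF-8 result
# as hex bytes.
#   $1: converter command, e.g. iconv or uconv
#   $2: input bytes as printf octal escapes, e.g. '\246\331' for /xa6/xd9
#       (the string is used as a printf format, so it must not contain '%')
probe_bytes() {
    tool=$1
    bytes=$2
    printf "$bytes" | "$tool" -f GB18030 -t UTF-8 | od -An -v -tx1
}
```

For example, `probe_bytes iconv '\246\331'` and `probe_bytes uconv '\246\331'` can be run side by side. Going by the mappings quoted above, U+FE10 is ef b8 90 in UTF-8 and U+E78D is ee 9e 8d, so the output shows which of the two mappings each tool applies.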

Comment 7 Carlos O'Donell 2016-02-05 20:54:27 UTC
(In reply to Florian Weimer from comment #6)
> Which one is correct?  Emacs agrees with uconv.

As with linguistics, they are both correct :-)

> The glibc mapping has these mappings in localedata/charmaps/GB18030:
> 
> % <UE78D>     /xa6/xd9         <Private Use>
> % <UE78E>     /xa6/xda         <Private Use>
> % <UE78F>     /xa6/xdb         <Private Use>
> % <UE790>     /xa6/xdc         <Private Use>
> % <UE791>     /xa6/xdd         <Private Use>
> % <UE792>     /xa6/xde         <Private Use>
> % <UE793>     /xa6/xdf         <Private Use>
> % <UE794>     /xa6/xec         <Private Use>
> % <UE795>     /xa6/xed         <Private Use>
> % <UE796>     /xa6/xf3         <Private Use>
> …
> <UFE10>     /xa6/xd9         PRESENTATION FORM FOR VERTICAL COMMA
> <UFE11>     /xa6/xdb         PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
> <UFE12>     /xa6/xda         PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
> <UFE13>     /xa6/xdc         PRESENTATION FORM FOR VERTICAL COLON
> <UFE14>     /xa6/xdd         PRESENTATION FORM FOR VERTICAL SEMICOLON
> <UFE15>     /xa6/xde         PRESENTATION FORM FOR VERTICAL EXCLAMATION MARK
> <UFE16>     /xa6/xdf         PRESENTATION FORM FOR VERTICAL QUESTION MARK
> <UFE17>     /xa6/xec         PRESENTATION FORM FOR VERTICAL LEFT WHITE LENTICULAR BRACKET
> <UFE18>     /xa6/xed         PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET
> <UFE19>     /xa6/xf3         PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS

The GB 18030-2005 standard still uses private-use-area (PUA) code points for some ideograms. The non-PUA code points above (which differ from the published standard) are correct for GB 18030-2005 compliance. In Unicode 4.1 or newer, these PUA code points have non-PUA equivalents that can be used instead. Using the Unicode 4.1 code points is best practice and is highly recommended for anyone mapping GB 18030-2005 to UTF-8 (see note below).

> Wikipedia links to this XML file, which obviously agrees with the uconv
> output:
> http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml

That file reflects the old standard, GB 18030-2000 (confirmed by checking that /xa8/xbc still maps to the old PUA code point <UE7C7>, which GB 18030-2005 fixed), but even the old standard defines the above PUA code points for these ideograms.

In summary:
- glibc supports GB 18030-2005 and contains corrections for the most recent version.
- Following best practice, glibc converts those GB 18030-2005 ideograms that would have used PUA code points into their equivalent non-PUA Unicode 4.1 code points.
- uconv uses the exact PUA code points as the standard suggests, which is not recommended, and this causes the difference.
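
Until the two tools agree, output that contains the PUA code points could be post-processed into the non-PUA forms. The following is only a sketch of that idea, not something either project provides: pua_to_vertical_forms is a hypothetical name, the ten pairs are taken from the charmap lines quoted from comment 6, and the input is assumed to be valid UTF-8. It would also rewrite these code points in text that uses them deliberately as private-use characters.

```shell
#!/bin/sh
# Rewrite U+E78D..U+E796 (PUA) to the vertical presentation forms
# U+FE10..U+FE19, pairing them as in the quoted glibc charmap.
# Reads UTF-8 on stdin and writes UTF-8 on stdout. Byte values are given as
# printf octal escapes; LC_ALL=C makes sed match on raw bytes.
pua_to_vertical_forms() {
    LC_ALL=C sed \
        -e "s/$(printf '\356\236\215')/$(printf '\357\270\220')/g" \
        -e "s/$(printf '\356\236\216')/$(printf '\357\270\222')/g" \
        -e "s/$(printf '\356\236\217')/$(printf '\357\270\221')/g" \
        -e "s/$(printf '\356\236\220')/$(printf '\357\270\223')/g" \
        -e "s/$(printf '\356\236\221')/$(printf '\357\270\224')/g" \
        -e "s/$(printf '\356\236\222')/$(printf '\357\270\225')/g" \
        -e "s/$(printf '\356\236\223')/$(printf '\357\270\226')/g" \
        -e "s/$(printf '\356\236\224')/$(printf '\357\270\227')/g" \
        -e "s/$(printf '\356\236\225')/$(printf '\357\270\230')/g" \
        -e "s/$(printf '\356\236\226')/$(printf '\357\270\231')/g"
}
```

A possible use is `uconv -f GB18030 -t UTF-8 < origin.txt | pua_to_vertical_forms > uconv-fixed.txt`, where uconv-fixed.txt is an illustrative file name. Note that /xa6/xda and /xa6/xdb are swapped relative to a simple offset (U+E78E pairs with U+FE12, U+E78F with U+FE11), which is why the pairs are listed explicitly.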

I recommend filing a bug against uconv asking it to follow best practice and use the Unicode 4.1 code points, avoiding the problematic PUA code points defined in the original standard.

Note that this is the recommended practice in "CJKV Information Processing" by Dr. Ken Lunde, who is probably the world's leading expert on the topic.

Moving to icu.

Comment 8 Fedora End Of Life 2016-07-19 11:47:15 UTC
Fedora 22 changed to end-of-life (EOL) status on 2016-07-19. Fedora 22 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 9 Florian Weimer 2017-03-15 06:14:36 UTC
I believe this is still not fixed in icu.

Comment 10 Jan Kurik 2017-08-15 07:33:09 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 27 development cycle.
Changing version to '27'.

Comment 11 Ben Cotton 2018-11-27 18:35:44 UTC
This message is a reminder that Fedora 27 is nearing its end of life.
On 2018-Nov-30  Fedora will stop maintaining and issuing updates for
Fedora 27. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora  'version' of '27'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 27 reaches end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 12 Peng Wu 2018-11-28 06:45:35 UTC
This bug still exists in Fedora 29.

Changed version to Fedora 29.

