192929 – iconv() converts into UCS-4 as little-endian

Summary: iconv() converts into UCS-4 as little-endian

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	gcc3
Sub Component:
Version:	4.0
Hardware:	i586
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Jakub Jelinek
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-05-24 10:16 UTC by Gregory Brodsky
Modified:	2007-11-30 22:07 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-05-24 14:34:47 UTC
Target Upstream Version:
Embargoed:

Attachments
C++ source to reproduce the problem (2.35 KB, text/plain) 2006-05-24 10:16 UTC, Gregory Brodsky

Description Gregory Brodsky 2006-05-24 10:16:16 UTC

Description of problem:

According to the Unicode 4.0 Standard, UCS-4 is just an alias for UTF-32, and
UCS-2 is just an alias for UTF-16. Therefore, their endianness should be the same.

Indeed, UCS-2 and UTF-16 are both big-endian.

However, iconv() conversion into UCS-4 works differently from conversion into
UTF-32. The result of conversion into UTF-32 is big-endian, and the result of
conversion into UCS-4 is little-endian.

I believe this is wrong. In particular, there is no justification for the
inconsistency between UTF-16 and UTF-32.

Please note that I mean the default names like "UCS-4", without suffixes like "BE".

Version-Release number of selected component (if applicable):

Found on RHEL WS v4, gcc v3.4.3.

How reproducible:
I attached a program that reproduces the problem. It is not production-quality
code, though :).


Steps to Reproduce:
1. As attached, the source prints the result for UTF-32.
2. To see the result for UCS-4, replace the encoding name and recompile.
3. g++ iconvtest.C; ./a.out
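
The attached reproducer is not shown inline, so the following is only a minimal sketch of the same kind of check, not the attachment itself. It converts the single character "A" from UTF-8 with iconv() and reports which byte order the output uses. The helper names (convert_from_utf8, byte_order_of_A, report) are hypothetical and chosen for illustration; only iconv_open(), iconv() and iconv_close() come from the real API.

```cpp
// Sketch only: helper names are illustrative, not taken from the attachment.
#include <iconv.h>

#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Convert UTF-8 text to the named encoding and return the raw output bytes.
// Returns an empty vector if the encoding is unknown or the conversion fails.
std::vector<unsigned char> convert_from_utf8(const char* to_encoding,
                                             const std::string& text) {
    assert(to_encoding != nullptr);
    std::vector<unsigned char> out;
    iconv_t cd = iconv_open(to_encoding, "UTF-8");
    if (cd == (iconv_t)-1) return out;

    std::string input = text;  // iconv() takes a non-const input pointer
    char* in_ptr = &input[0];
    size_t in_left = input.size();

    // At most 4 output bytes per input byte, plus room for an optional BOM.
    std::vector<char> buf(4 * input.size() + 8);
    char* out_ptr = buf.data();
    size_t out_left = buf.size();

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) return out;

    out.assign(buf.data(), out_ptr);
    return out;
}

// Classify the last four bytes as a 32-bit encoding of 'A' (U+0041).
// Looking at the tail skips a leading BOM if the converter emitted one.
const char* byte_order_of_A(const std::vector<unsigned char>& bytes) {
    if (bytes.size() < 4) return "unknown";
    const unsigned char* p = &bytes[bytes.size() - 4];
    if (p[0] == 0 && p[1] == 0 && p[2] == 0 && p[3] == 0x41) return "big-endian";
    if (p[0] == 0x41 && p[1] == 0 && p[2] == 0 && p[3] == 0) return "little-endian";
    return "unknown";
}

// Print the converted bytes in hex together with the detected byte order.
void report(const char* encoding) {
    std::vector<unsigned char> bytes = convert_from_utf8(encoding, "A");
    std::printf("%-8s", encoding);
    for (unsigned char b : bytes) std::printf(" %02x", b);
    std::printf("  (%s)\n", byte_order_of_A(bytes));
}
```

Calling report("UTF-32") and report("UCS-4") from main() gives the comparison described in this report. The bytes printed depend on the iconv implementation and its version, so no particular output is assumed here.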
  
Actual results:

The result of conversion into UTF-32 is big-endian, and the result of conversion
into UCS-4 is little-endian.

Expected results:

They both should be big-endian, like UTF-16 & UCS-2 are.

Additional info:

I was not sure which component this should be assigned to, sorry for that. If you
know a more appropriate person, please forward the defect to him/her.

Comment 1 Gregory Brodsky 2006-05-24 10:16:16 UTC

Created attachment 129908
C++ source to reproduce the problem

Comment 2 Jakub Jelinek 2006-05-24 12:11:26 UTC

Can you cite why you think, say, UCS-2 is an alias for UTF-16?
Certainly http://www.unicode.org/reports/tr17/index.html
doesn't suggest anything like that, it has always been a different encoding.

Comment 3 Ulrich Drepper 2006-05-24 14:34:47 UTC

That's nonsense.  UCS-2 and UCS-4 are standalone encodings.

Comment 4 Gregory Brodsky 2006-05-25 10:13:09 UTC

(In reply to comment #2)
> Can you cite why you think, say, UCS-2 is an alias for UTF-16?
> Certainly http://www.unicode.org/reports/tr17/index.html
> doesn't suggest anything like that, it has always been a different encoding.
> 

You are correct, my opinion is based on a different source.

Unicode v4.0 Standard book, page 1350:
"As a consequence, UCS-4 can now be taken effectively as an alias for the
Unicode encoding form UTF-32...".

On page 1352, in the list of encodings:
"UTF-8, UTF-16 or UCS-4 (=UTF-32)"

There is a similar statement about UTF-16 vs UCS-2, but I did not find an exact
citation.

Comment 5 Gregory Brodsky 2006-05-25 11:56:14 UTC

(In reply to comment #3)
> That's nonsense.  UCS-2 and UCS-4 are standalone encodings.

Well, but according to the Unicode v4.0 Standard book, page 32, the byte order
for both of them is platform-dependent.

Please note that since (unlike UTF-16 and UTF-32) UCS-2 and UCS-4 converted data
is generated without a BOM, the customer has no way to determine its byte order
other than by platform.

That's why I don't understand why the byte order of UCS-2 and UCS-4 might
differ on the same system.
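
To illustrate the point about BOM-less data, here is a minimal sketch of how a consumer of 32-bit data might pick a byte order: use the BOM (U+FEFF) when one is present, otherwise fall back to an order chosen by the caller, such as the host order. All names below (ByteOrder, order_from_utf32_bom, host_byte_order, first_code_point) are hypothetical. The host-order fallback is the assumption argued for in this comment, not something iconv() is known to guarantee.

```cpp
// Sketch only: all names are illustrative, not part of any library.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class ByteOrder { Big, Little, Unknown };

// Look for a 32-bit byte order mark (U+FEFF) at the start of the buffer.
ByteOrder order_from_utf32_bom(const unsigned char* data, std::size_t size) {
    if (size >= 4) {
        if (data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
            return ByteOrder::Big;
        if (data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
            return ByteOrder::Little;
    }
    return ByteOrder::Unknown;
}

// Byte order of the machine running this code.
ByteOrder host_byte_order() {
    const std::uint32_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1 ? ByteOrder::Little : ByteOrder::Big;
}

// Assemble one 32-bit value from four bytes in the given order.
std::uint32_t read_u32(const unsigned char* p, ByteOrder order) {
    assert(order != ByteOrder::Unknown);
    if (order == ByteOrder::Big)
        return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
               (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
    return (std::uint32_t(p[3]) << 24) | (std::uint32_t(p[2]) << 16) |
           (std::uint32_t(p[1]) << 8) | std::uint32_t(p[0]);
}

// Decode the first code point. A BOM, if present, decides the byte order and
// is skipped; without a BOM the caller-supplied fallback order is used.
bool first_code_point(const unsigned char* data, std::size_t size,
                      ByteOrder fallback, std::uint32_t* out) {
    ByteOrder order = order_from_utf32_bom(data, size);
    if (order != ByteOrder::Unknown) {
        data += 4;
        size -= 4;
    } else {
        order = fallback;
    }
    if (size < 4 || order == ByteOrder::Unknown) return false;
    *out = read_u32(data, order);
    return true;
}
```

With a BOM present the fallback is ignored. Without one, the same four bytes decode to different values depending on the fallback, which is why the default byte order of the BOM-less encodings matters.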
