Bug 1382038

Summary:	iconv: the iconv interface does not allow to specify normalization
Product:	[Fedora] Fedora	Reporter:	Nikos Mavrogiannopoulos <nmavrogi>
Component:	glibc	Assignee:	Carlos O'Donell <codonell>
Status:	CLOSED WORKSFORME	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	rawhide	CC:	arjun, codonell, dj, fweimer, jakub, law, mfabian, nmavrogi, pfrankli, siddhesh
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2016-11-08 13:15:59 UTC	Type:	Bug

Description Nikos Mavrogiannopoulos 2016-10-05 14:47:29 UTC

Description of problem:
The UTF-8 and UTF-16 encodings do not by themselves provide a unique encoding of a given text. There are multiple normalization forms (e.g., NFC, NFD) [0], which result in different encoded sequences. Although older standards such as PKIX and ASN.1 simply ask for UTF-8 and UTF-16 fields, newer standards mandate a normalization form (e.g., see https://tools.ietf.org/html/rfc7512#section-5 ).

As such, the iconv_open() interface as it stands is not sufficient for such applications. They can convert to UTF-16BE, but they have no way to request UTF-16BE//NFC or an alternative. Moreover, the iconv_open() manual page does not mention the normalization form of its default output at all.

[0]. http://unicode.org/reports/tr15/
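
To illustrate the problem described above, here is a minimal sketch. It uses Python rather than the C iconv API purely for brevity, and the variable names are illustrative. It takes the same visible text, "é", in its NFC and NFD forms and shows that plain transcoding to UTF-16BE keeps whichever form it was given, so the two outputs differ.

```python
import unicodedata

# The same user-visible text, "é", in two canonically equivalent forms.
composed = "\u00e9"     # NFC: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # NFD: "e" followed by COMBINING ACUTE ACCENT

# Plain transcoding does not normalize; it preserves the input form.
utf16_composed = composed.encode("utf-16-be")
utf16_decomposed = decomposed.encode("utf-16-be")

print(utf16_composed.hex())    # 00e9
print(utf16_decomposed.hex())  # 00650301

# Only an explicit normalization step makes the two forms agree.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

A converter that only transcodes therefore cannot guarantee a particular normalization form in its output; the caller has to normalize separately.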


How this can be addressed:
 - The manual page of iconv_open mentions the default normalization form
 - iconv_open() allows an NF form to be specified

Comment 1 Florian Weimer 2016-10-05 14:51:35 UTC

Do you need a specific version of the Unicode standard for stability?  Then iconv will never be what you need because the plan is only to support the latest Unicode standard at the time of a glibc release.

Comment 2 Nikos Mavrogiannopoulos 2016-10-05 15:31:10 UTC

No, I do not think that a specific version of the standard is required for the request above. As far as I understand, a conversion from UTF-8 to UTF-16BE//NFC would have the same output for the same characters, so that is sufficient for me regardless of the underlying standard version.

My main use case for this request is being able to convert UTF-8 input to a UTF-16BE string under the NFC rules (to be used as a password, and thus the output encoding must be reproducible).
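
The use case above could be sketched as follows. This is an illustration in Python, not an existing glibc or iconv interface: utf8_to_utf16be_nfc() is a hypothetical helper name, and it assumes the input is valid UTF-8.

```python
import unicodedata

def utf8_to_utf16be_nfc(utf8_bytes: bytes) -> bytes:
    """Hypothetical helper: decode UTF-8, normalize to NFC, encode as UTF-16BE."""
    text = unicodedata.normalize("NFC", utf8_bytes.decode("utf-8"))
    return text.encode("utf-16-be")

# Two UTF-8 spellings of the same password character, "é".
precomposed_input = b"\xc3\xa9"  # UTF-8 for U+00E9
combining_input = b"e\xcc\x81"   # UTF-8 for U+0065 U+0301

# After normalization both inputs yield the same UTF-16BE bytes.
print(utf8_to_utf16be_nfc(precomposed_input).hex())  # 00e9
print(utf8_to_utf16be_nfc(combining_input).hex())    # 00e9
```

Normalizing before encoding is what makes the derived byte string reproducible no matter how the user's input method composed the characters.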

Comment 3 Florian Weimer 2016-10-05 15:52:36 UTC

(In reply to Nikos Mavrogiannopoulos from comment #2)
> No, I do not think that a specific version of the standard is required for
> the request above. As far as I understand, a conversion from UTF-8 to
> UTF-16BE//NFC would have the same output for the same characters, so that is
> sufficient for me regardless of the underlying standard version.

That's not correct.  New characters may do away with the need for using combining characters to represent some glyphs, and so NFC results change.

> My main use case for this request is being able to convert UTF-8 input to a
> UTF-16BE string under the NFC rules (to be used as a password, and thus the
> output encoding must be reproducible).

That's not going to work with glibc, sorry.  You need to record the specific version of Unicode/NFC to use and have support tables for that.  ICU should provide this.

Comment 4 Florian Weimer 2016-10-05 15:59:02 UTC

Hmm, if this is the normalization that is used:

  http://www.unicode.org/reports/tr15/#Stability_of_Normalized_Forms

then it should be stable.

So while this is traditionally the domain of ICU, we might be able to support this in glibc.
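
As an illustration of the stability point, here is a small Python sketch. It assumes my reading of UAX #15 is right: precomposed characters on the composition exclusion list are never produced by NFC. U+2ADC FORKING is used as the example. It has a canonical decomposition but is excluded from composition, so the combining sequence stays as it is under NFC.

```python
import unicodedata

def nfc(text: str) -> str:
    """Return the NFC form of text."""
    return unicodedata.normalize("NFC", text)

# U+2ADC FORKING decomposes canonically to
# U+2ADD NONFORKING + U+0338 COMBINING LONG SOLIDUS OVERLAY.
precomposed = "\u2adc"
sequence = "\u2add\u0338"

# NFC does not compose the sequence into the excluded precomposed character...
print(nfc(sequence) == sequence)     # True
# ...and the precomposed character itself normalizes to the sequence.
print(nfc(precomposed) == sequence)  # True

# The Unicode data version used by this Python build, for reference.
print(unicodedata.unidata_version)
```

In other words, the existence of a precomposed character does not by itself change the NFC form of text written with the combining sequence.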

Do you need this in iconv, or would a wchar_t *-to-wchar_t * conversion do the job as well?

Comment 5 Nikos Mavrogiannopoulos 2016-11-02 12:09:07 UTC

(In reply to Florian Weimer from comment #4)
> Hmm, if this is the normalization that is used:
> 
>   http://www.unicode.org/reports/tr15/#Stability_of_Normalized_Forms
> 
> then it should be stable.
> So while this is traditionally the domain of ICU, we might be able to
> support this in glibc.

As I see it, glibc already supports character conversions to and from UTF-8. Without normalization support, that means a modern library that has to conform to any standard involving UTF-8 has no way to specify (or even know) the normalization form of the output data.

> Do you need this in iconv, or would a wchar_t *-to-wchar_t * conversion do
> the job as well?

I use only iconv(), so I cannot comment on the other APIs.

Comment 6 Nikos Mavrogiannopoulos 2016-11-08 13:15:59 UTC

I no longer think that the libc-provided APIs are reasonable for Unicode processing. I am switching to libunistring.