1382038 – iconv: the iconv interface does not allow to specify normalization

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Carlos O'Donell
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-10-05 14:47 UTC by Nikos Mavrogiannopoulos
Modified:	2016-11-08 13:15 UTC
CC List:	10 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2016-11-08 13:15:59 UTC
Type:	Bug
Embargoed:
Dependent Products:

Description Nikos Mavrogiannopoulos 2016-10-05 14:47:29 UTC

Description of problem:
The UTF-8 and UTF-16 encodings do not normally provide a unique encoding of a given text. There are multiple normalization forms (e.g., NFC, NFD) [0] which result in different encodings. Although older standards such as PKIX and ASN.1 simply ask for UTF-8 and UTF-16 fields, newer standards mandate a normalization form (e.g., see https://tools.ietf.org/html/rfc7512#section-5 ).

As such, the iconv_open() interface as it stands is not sufficient to cover such applications. They can convert to UTF-16BE, but they have no way to specify UTF-16BE//NFC or alternatives. Moreover, the manual page of iconv_open() does not mention the normalization form of its default output at all.

[0]. http://unicode.org/reports/tr15/
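
The limitation described above can be observed with a minimal sketch. The helper name utf8_to_utf16be and the two sample strings are assumptions made for illustration only; they are not part of glibc. The sketch converts the same visible text ("é") from UTF-8 to UTF-16BE twice, once from a precomposed (NFC) spelling and once from a decomposed (NFD) spelling. Since iconv_open() accepts no normalization parameter, the output is expected to follow whatever form the input happened to use.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a glibc function): convert a NUL-terminated
   UTF-8 string to UTF-16BE with plain iconv.  Returns the number of bytes
   written to out, or (size_t) -1 on failure.  No normalization form is
   requested, because iconv_open() offers no way to ask for one. */
static size_t
utf8_to_utf16be (const char *in, unsigned char *out, size_t out_cap)
{
  assert (in != NULL && out != NULL);

  iconv_t cd = iconv_open ("UTF-16BE", "UTF-8");
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  char *inp = (char *) in;      /* iconv() takes a non-const char ** */
  size_t in_left = strlen (in);
  char *outp = (char *) out;
  size_t out_left = out_cap;

  size_t rc = iconv (cd, &inp, &in_left, &outp, &out_left);
  iconv_close (cd);
  if (rc == (size_t) -1)
    return (size_t) -1;

  return out_cap - out_left;    /* bytes actually produced */
}

/* The same user-visible text in two canonically equivalent spellings. */
static const char nfc_input[] = "\xC3\xA9";   /* U+00E9, precomposed (NFC) */
static const char nfd_input[] = "e\xCC\x81";  /* U+0065 U+0301, decomposed (NFD) */
```

Calling utf8_to_utf16be() on nfc_input and on nfd_input with a 16-byte buffer should yield 2 bytes (00 E9) and 4 bytes (00 65 03 01) respectively, so two inputs that render identically produce different UTF-16BE byte strings. An application that needs one canonical form would have to normalize the text with another library before or after calling iconv.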


How that can be addressed:
 - The manual page of iconv_open mentions the default normalization form
 - iconv_open() allows an NF form to be specified

Comment 1 Florian Weimer 2016-10-05 14:51:35 UTC

Do you need a specific version of the Unicode standard for stability?  Then iconv will never be what you need because the plan is only to support the latest Unicode standard at the time of a glibc release.

Comment 2 Nikos Mavrogiannopoulos 2016-10-05 15:31:10 UTC

No, I do not think that a specific version of the standard is required for the request above. As far as I understand, a conversion from UTF-8 to UTF-16BE//NFC would have the same output for the same characters, so that is sufficient for me regardless of the underlying standard.

My main use case for this request is being able to convert UTF-8 input to a UTF-16BE string under the NFC rules (to be used as a password, and thus the output encoding must be reproducible).

Comment 3 Florian Weimer 2016-10-05 15:52:36 UTC

(In reply to Nikos Mavrogiannopoulos from comment #2)
> No, I do not think that a specific version of the standard is required for
> the request above. As far as I understand, a conversion from UTF-8 to
> UTF-16BE//NFC would have the same output for the same characters, so that is
> sufficient for me regardless of the underlying standard.

That's not correct.  New characters may do away with the need for using combining characters to represent some glyphs, and so NFC results change.

> My main use case for this request is being able to convert UTF-8 input to a
> UTF-16BE string under the NFC rules (to be used as a password, and thus the
> output encoding must be reproducible).

That's not going to work with glibc, sorry.  You need to record the specific version of Unicode/NFC to use and have support tables for that.  icu should provide this.

Comment 4 Florian Weimer 2016-10-05 15:59:02 UTC

Hmm, if this is the normalization that is used:

  http://www.unicode.org/reports/tr15/#Stability_of_Normalized_Forms

then it should be stable.

So while this is traditionally the domain of ICU, we might be able to support this in glibc.

Do you need this in iconv, or would a wchar_t *-to-wchar_t * conversion do the job as well?

Comment 5 Nikos Mavrogiannopoulos 2016-11-02 12:09:07 UTC

(In reply to Florian Weimer from comment #4)
> Hmm, if this is the normalization that is used:
> 
>   http://www.unicode.org/reports/tr15/#Stability_of_Normalized_Forms
> 
> then it should be stable.
> So while this is traditionally the domain of ICU, we might be able to
> support this in glibc.

As I see it, glibc already supports character conversions to and from UTF-8. Without normalization, that means that a modern library which has to conform to (any) standard which involves UTF-8 has no way to specify (or even know) the normalization of the output data.

> Do you need this in iconv, or would a wchar_t *-to-wchar_t * conversion do
> the job as well?

I use only iconv(), so I cannot talk about the other APIs.

Comment 6 Nikos Mavrogiannopoulos 2016-11-08 13:15:59 UTC

I no longer think that the libc-provided APIs are reasonable for Unicode processing. I am switching to libunistring.
