Description of problem:

The UTF-8 and UTF-16 encodings do not by themselves provide a unique encoding of a string: there are multiple normalization forms (e.g., NFC, NFD) [0] which produce different byte sequences for the same text. Older standards such as PKIX and ASN.1 simply ask for UTF-8 or UTF-16 fields, but newer standards mandate a specific normalization form (e.g., see https://tools.ietf.org/html/rfc7512#section-5 ). As a result, the iconv_open() interface as it stands is not sufficient for such applications: they can convert to UTF-16BE, but they have no way to specify UTF-16BE//NFC or an alternative form (see the sketch after the list below). Moreover, the iconv_open() manual page does not mention the normalization form of its default output at all.

[0] http://unicode.org/reports/tr15/

How this could be addressed:
- The iconv_open manual page documents the default normalization form.
- iconv_open() allows a normalization form to be specified.
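Here is a minimal sketch of what works today and of what is being requested. The "UTF-16BE//NFC" spelling is only a hypothetical illustration of the requested extension, not something current glibc accepts:

#include <iconv.h>
#include <string.h>

/* Convert a UTF-8 string to UTF-16BE with iconv.  Returns the number of
   output bytes, or -1 on error. */
int convert_password (const char *utf8, char *out, size_t outsize)
{
  iconv_t cd = iconv_open ("UTF-16BE", "UTF-8");      /* works today */
  /* iconv_t cd = iconv_open ("UTF-16BE//NFC", "UTF-8");  <- requested form,
     purely hypothetical: the //NFC suffix does not exist in glibc */
  if (cd == (iconv_t) -1)
    return -1;

  char *inp = (char *) utf8;
  size_t inleft = strlen (utf8);
  char *outp = out;
  size_t outleft = outsize;
  size_t r = iconv (cd, &inp, &inleft, &outp, &outleft);
  iconv_close (cd);
  return r == (size_t) -1 ? -1 : (int) (outsize - outleft);
}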
Do you need a specific version of the Unicode standard for stability? Then iconv will never be what you need because the plan is only to support the latest Unicode standard at the time of a glibc release.
No, I do not think that a specific version of the standard is required for the request above. As far as I understand, a conversion from UTF-8 to UTF-16BE//NFC would produce the same output for the same characters, so that is sufficient for me regardless of the underlying standard. My main use case for this request is being able to convert UTF-8 input to a UTF-16BE string under the NFC rules (to be used as a password, and thus the output encoding must be reproducible).
(In reply to Nikos Mavrogiannopoulos from comment #2)
> No, I do not think that a specific version of the standard is required for
> the request above. As far as I understand, a conversion from UTF-8 to
> UTF-16BE//NFC would produce the same output for the same characters, so
> that is sufficient for me regardless of the underlying standard.

That's not correct. New characters may do away with the need for using combining characters to represent some glyphs, and so NFC results can change across Unicode versions.

> My main use case for this request is being able to convert UTF-8 input to a
> UTF-16BE string under the NFC rules (to be used as a password, and thus the
> output encoding must be reproducible).

That's not going to work with glibc, sorry. You need to record the specific version of Unicode/NFC to use and have support tables for that version. ICU should provide this.
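A rough sketch of what the ICU route could look like, using ICU4C's unorm2 API (the buffer handling here is only illustrative, and the Unicode version in effect is whatever the linked ICU library ships):

#include <unicode/unorm2.h>
#include <unicode/ustring.h>

/* Decode UTF-8 and normalize to NFC.  The result is ICU's host-endian
   UTF-16 (UChar); a byte swap is still needed if UTF-16BE is required. */
int32_t utf8_to_utf16_nfc (const char *utf8, UChar *dest, int32_t capacity)
{
  UErrorCode status = U_ZERO_ERROR;
  UChar buf[256];                       /* fixed size only for illustration */
  int32_t len;

  /* Decode UTF-8 into ICU's native UTF-16 representation. */
  u_strFromUTF8 (buf, (int32_t) (sizeof buf / sizeof buf[0]), &len,
                 utf8, -1, &status);
  if (U_FAILURE (status))
    return -1;

  /* Normalize to NFC. */
  const UNormalizer2 *nfc = unorm2_getNFCInstance (&status);
  int32_t outlen = unorm2_normalize (nfc, buf, len, dest, capacity, &status);
  return U_FAILURE (status) ? -1 : outlen;
}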
Hmm, if this is the normalization that is used:

http://www.unicode.org/reports/tr15/#Stability_of_Normalized_Forms

then it should be stable. So while this is traditionally the domain of ICU, we might be able to support this in glibc.

Do you need this in iconv, or would a wchar_t *-to-wchar_t * conversion do the job as well?
(In reply to Florian Weimer from comment #4)
> Hmm, if this is the normalization that is used:
>
> http://www.unicode.org/reports/tr15/#Stability_of_Normalized_Forms
>
> then it should be stable.
> So while this is traditionally the domain of ICU, we might be able to
> support this in glibc.

As I see it, glibc already supports character conversions to and from UTF-8. Without normalization, that means a modern library which has to conform to any standard involving UTF-8 has no way to specify (or even know) the normalization form of the output data.

> Do you need this in iconv, or would a wchar_t *-to-wchar_t * conversion do
> the job as well?

I use only iconv(), so I cannot speak about the other APIs.
I no longer think that the libc-provided APIs are reasonable for Unicode processing. I am switching to libunistring.
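For reference, a sketch of the libunistring approach (my own illustration, not a finished implementation; the output is host-endian UTF-16, so a byte swap is still needed for UTF-16BE on little-endian machines):

#include <uninorm.h>
#include <unistr.h>
#include <stdlib.h>
#include <string.h>

/* Convert a UTF-8 string to NFC-normalized UTF-16.  Returns a malloc'd
   buffer (caller frees) and stores its length in *lengthp, or NULL on error. */
uint16_t *utf8_to_utf16_nfc (const char *utf8, size_t *lengthp)
{
  size_t n16;
  uint16_t *u16 = u8_to_u16 ((const uint8_t *) utf8, strlen (utf8),
                             NULL, &n16);
  if (u16 == NULL)
    return NULL;

  /* Normalize the UTF-16 string to NFC. */
  uint16_t *nfc = u16_normalize (UNINORM_NFC, u16, n16, NULL, lengthp);
  free (u16);
  return nfc;
}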