Created attachment 369656 [details] Example test case Currently I'm thinking how to sort UTF-8 strings on GNOME/GDM. GDM uses g_utf8_collate() to sort the language names. http://git.gnome.org./cgit/glib/tree/glib/gunicollate.c g_utf8_collate() uses wcscoll() internally. ---------------------------- 70:gint 71: g_utf8_collate (const gchar *str1, 72: const gchar *str2) 73:{ ... 109: result = wcscoll ((wchar_t *)str1_norm, (wchar_t *)str2_norm); ... 154:} ---------------------------- However if the chars are not defined in the locale collation, the returned value is not correct. I'm attaching the test program. You could see the returned value is not correct on ja_JP.UTF-8. Now I think it's better to set errno if the char is not defined in the collation. Regarding to WCSCOLL(3P): ---------------------------- RETURN VALUE On error, wcscoll() shall set errno, but no return value is reserved to indicate an error. ERRORS The wcscoll() function may fail if: EINVAL The ws1 or ws2 arguments contain wide-character codes outside the domain of the collating sequence. ---------------------------- If wcscoll/strcoll/wscxfrm/strxfrm would set errno, I could enhance g_utf8_collate(_key) later.
Created attachment 369658 [details] Patch for glibc/string/strcoll_l.c, glibc/string/strxfrm_l.c The patch sets EINVAL if the value is out of the table.
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle. Changing version to '12'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
In which way is the value not correct?
If the char is not defined in the locale collation, wcscoll() doesn't return the right value. E.g. if you run the attachment 369656 [details] on ja_JP.UTF-8, it compares Korean char and ASCII char on ja_JP.UTF-8 but the Korean char is not defined in the ja collation table and then wcscoll() returns the minus value but the returned value is not correct because the Korean char value is out of the ja collation table. Actually I would expect the returned value is a plus value in this case if the char value is not defined in the collation table. As I noted comment #0, the man of wcscoll explains to set EINVAL if the char is not defined in the locale collating sequence. I'd like to know whether the char is defined in the collation table or not but currently wcscoll()/strcoll() doesn't set errno. My understanding is, __collidx_table_lookup() checks if the char is defined in the collation table so my suggestion is to set errno if the char is not defined in the table.
(In reply to comment #4) > Actually I would expect the returned value is a plus value in this case if the > char value is not defined in the collation table. Correction: Actually I would expect the returned value is a plus value in this case if the char value is defined in the collation table.
The ja_JP.UTF-8 collation table contains all UTF-8 characters, so wcscoll cannot return an error.
(In reply to comment #6) > The ja_JP.UTF-8 collation table contains all UTF-8 characters, so wcscoll > cannot return an error. I don't think so. E.g. glibc/localedata/locales/ja_JP LC_COLLATE doesn't include U+D55C . Reopening the bug.
If you think ja_JP should define all UTF-8 collation, this would be a bug of glibc/localedata/locales/ja_JP. Please note: I don't say ja_JP only but all locales. I need to know the way to get whether the char is defined in the locale collation or not with C API.
All character not explicitly mentioned are collated after all others.
(In reply to comment #9) > All character not explicitly mentioned are collated after all others. I don't understand your mention. Do you mean if a Korean char is not defined in ja_JP, the char is collated after all Japanese chars?
Reset the status since I don't get the reply above. If the answer of my question is yes, the actual result is different from your mention, i.e. I think this bus is still valid. Your reply would be too short for me. Can we talk with chat or email?
Created attachment 372439 [details] Patch for glibc/string/strcoll_l.c, glibc/string/strxfrm_l.c, glibc/wcsmbs/wcscoll_l.c, glibc/wcsmbs/wcsxfrm_l.c This patch is another solution. if a char is not defined in glib/localedata/locales/ja_JP, __collidx_table_lookup() returns 0. If we could use __collseq_table_lookup() instead, it can return the max value for the undefined char. But I think we need to use __collidx_table_lookup() for wcscoll() since the size of locale collation is unclear. But the problem is when we receive 0, U+0 is actually defined in glib/localedata/locales/ja_JP LC_COLLATION and the result is undefined chars are always collated before defined chars in wcscoll(). E.g. If I think a is ASCII char, b is a Japanese char, c is a Korean char, the collation would be c < a < b on ja_JP.UTF-8 since U+0 is defined in ja_JP file. But if you look at ja_JP file, the file also defines "UNDEFINED" in LC_COLLATE. UNDEFINED char should be collated at last. But the word "UNDEFINED" seems to be used in localedef program only. If we run wcscoll(), we don't know which index of weight[] is the UNDEFINED value. So my solution is, if wcscoll() receives 0 from findidx(), wcscoll() use USTRING_MAX instead of weight[]. If I see zh_CN file, U+0 is not defined. The undefined chars are always collated before defined chars in wcscoll() because the following line effects the result in wcscoll(): result = seq1len == 0 ? -1 : 1; seq1len is 0 but the string is not shorter than the other in this case. The string is not defined in the locale collation in this case actually. I'd modified this part. Probably it's good for wcscoll() to follow the 'UNDEFINED' keyword in the locale collation file and I think 'UNDEFINED' should be put in the last of the LC_COLLATE.
Created attachment 372440 [details] Patch for glibc/string/strcoll_l.c, glibc/string/strxfrm_l.c, glibc/wcsmbs/wcscoll_l.c, glibc/wcsmbs/wcsxfrm_l.c Correct the file names in the patch slightly.
During the discussion in glibc community, I found there are several problems if we fix this in glibc side. Currently I'm discussing this issue in the upper layer application so closing this bug at the moment.