Bug 537753 - glibc: wcscoll/strcoll needs to check if the char is out of the locale collation.
Summary: glibc: wcscoll/strcoll needs to check if the char is out of the locale collat...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: 12
Hardware: All
OS: Linux
low
medium
Target Milestone: ---
Assignee: Andreas Schwab
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2009-11-16 08:55 UTC by fujiwara
Modified: 2010-07-01 04:58 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-07-01 04:58:58 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
Example test case (1.51 KB, text/plain)
2009-11-16 08:55 UTC, fujiwara
no flags Details
Patch for glibc/string/strcoll_l.c, glibc/string/strxfrm_l.c (1.15 KB, patch)
2009-11-16 08:57 UTC, fujiwara
no flags Details | Diff
Patch for glibc/string/strcoll_l.c, glibc/string/strxfrm_l.c, glibc/wcsmbs/wcscoll_l.c, glibc/wcsmbs/wcsxfrm_l.c (3.77 KB, patch)
2009-11-20 09:21 UTC, fujiwara
no flags Details | Diff
Patch for glibc/string/strcoll_l.c, glibc/string/strxfrm_l.c, glibc/wcsmbs/wcscoll_l.c, glibc/wcsmbs/wcsxfrm_l.c (3.77 KB, patch)
2009-11-20 09:24 UTC, fujiwara
no flags Details | Diff

Description fujiwara 2009-11-16 08:55:24 UTC
Created attachment 369656 [details]
Example test case

Currently I'm thinking how to sort UTF-8 strings on GNOME/GDM.

GDM uses g_utf8_collate() to sort the language names.
http://git.gnome.org./cgit/glib/tree/glib/gunicollate.c

g_utf8_collate() uses wcscoll() internally.

----------------------------
 70:gint
 71: g_utf8_collate (const gchar *str1,
 72:		const gchar *str2)
 73:{
...
109:  result = wcscoll ((wchar_t *)str1_norm, (wchar_t *)str2_norm);
...
154:}
----------------------------

However if the chars are not defined in the locale collation, the returned value is not correct.
I'm attaching the test program.
You could see the returned value is not correct on ja_JP.UTF-8.

Now I think it's better to set errno if the char is not defined in the collation.
Regarding to WCSCOLL(3P):

----------------------------
RETURN VALUE
On  error,  wcscoll()  shall  set
       errno, but no return value is reserved to indicate an error.

ERRORS
       The wcscoll() function may fail if:

       EINVAL The  ws1  or  ws2 arguments contain wide-character codes outside
              the domain of the collating sequence.
----------------------------

If wcscoll/strcoll/wscxfrm/strxfrm would set errno, I could enhance g_utf8_collate(_key) later.

Comment 1 fujiwara 2009-11-16 08:57:23 UTC
Created attachment 369658 [details]
Patch for glibc/string/strcoll_l.c, glibc/string/strxfrm_l.c

The patch sets EINVAL if the value is out of the table.

Comment 2 Bug Zapper 2009-11-16 15:41:36 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 3 Andreas Schwab 2009-11-17 13:12:44 UTC
In which way is the value not correct?

Comment 4 fujiwara 2009-11-18 00:41:13 UTC
If the char is not defined in the locale collation, wcscoll() doesn't return the right value.

E.g. if you run the attachment 369656 [details] on ja_JP.UTF-8, it compares Korean char and ASCII char on ja_JP.UTF-8 but the Korean char is not defined in the ja collation table and then wcscoll() returns the minus value but the returned value is not correct because the Korean char value is out of the ja collation table.
Actually I would expect the returned value is a plus value in this case if the char value is not defined in the collation table.

As I noted comment #0, the man of wcscoll explains to set EINVAL if the char is not defined in the locale collating sequence.

I'd like to know whether the char is defined in the collation table or not but currently wcscoll()/strcoll() doesn't set errno.
My understanding is, __collidx_table_lookup() checks if the char is defined in the collation table so my suggestion is to set errno if the char is not defined in the table.

Comment 5 fujiwara 2009-11-18 02:01:05 UTC
(In reply to comment #4)
> Actually I would expect the returned value is a plus value in this case if the
> char value is not defined in the collation table.

Correction:
Actually I would expect the returned value is a plus value in this case if the char value is defined in the collation table.

Comment 6 Andreas Schwab 2009-11-18 10:46:47 UTC
The ja_JP.UTF-8 collation table contains all UTF-8 characters, so wcscoll cannot return an error.

Comment 7 fujiwara 2009-11-18 12:02:36 UTC
(In reply to comment #6)
> The ja_JP.UTF-8 collation table contains all UTF-8 characters, so wcscoll
> cannot return an error.  

I don't think so.
E.g. glibc/localedata/locales/ja_JP LC_COLLATE doesn't include U+D55C .
Reopening the bug.

Comment 8 fujiwara 2009-11-18 12:14:13 UTC
If you think ja_JP should define all UTF-8 collation, this would be a bug of glibc/localedata/locales/ja_JP.

Please note: I don't say ja_JP only but all locales.
I need to know the way to get whether the char is defined in the locale collation or not with C API.

Comment 9 Andreas Schwab 2009-11-18 12:28:04 UTC
All character not explicitly mentioned are collated after all others.

Comment 10 fujiwara 2009-11-18 12:56:46 UTC
(In reply to comment #9)
> All character not explicitly mentioned are collated after all others.  

I don't understand your mention.
Do you mean if a Korean char is not defined in ja_JP, the char is collated after all Japanese chars?

Comment 11 fujiwara 2009-11-18 13:23:18 UTC
Reset the status since I don't get the reply above.

If the answer of my question is yes, the actual result is different from your mention, i.e. I think this bus is still valid.
Your reply would be too short for me. Can we talk with chat or email?

Comment 12 fujiwara 2009-11-20 09:21:31 UTC
Created attachment 372439 [details]
Patch for glibc/string/strcoll_l.c, glibc/string/strxfrm_l.c, glibc/wcsmbs/wcscoll_l.c, glibc/wcsmbs/wcsxfrm_l.c

This patch is another solution.

if a char is not defined in glib/localedata/locales/ja_JP, __collidx_table_lookup() returns 0.
If we could use __collseq_table_lookup() instead, it can return the max value for the undefined char.
But I think we need to use __collidx_table_lookup() for wcscoll() since the size of locale collation is unclear.

But the problem is when we receive 0, U+0 is actually defined in glib/localedata/locales/ja_JP LC_COLLATION and the result is undefined chars are always collated before defined chars in wcscoll().

E.g. If I think a is ASCII char, b is a Japanese char, c is a Korean char, the collation would be c < a < b on ja_JP.UTF-8 since U+0 is defined in ja_JP file.

But if you look at ja_JP file, the file also defines "UNDEFINED" in LC_COLLATE.
UNDEFINED char should be collated at last.
But the word "UNDEFINED" seems to be used in localedef program only.
If we run wcscoll(), we don't know which index of weight[] is the UNDEFINED value.
So my solution is, if wcscoll() receives 0 from findidx(), wcscoll() use USTRING_MAX instead of weight[].

If I see zh_CN file, U+0 is not defined. The undefined chars are always collated before defined chars in wcscoll() because the following line effects the result in wcscoll():

	  result = seq1len == 0 ? -1 : 1;

seq1len is 0 but the string is not shorter than the other in this case.
The string is not defined in the locale collation in this case actually.

I'd modified this part.

Probably it's good for wcscoll() to follow the 'UNDEFINED' keyword in the locale collation file and I think 'UNDEFINED' should be put in the last of the LC_COLLATE.

Comment 13 fujiwara 2009-11-20 09:24:12 UTC
Created attachment 372440 [details]
Patch for glibc/string/strcoll_l.c, glibc/string/strxfrm_l.c, glibc/wcsmbs/wcscoll_l.c, glibc/wcsmbs/wcsxfrm_l.c

Correct the file names in the patch slightly.

Comment 14 fujiwara 2010-07-01 04:58:58 UTC
During the discussion in glibc community, I found there are several problems if we fix this in glibc side.
Currently I'm discussing this issue in the upper layer application so closing this bug at the moment.


Note You need to log in before you can comment on or make changes to this bug.