Bug 100938 - Incorrect collation order
Summary: Incorrect collation order
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: 1
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jakub Jelinek
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2003-07-27 17:01 UTC by Alan Cox
Modified: 2007-11-30 22:10 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-07-26 15:38:59 UTC
Type: ---
Embargoed:


Attachments
CY ordering (UTF-8) (398 bytes, text/plain)
2004-09-28 13:21 UTC, Alan Cox

Description Alan Cox 2003-07-27 17:01:12 UTC
Description of problem:

In cy_GB.UTF-8 the collation order correctly sorts unaccented symbols. Accented
symbols are supposed to be sorted with the accent ignored, but this does not occur.

(That is, the symbols aeiouwy with the accented forms a^ a/ a\ a" etc.)  [I can't
put the symbols in because bugzilla's form seems to be 8859-1]

Comment 1 Ulrich Drepper 2004-09-28 04:11:55 UTC
Attach a test file, i.e., a file with different words on separate
lines, with the lines in the order in which they must appear.  These
need not be real words; arbitrary character sequences are OK.

Comment 2 Alan Cox 2004-09-28 13:21:57 UTC
Created attachment 104429 [details]
CY ordering (UTF-8)

Comment 3 Ulrich Drepper 2005-07-26 15:38:59 UTC
I actually tried this now.  The sorting order seems to be correct
already/meanwhile.

a
A
á
Á
à
À
â
Â
ä
Ä
b
B

This is the beginning.  At the highest level, the various accented characters are
sorted along with the non-accented variant.  The only difference between this
sorting and what I think you hint at in the ordering file is that the accents
should be treated with a lower priority than the case.  But that is a choice of
the collation standard.  I do not have the intention to change that.  It would
mean changing the entire huge collation file.  E.g.,

<U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a
<U00E1> <a>;<ACA>;<MIN>;IGNORE # 200
<U0041> <a>;<BAS>;<CAP>;IGNORE # 319 A
<U00C1> <a>;<ACA>;<CAP>;IGNORE # 320

These are the entries for 'a' and 'á'.  If you wanted the accents to have a
lower priority, each and every character definition would need its second and
third fields reversed:

<U0061> <a>;<MIN>;<BAS>;IGNORE # 198 a

This is not only a lot of work (which I won't do), it also would mean that this
locale is different from any other locale in this respect.
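
To make the difference between the two level orders concrete, here is a minimal Python sketch. It is only an illustration and not how glibc implements collation: the `weights` helper, the `ACCENT_RANK` table and both key functions are hypothetical names, and the accent ranks are assumptions picked to reproduce the listing earlier in this comment.

```python
import unicodedata

# Hypothetical accent ranks (an assumption, not taken from the locale file):
# no accent, acute, grave, circumflex, diaeresis.
ACCENT_RANK = {"": 0, "\u0301": 1, "\u0300": 2, "\u0302": 3, "\u0308": 4}


def weights(ch):
    """Return (base letter, accent rank, case rank) for one character."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0].lower()
    # Unknown accents sort after all known ones.
    accent = ACCENT_RANK.get(decomposed[1:], len(ACCENT_RANK))
    case = 0 if ch.islower() else 1  # lowercase (<MIN>) before uppercase (<CAP>)
    return base, accent, case


def key_accent_then_case(word):
    """Compare base letters first, then accents, then case."""
    w = [weights(ch) for ch in word]
    return ([b for b, _, _ in w], [a for _, a, _ in w], [c for _, _, c in w])


def key_case_then_accent(word):
    """Same levels with the second and third swapped: case before accents."""
    w = [weights(ch) for ch in word]
    return ([b for b, _, _ in w], [c for _, _, c in w], [a for _, a, _ in w])


# a, A, b, B plus the acute, grave, circumflex and diaeresis forms of a/A.
SAMPLE = ["B", "\u00e4", "A", "\u00e1", "b", "a", "\u00c1",
          "\u00e0", "\u00c0", "\u00e2", "\u00c2", "\u00c4"]

if __name__ == "__main__":
    print(" ".join(sorted(SAMPLE, key=key_accent_then_case)))
    print(" ".join(sorted(SAMPLE, key=key_case_then_accent)))
```

Under these assumed weights the first key yields a A á Á à À â Â ä Ä b B, matching the listing, while the second yields a á à â ä A Á À Â Ä b B, which is roughly what reversing the second and third fields would produce.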

I'm closing the bug as WORKSFORME since this is what I think it does.

