Bug 17005
| Summary: | Broken sorting with Swedish locale | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Linux | Reporter: | Christian Rose <menthos> |
| Component: | glibc | Assignee: | Jakub Jelinek <jakub> |
| Status: | CLOSED RAWHIDE | Severity: | medium |
| Priority: | medium | Version: | 7.1 |
| Hardware: | i386 | OS: | Linux |
| CC: | drepper, goeran | Doc Type: | Bug Fix |
| Last Closed: | 2000-09-01 11:10:57 UTC | | |
Description
Christian Rose
2000-08-27 16:56:37 UTC
> +The result of ls -l was:
> +"A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k
> +l m n o p q r s t u v w x y z A-umlaut A-ring AE-character O-umlaut
> +O-slash U-umlaut a-umlaut a-ring ae-character o-umlaut o-slash u-umlaut"

This is expected. At least the version of make (which is not from RH7) is not using strcoll but strcmp. It should probably be changed...

> +The result of ls | sort was:
> +"a A ae-character AE-character b B c C d D e E f F g G h H i I j J k K l L
> +m M n N o O p q Q r R s S t T u U v V w W x X y Y u-umlaut U-umlaut z Z
> +a-ring A-ring a-umlaut A-umlaut o-umlaut O-umlaut o-slash O-slash"

Well, sv_SE had an old customized LC_COLLATE description. I left it in because I thought it was more correct. It's gone now. I'm working on improvements to localedef to allow customization of the generic LC_COLLATE specification in sv_SE.

> +* No difference is made between small/capital letters (although many seem
> +to prefer A sorted _after_ a, if that's the only difference in that
> +character position)

The upper/lower case relation also must be parametrized. The German rule is different (lower before upper).

> +* w is sorted/treated as v (however preferably sorted _after_ v, if it's
> +the only difference in that character position)

That's new to me. Are you sure this still is used in practice or is just something historic? Languages like English also had no 'w' for a long time but it got introduced and then, to be able to handle foreign words, is handled as a separate character just like it is in English etc today.

We'll have something changed available at some time.

>> +* w is sorted/treated as v (however preferably sorted _after_ v, if it's
>> +the only difference in that character position)
> That's new to me. Are you sure this still is used in practice or is
> just something historic? Languages like English also had no 'w' for a
> long time but it got introduced and then, to be able to handle foreign
> words, is handled as a separate character just like it is in English
> etc today.

W isn't used in Swedish. It only appears in foreign words like names and such. Similarly to other such characters, like the Danish ae-character and the German u-diaeresis, it is sorted as if it were the most similar Swedish character. In the case of w, the most similar character is v.

> The German rule is different (lower before upper).

There aren't any real rules in Swedish on the sorting of lower versus upper case. I just checked two reputable dictionaries, and they both sorted "bonde" before "Bonde". Either way could be argued.

Well, I've got the book "Svenska skrivregler" here (ISBN 9121112800, 1999, only available in Swedish though) by Svenska Spraknamnden ("Committee of the Swedish language"), which covers common rules and guidelines for Swedish writing. It also has a section on the alphabetical ordering of "w", which I'll quote (my rough translation):

"The letter w is normally not present in the Swedish alphabet. It exists in some names in Swedish and foreign words, but is accounted for as a variant of 'v'. Words and names with 'w' are in Swedish ordered alphabetically among the words and names with 'v'. If two words or names are only to be distinguished by 'v' or 'w', 'v' is placed before 'w'."

It goes on to tell how the situation is the same with "y" and "u-umlaut", and how "u-umlaut" should be treated the same as "y", and "y" ordered before "u-umlaut" if words are only to be distinguished by that letter.

There end the facts where I was correct, however ;-) The next section tells that small letters should indeed be placed BEFORE their capital counterparts. As in German, as you said. I was very wrong about that. I'm terribly sorry (also it could be noted that not just me but everybody I've spoken to, including Goeran, didn't know about this).
So "atlas" should be placed before "Atlas", and "sten" before "Sten", but "Armani" before "armatur".

So "a A b B c C d D e E f F g G h H i I j J k K l L m M n N o O p P q Q r R s S t T u U v V w W x X y Y u-umlaut U-umlaut z Z a-ring A-ring a-umlaut A-umlaut ae-character AE-character o-umlaut O-umlaut o-slash O-slash" should be the correct ordering in Swedish, with the small characters before their capitals.

There is also a note that numbers and other characters are normally ordered before alphabetical characters, but that's all that is said. I think this is the behavior that's already present in glibc though :)

There is also a recommendation for the ordering of characters that are not present in Swedish, based on how common they are (more common characters placed earlier) and their similarities. These characters are normally treated the same as the character they're based on, but if the only difference is that character, this is the recommended ordering (with my namings):

a (a-acute a-grave a-circumflex) b c (c-cedilla c-grave "c-inversecircumflex") d ("that small d-like character with a small bar") e (e-acute e-grave e-circumflex e-umlaut) f g h i (i-acute i-grave i-circumflex i-umlaut) j k l ("l with a small bar across") m n (n-acute n-tilde) o (o-acute o-grave o-circumflex) p q r ("r-inversecircumflex") s (s-acute "s-inversecircumflex") t u (u-acute u-grave u-circumflex) v (w) x y z a-ring a-umlaut (ae-character) o-umlaut (o-slash)

I've fixed this now. It required a significant amount of changes to localedef and the old LC_COLLATE specification is completely gone. We are now using the generic specification, customized according to the information you gave me. This happens without duplication. With sort I now get the order you provided. I don't know when this code will be available in an RPM to try but please let me know once you get a chance to try it.
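For readers who want to see the quoted rules in executable form, here is a minimal Python sketch of a multi-level sort key. It is not the glibc or localedef implementation; the names `BASE`, `VARIANT_OF` and `swedish_key` are made up for illustration. It assumes that the variant distinction (v before w, y before u-umlaut) is checked before the case distinction, which matches the single-letter ordering listed above, and it only handles the four variant letters named in this thread, not the accented letters from the longer recommendation.

```python
# Hypothetical sketch of the ordering rules discussed in this thread.
# Primary alphabet: a..z followed by a-ring, a-umlaut, o-umlaut.
# 'w' is deliberately absent because it is treated as a variant of 'v'.
BASE = "abcdefghijklmnopqrstuvxyzåäö"
PRIMARY = {ch: i for i, ch in enumerate(BASE)}

# Letters sorted as if they were another letter, but placed after it
# when that is the only difference.
VARIANT_OF = {"w": "v", "ü": "y", "æ": "ä", "ø": "ö"}


def swedish_key(word):
    """Return a three-level key: base letters, then variants, then case."""
    primary, variant, case = [], [], []
    for ch in word:
        lower = ch.lower()
        base = VARIANT_OF.get(lower, lower)
        if base in PRIMARY:
            primary.append((1, PRIMARY[base]))
        else:
            # Digits and other characters sort before all letters.
            primary.append((0, base))
        variant.append(0 if base == lower else 1)  # base letter first
        case.append(0 if ch == lower else 1)       # lowercase first
    return (primary, variant, case)


words = ["Atlas", "atlas", "armatur", "Armani", "sten", "Sten"]
print(sorted(words, key=swedish_key))
# ['Armani', 'armatur', 'atlas', 'Atlas', 'sten', 'Sten']
```

Because the base-letter level is compared for the whole word first, "Armani" lands before "armatur", while case only decides between otherwise identical words such as "atlas" and "Atlas".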
> The next section tells that small letters should indeed be placed BEFORE their
> capital counterparts. ... I was very wrong about that.
> ... everybody I've
> spoken to, including Goeran, didn't know about this).
I stand corrected. What's slightly unnerving is that that is the source I once
used to learn about the sorting of w and other foreign characters, and I thought
I remembered there wasn't anything about upper versus lower case. I guess I'm
getting old!
I've reread the section now. The only thing I can add to what Christian said
is on the treatment of the ae-character. The Danish character is considered the
same as a-diaeresis as just described. In Latin words, however, the ae-character
is considered two distinct letters written together and sorted as a + e. Now
that's a challenge for the localedef. :-) (Seriously, consider it a Danish
character; that will be the common case.)
Ulrich's patches have made it into glibc-2.1.92-14. Thanks.

Now that Bugzilla is up again it's probably time for a follow-up: Tested with sort and glibc 2.1.92-14 on a Red Hat 7 system, and it works exactly as expected, with the rules outlined above. A big thanks!
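For anyone repeating this kind of check without the sort utility, the following Python sketch contrasts plain code-point ordering (comparable to what strcmp gives) with the C library's locale collation (the strcoll/strxfrm path). The helper names are illustrative, and the locale name "sv_SE.UTF-8" is an assumption: it has to be installed on the system, otherwise the locale-aware helper returns None. The result also depends on the glibc version in use.

```python
import locale


def codepoint_sort(words):
    """Sort by raw code point, comparable to a strcmp-based sort."""
    return sorted(words)


def locale_sort(words, name="sv_SE.UTF-8"):
    """Sort using the C library's collation rules for the given locale.

    Returns None if the requested locale is not installed.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        # Restore whatever collation locale was active before.
        locale.setlocale(locale.LC_COLLATE, previous)


words = ["Atlas", "atlas", "sten", "Sten", "ärt", "zon"]
print(codepoint_sort(words))  # capitals first, accented letters last
print(locale_sort(words))     # locale order, or None if sv_SE is missing
```

The code-point result shows the same pattern as the ls output quoted at the top of the thread: all capitals, then all small letters, then the accented characters.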