17005 – Broken sorting with Swedish locale

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	glibc
Sub Component:
Version:	7.1
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Jakub Jelinek
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2000-08-27 16:56 UTC by Christian Rose
Modified:	2008-05-01 15:37 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2000-09-01 11:10:57 UTC
Embargoed:

Description Christian Rose 2000-08-27 16:56:37 UTC

When testing sorting under RC2, I discovered that it doesn't sort according
to the Swedish standard.
These were the contents of my sample dir:
"A B C D E F G H I J K L M N O P Q R S T U U-umlaut V W X Y Z A-ring
A-umlaut AE-character O-umlaut O-slash a b c d e f g h i j k l m n o p q r
s t u v w x y z a-ring a-umlaut ae-character o-umlaut o-slash"
Each of these was a file with a one-character name (Bugzilla doesn't like
Latin-1 characters, so I spelled them out in this bug report).


The result of ls -l was:
"A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k
l m n o p q r s t u v w x y z A-umlaut A-ring AE-character O-umlaut
O-slash U-umlaut a-umlaut a-ring ae-character o-umlaut o-slash u-umlaut"


The result of ls | sort was:
"a A ae-character AE-character b B c C d D e E f F g G h H i I j J k K l L
m M n N o O p q Q r R s S t T u U v V w W x X y Y u-umlaut U-umlaut z Z
a-ring A-ring a-umlaut A-umlaut o-umlaut O-umlaut o-slash O-slash"

A bit better, but not entirely correct.


These are the ordering rules, commonly used in Swedish:

* No difference is made between small/capital letters (although many seem
to prefer a sorted _after_ A, if that's the only difference in that
character position)

* a-ring, a-umlaut and o-umlaut are sorted (in that order) after the a-z
letters

* w is sorted/treated as v (however preferably sorted _after_ v, if it's
the only difference in that character position)

* u-umlaut is sorted/treated as y (however preferably sorted _after_ y, if
it's the only difference in that character position)

* ae-character is sorted/treated as a-umlaut (however preferably sorted
_after_ a-umlaut, if it's the only difference in that character position)

* o-slash is sorted/treated as o-umlaut (however preferably sorted _after_
o-umlaut, if it's the only difference in that character position)


Hence, a correctly sorted output from ls or sort should be (according to
these rules):

"A a B b C c D d E e F f G g H h I i J j K k L l M m N n O o P p Q q R r S
s T t U u V v W w X x Y y U-umlaut u-umlaut Z z A-ring a-ring A-umlaut
a-umlaut AE-character ae-character O-umlaut o-umlaut O-slash o-slash"


My locale during my tests was:
LANG=sv_SE
LC_CTYPE="sv_SE"
LC_NUMERIC="sv_SE"
LC_TIME="sv_SE"
LC_COLLATE="sv_SE"
LC_MONETARY="sv_SE"
LC_MESSAGES="sv_SE"
LC_PAPER="sv_SE"
LC_NAME="sv_SE"
LC_ADDRESS="sv_SE"
LC_TELEPHONE="sv_SE"
LC_MEASUREMENT="sv_SE"
LC_IDENTIFICATION="sv_SE"
LC_ALL=

Comment 1 Ulrich Drepper 2000-08-27 20:37:33 UTC

> +The result of ls -l was:
> +"A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k
> +l m n o p q r s t u v w x y z A-umlaut A-ring AE-character O-umlaut
> +O-slash U-umlaut a-umlaut a-ring ae-character o-umlaut o-slash u-umlaut"

This is expected.  At least the version of make (which is not from
RH7) is not using strcoll but strcmp.  It should probably be changed...
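
To illustrate the strcmp versus strcoll difference mentioned here, the following is a minimal Python sketch, not the code of ls or sort. It assumes the file names were single Latin-1 characters, as the report describes. The helper names `bytewise_sorted` and `locale_sorted`, and the locale name `sv_SE.UTF-8`, are assumptions made for this example. Sorting by raw byte value reproduces the order the reporter got from ls, while the locale-aware variant depends on whatever LC_COLLATE data is installed.

```python
import locale
import string

# One-character "file names" as in the report (spelled out there as
# "A-ring", "A-umlaut", and so on).
names = list(string.ascii_uppercase + string.ascii_lowercase + "ÅÄÆÖØÜåäæöøü")


def bytewise_sorted(items):
    """Sort by raw Latin-1 byte values, which is effectively what strcmp does."""
    return sorted(items, key=lambda s: s.encode("latin-1"))


def locale_sorted(items, locale_name="sv_SE.UTF-8"):
    """Sort using the locale's LC_COLLATE rules (the strcoll/strxfrm approach).

    The locale name is an assumption; returns None if it is not installed.
    """
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    return sorted(items, key=locale.strxfrm)


# Uppercase ASCII, then lowercase ASCII, then the accented letters in
# Latin-1 code order, matching the ls output quoted above.
print(" ".join(bytewise_sorted(names)))
```

Calling `locale_sorted(names)` on a system with a Swedish locale shows what the installed collation data does; the result varies with the glibc version, which is the subject of this bug.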

> +The result of ls | sort was:
> +"a A ae-character AE-character b B c C d D e E f F g G h H i I j J k K l L
> +m M n N o O p q Q r R s S t T u U v V w W x X y Y u-umlaut U-umlaut z Z
> +a-ring A-ring a-umlaut A-umlaut o-umlaut O-umlaut o-slash O-slash"

Well, sv_SE had an old customized LC_COLLATE description.  I left it in
because I thought it was more correct.  It's gone now.  I'm working on
improvements to localedef to allow customization of the generic
LC_COLLATE specification in sv_SE.

> +* No difference is made between small/capital letters (although many seem
> +to prefer a sorted _after_ A, if that's the only difference in that
> +character position)

The upper/lower case relation also must be parametrized.  The German
rule is different (lower before upper).

> +* w is sorted/treated as v (however preferably sorted _after_ v, if it's
> +the only difference in that character position)

That's new to me.  Are you sure this still is used in practice or is
just something historic?  Languages like English also had no 'w' for a
long time but it got introduced and then, to be able to handle foreign
words, is handled as a separate character just like it is in English
etc today.

We'll have a changed version available at some point.

Comment 2 Göran Uddeborg 2000-08-27 21:29:58 UTC

>> +* w is sorted/treated as v (however preferably sorted _after_ v, if it's
>> +the only difference in that character position)

> That's new to me.  Are you sure this still is used in practice or is
> just something historic?  Languages like English also had no 'w' for a
> long time but it got introduced and then, to be able to handle foreign
> words, is handled as a separate character just like it is in English
> etc today.

W isn't used in Swedish. It only appears in foreign words, such as names.
Like other such characters, for example the Danish ae-character and the German
u-diaeresis, it is sorted as if it were the most similar Swedish character. In the
case of w, the most similar character is v.

> The German rule is different (lower before upper).

There aren't any real rules in Swedish on the sorting of lower versus upper case.
I just checked two reputable dictionaries, and they both sorted "bonde" before
"Bonde". Either way could be argued.

Comment 3 Christian Rose 2000-08-27 22:16:37 UTC

Well, I've got the book "Svenska skrivregler" here (ISBN 9121112800, 1999, only
available in Swedish though) by Svenska Spraknamnden ("Committee of the Swedish
language"), which covers common rules and guidelines for Swedish writing.

It also has a section on the alphabetical ordering of "w", which I'll quote (my
rough translation):

"The letter w is normally not present in the Swedish alphabet. It exists in some
names in Swedish and foreign words, but is accounted for as a variant of 'v'.
Words and names with 'w' are in Swedish ordered alphabetically among the words
and names with 'v'. If two words or names are only to be distinguished by 'v' or
'w', 'v' is placed before 'w'."

It goes on to tell how the situation is the same with "y" and "u-umlaut", and
how "u-umlaut" should be treated the same as "y", and "y" ordered before
"u-umlaut" if words are only to be distinguished by that letter.


There end the facts where I was correct, however ;-)
The next section says that small letters should indeed be placed BEFORE their
capital counterparts, as in German, as you said. I was very wrong about that.
I'm terribly sorry (it should also be noted that not just I but everybody I've
spoken to, including Goeran, didn't know about this).

So "atlas" should be placed before "Atlas", and "sten" before "Sten", but
"Armani" before "armatur".

So

"a A b B c C d D e E f F g G h H i I j J k K l L m M n N o O p P q Q r R s S t T
u U v V w W x X y Y u-umlaut U-umlaut z Z a-ring A-ring a-umlaut A-umlaut
ae-character AE-character o-umlaut O-umlaut o-slash O-slash"

should be the correct ordering in Swedish, with the small characters before
their capitals.


There is also a note that numbers and other characters are normally ordered
before alphabetical characters, but that's all that is said. I think this is the
behavior that's already present in glibc though :)


There is also a recommendation for the ordering of characters that are not
present in Swedish, based on how common they are (more common characters placed
earlier) and their similarities. These characters are normally treated the same
as the character they're based on, but if the only difference is that character,
this is the recommended ordering (with my namings):
a (a-acute a-grave a-circumflex)
b
c (c-cedilla c-grave "c-inversecircumflex")
d ("that small d-like character with a small bar")
e (e-acute e-grave e-circumflex e-umlaut)
f
g
h
i (i-acute i-grave i-circumflex i-umlaut)
j
k
l ("l with a small bar across")
m
n (n-acute n-tilde)
o (o-acute o-grave o-circumflex)
p
q
r ("r-inversecircumflex")
s (s-acute "s-inversecircumflex")
t
u (u-acute u-grave u-circumflex)
v (w)
x
y
z
a-ring
a-umlaut (ae-character)
o-umlaut (o-slash)
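
The rules above can be sketched as a multi-level sort key. This is a minimal Python illustration under stated assumptions, not how localedef or glibc implements LC_COLLATE. The names `BASE`, `GROUPS`, `RANK` and `swedish_key` are made up for this example, only a subset of the variant characters listed above is included, and non-letters are simply placed before letters by code point. Words are compared first by base letter (so w counts as v), then by variant (v before w), then by case (small before capital).

```python
# Swedish alphabet as used for primary ordering; w is folded into v.
BASE = "abcdefghijklmnopqrstuvxyzåäö"

# Variants treated like their base letter, in the recommended tie-break order.
# This is only a subset of the table above, chosen for illustration.
GROUPS = {
    "a": "áàâ", "c": "ç", "e": "éèêë", "i": "íìîï", "n": "ńñ",
    "o": "óòô", "u": "úùû", "v": "w", "y": "ü", "ä": "æ", "ö": "ø",
}

# Map each lowercase letter to (alphabet position, variant rank).
RANK = {}
for pos, letter in enumerate(BASE):
    RANK[letter] = (pos, 0)
    for var_rank, variant in enumerate(GROUPS.get(letter, ""), start=1):
        RANK[variant] = (pos, var_rank)


def swedish_key(word):
    """Return a three-level key: base letters, then variants, then case."""
    primary, secondary, tertiary = [], [], []
    for ch in word:
        lower = ch.lower()
        if lower in RANK:
            pos, var_rank = RANK[lower]
            primary.append((1, pos))      # letters sort after non-letters
            secondary.append(var_rank)    # base letter before its variants
        else:
            primary.append((0, ord(ch)))  # digits and other characters first
            secondary.append(0)
        tertiary.append(0 if ch == lower else 1)  # small before capital
    return (tuple(primary), tuple(secondary), tuple(tertiary))


words = ["Atlas", "armatur", "Sten", "atlas", "sten", "Armani"]
print(sorted(words, key=swedish_key))
```

With this key, "Armani" comes before "armatur" because case is only consulted when the letters are otherwise identical, while "atlas" comes before "Atlas".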

Comment 4 Ulrich Drepper 2000-08-28 03:55:20 UTC

I've fixed this now.  It required a significant amount of changes to localedef,
and the old LC_COLLATE specification is completely gone.  We are now using the
generic specification, customized according to the information you gave me.
This happens without duplication.  With sort I now get the order you provided.

I don't know when this code will be available in an RPM to try, but please
let me know once you get a chance to try it.

Comment 5 Göran Uddeborg 2000-08-28 10:30:38 UTC

> The next section tells that small letters should indeed be placed BEFORE their
> capital counterparts. ... I was very wrong about that.
> ... everybody I've
> spoken to, including Goeran, didn't know about this).

I stand corrected.  What's slightly unnerving is that this is the source I once
used to learn about the sorting of w and other foreign characters, and I thought I
remembered there wasn't anything about upper versus lower case.  I guess I'm
getting old!

I've reread the section now.  The only thing I can add to what Christian said
concerns the treatment of the ae-character.  The Danish character is considered the
same as a-diaeresis, as just described.  In Latin words, however, the ae-character
is considered two distinct letters written together and is sorted as a + e.  Now
that's a challenge for localedef. :-)  (Seriously, consider it a Danish character;
that will be the common case.)

Comment 6 Jakub Jelinek 2000-09-01 11:10:55 UTC

Ulrich's patches have made it into glibc-2.1.92-14. Thanks.

Comment 7 Christian Rose 2000-10-03 01:43:41 UTC

Now that Bugzilla is up again, it's probably time for a follow-up:

Tested with sort and glibc 2.1.92-14 on a Red Hat 7 system, and it works exactly
as expected, with the rules outlined above.

A big thanks!
