Bug 104892

Summary:	There is a need for more than one sorting strategy in sv locale
Product:	[Retired] Red Hat Linux Beta	Reporter:	Göran Uddeborg <goeran>
Component:	glibc	Assignee:	Jakub Jelinek <jakub>
Status:	CLOSED WONTFIX	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	beta1	CC:	fweimer
Target Milestone:	---	Keywords:	FutureFeature
Target Release:	---
Hardware:	i386
OS:	Linux
Fixed In Version:		Doc Type:	Enhancement
Last Closed:	2004-09-28 06:51:27 UTC	Type:	---

Description Göran Uddeborg 2003-09-23 10:54:13 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030703

Description of problem:
Is there some "upstream" for glibc nowadays?  I tried using glibcbug first, but
my message bounced.

The way to sort a phrase in Swedish depends on the context.  Under one
principle, called the "dictionary principle", phrases are compared letter by
letter and spaces between words are ignored.  That is, no surprise, the
common way to sort a dictionary.  Under the other principle, called the "word
principle", phrases are compared word by word: the second word is taken into
consideration only if the first word is the same in both phrases.  This
is used in phone books, libraries, and other places.  Technically, you could do
this by sorting the space character first, before all letters.
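
The two principles can be illustrated with a minimal Python sketch. It is only an approximation for illustration: the function names are made up here, and it compares by Unicode code point rather than by the real Swedish alphabet (so it would not order å, ä and ö correctly). It uses the three phrases from this report.

```python
def dictionary_key(phrase):
    """Dictionary principle: compare letter by letter, ignoring spaces."""
    return phrase.replace(" ", "").casefold()


def word_key(phrase):
    """Word principle: compare word by word; later words only break ties."""
    return phrase.casefold().split()


phrases = ["apparat", "a priori", "a conto"]

# Dictionary principle: "aconto" < "apparat" < "apriori"
print(sorted(phrases, key=dictionary_key))

# Word principle: the single word "a" sorts before the word "apparat"
print(sorted(phrases, key=word_key))
```

Sorting a list of words element by element has the same effect as treating the space as a character that sorts before every letter.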

The different principles are used in different contexts.  The current locale
definitions that come with glibc apply the dictionary principle.  I believe
this is a good default; in most places where the definition is used it is
appropriate.  But what would be the "correct" way to make it possible to choose
the word principle where THAT is appropriate?

The background for this request is a letter I got from a user working at a
library (Rolf Johansson <rojo>).  They use Linux for their databases.
Their database system, PostgreSQL, leaves collation order to the system's
locale definition.  This means they get dictionary order.  They would need word
order, since that is well established among libraries.

What is the correct thing for him to do?  What kind of patch or program
modification should be made to make it possible?

Version-Release number of selected component (if applicable):
glibc-2.3.1-36

How reproducible:
Always

Steps to Reproduce:
1. cat > apa
a conto
a priori
apparat
^D
2. env LANG=sv_SE sort apa

Actual Results:  
a conto
apparat
a priori


Expected Results:  In most contexts, the order I get.  But in some application
areas, I expect

a conto
a priori
apparat

Additional info:

Defining a new collation order in the locale is obviously one way to do this. 
But I'm uncertain if it is the best way.  What would you suggest?

I don't know if this problem is applicable to other languages too.

Comment 1 Bevan Bennett 2004-03-02 17:24:31 UTC

I was rather baffled by the new sort order as well, but have recently
realized that sort appears to be sorting without regard to
non-alphanumeric characters in non-C locales on the first pass.

So we currently sort to:
aaaaaaa
A and G motor vehicles
abalone
Andersen, Hans Christian
$$$ and no sense
$$$ and sense
Antigone

Is this -really- the specified behavior for UTF-8 locales?

I don't personally know of anyone who wants or expects this behavior.
Can we at least get switches added to sort and join that will
selectively disable this behavior and pay attention to
non-alphanumerics in the sort?
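
The first-pass behavior described above can be approximated with a short Python sketch. This is an assumption-laden illustration, not how glibc implements collation: the helper name is invented, and it simply drops every non-alphanumeric character and folds case before comparing. Under those assumptions it yields the same order as the list above.

```python
def first_pass_key(line):
    """Approximate the first pass: ignore punctuation, spaces and case."""
    return "".join(ch for ch in line if ch.isalnum()).casefold()


lines = [
    "Antigone",
    "$$$ and sense",
    "abalone",
    "A and G motor vehicles",
    "$$$ and no sense",
    "aaaaaaa",
    "Andersen, Hans Christian",
]

# Each line is compared as if only its letters and digits were present,
# e.g. "$$$ and sense" is compared as "andsense".
for line in sorted(lines, key=first_pass_key):
    print(line)
```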

Comment 2 Jakub Jelinek 2004-03-02 17:34:14 UTC

This is not about UTF-8 locales, but about what sorting is common
for various languages.  If you look into a dictionary, you'll see
the order you get.  And it is certainly not something recent, sort
has been behaving like that for a few years already.

As for the original request, I think such non-standard handling
belongs in the applications which need it.

Comment 3 Göran Uddeborg 2004-06-08 20:20:23 UTC

In this particular case that would make the application significantly
more complex.  Today it uses a PostgreSQL database, and functions
like sorting are done by the database.  It is not tempting to have to
redo that in the application.

Currently (or last I heard), they had defined a non-standard locale
instead.  That was deemed to be less complicated.  To me it feels
unfortunate one should have to do that.

Comment 4 Ulrich Drepper 2004-09-28 06:51:27 UTC

If the different sorting order can be expressed in the locale
specification language that localedef accepts (and I think it can, just
give whitespace a high enough priority), then define your own
locale sv_SE@wordorder or so.  This data need not come with glibc,
just put it in a separate package and use localedef at installation
time to create the binary form.  I have no interest in glibc getting
into these kinds of details.  We provide a good default, anything else
is up to specialized "localization" packages.  I'm closing this bug as
WONTFIX since something like this will not get into either the upstream
or the RH glibc package.
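
Once such a locale has been built and installed, an application could select it for collation only. The following Python sketch shows one way this might look; the locale name "sv_SE.UTF-8@wordorder" is hypothetical (taken from the suggestion above) and does not exist unless someone creates it with localedef, and the helper name is made up for the example.

```python
import locale


def sort_with_locale(phrases, locale_name):
    """Sort phrases by the LC_COLLATE rules of the named locale."""
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        # Raises locale.Error if the locale is not installed.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(phrases, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


phrases = ["apparat", "a priori", "a conto"]
try:
    # Hypothetical custom locale; it must be built with localedef first.
    print(sort_with_locale(phrases, "sv_SE.UTF-8@wordorder"))
except locale.Error:
    print("custom locale not installed")
```

Only LC_COLLATE is switched, so messages, number formats and other categories keep whatever locale the application already uses.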