Red Hat Bugzilla – Bug 104892
There is a need for more than one sorting strategy in sv locale
Last modified: 2007-04-18 12:57:43 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030703
Description of problem:
Is there some "upstream" for glibc nowadays? I tried using glibcbug first, but
my message bounced.
The way to sort a phrase in Swedish depends on the context. There is one
principle, called the "dictionary principle", where letter by letter is
compared, and spaces between words are ignored. That is, no surprise, the
common way to sort a dictionary. With the other principle, called the "word
principle", the words are compared with each other, and only if the first word
is the same in both phrases, the second word is taken into consideration. This
is used in phone books, libraries, and other places. Technically, you could do
this by sorting a space first of all, before all letters.
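The two principles can be sketched as sort keys. This is only an illustration: the sample phrases are made up, and plain code-point comparison stands in for the real Swedish alphabet order (å, ä, ö are not handled).

```python
def dictionary_key(phrase):
    # Dictionary principle: ignore spaces and compare letter by letter.
    return phrase.casefold().replace(" ", "")


def word_key(phrase):
    # Word principle: compare word by word; the second word only matters
    # when the first words are equal.
    return phrase.casefold().split()


# Hypothetical sample phrases, chosen so the two principles disagree.
phrases = ["en vas", "ena", "en bil"]

print(sorted(phrases, key=dictionary_key))  # ['ena', 'en bil', 'en vas']
print(sorted(phrases, key=word_key))        # ['en bil', 'en vas', 'ena']
```

Comparing lists of words has the same effect as sorting a space before all letters: "en" ends before "ena" does, so every phrase starting with the word "en" comes first.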
The different principles are used in different contexts. The current locale
definitions that come with glibc apply the dictionary principle. I believe
this is a good default; in most places where the definition is used it is
appropriate. But what would be the "correct" way to make it possible to choose
the word principle where THAT is appropriate?
The background for this request is a letter I got from a user working at a
library (Rolf Johansson <email@example.com>). They use Linux for their databases.
Their database system, PostgreSQL, leaves collation order to the system's
locale definition. This means they get dictionary order. They would need word
order, since that is well established among libraries.
What is the correct thing for him to do? What kind of patch or program
modification should be made to make it possible?
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. cat > apa
2. env LANG=sv_SE sort apa
Expected Results: In most contexts, the order I get. But in some application
areas, I expect
Defining a new collation order in the locale is obviously one way to do this.
But I'm uncertain if it is the best way. What would you suggest?
I don't know if this problem is applicable to other languages too.
I was rather baffled by the new sort order as well, but have recently
realized that sort appears to be sorting without regard to
non-alphanumeric characters in non-C locales on the first pass.
So we currently sort to:
A and G motor vehicles
Andersen, Hans Christian
$$$ and no sense
$$$ and sense
Is this -really- the specified behavior for UTF-8 locales?
I don't personally know of anyone who wants or expects this behavior.
Can we at least get switches added to sort and join that will
selectively disable this behavior and pay attention to
non-alphanumerics in the sort?
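The first-pass behavior described in this comment can be approximated with a sort key that drops everything except letters and digits. This is a rough sketch, not what glibc actually does: real locale collation uses several weighted levels, and the tie-break on the raw line is an assumption.

```python
def first_pass_key(line):
    # Keep only letters and digits and ignore case, approximating the
    # first collation pass; fall back to the raw line to break ties
    # (the tie-break rule is an assumption).
    significant = "".join(ch for ch in line if ch.isalnum()).casefold()
    return (significant, line)


lines = [
    "$$$ and no sense",
    "$$$ and sense",
    "A and G motor vehicles",
    "Andersen, Hans Christian",
]

# Plain code-point order, roughly what LC_ALL=C gives: "$" sorts first.
print(sorted(lines))

# Ignoring non-alphanumerics reproduces the order quoted above.
print(sorted(lines, key=first_pass_key))
```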
This is not about UTF-8 locales, but about what sorting is common
for various languages. If you look into a dictionary, you'll see
the order you get. And it is certainly not something recent, sort
has been behaving like that for a few years already.
As for the original request, I think such non-standard handling
belongs in the applications which need such handling.
In this particular case that would make the application significantly
more complex. It currently uses a PostgreSQL database, and functions
like sorting are handled by the database. It is not tempting to have to
redo that in the application.
Currently (or last I heard), they had defined a non-standard locale
instead. That was deemed to be less complicated. To me it feels
unfortunate one should have to do that.
If the different sorting order can be expressed using the
specification language localedef can provide (and I think it can, just
define a high enough priority to whitespaces), then define your own
locale sv_SE@wordorder or so. This data need not come with glibc,
just put it in a separate package and use localedef at installation
time to create the binary form. I have no interest in glibc getting
into these kinds of details. We provide a good default; anything else
is up to specialized "localization" packages. I'm closing this bug as
WONTFIX since something like this will not get into the upstream nor
RH glibc package.
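Once such a locale has been compiled with localedef, a program can pick it up through the normal collation interface. A minimal sketch, assuming a locale named sv_SE@wordorder has been installed (the name follows the suggestion above; the helper function and the sample phrases are hypothetical):

```python
import locale


def sort_with_locale(items, locale_name):
    # Sort using the collation rules of the named locale. Returns None
    # when that locale is not installed on this system.
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(items, key=locale.strxfrm)
    finally:
        # Restore whatever collation locale was active before.
        locale.setlocale(locale.LC_COLLATE, previous)


phrases = ["en vas", "ena", "en bil"]

# "sv_SE@wordorder" is assumed, not shipped with glibc; this prints None
# unless the custom locale has been built and installed.
print(sort_with_locale(phrases, "sv_SE@wordorder"))
```

The application code stays the same for either principle; only the locale name changes, which is why a separate locale is less intrusive than re-sorting in the application.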