Bug 73394

Summary: default LC_COLLATE should be C
Product: [Retired] Red Hat Linux
Component: distribution
Version: 7.3
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Reporter: seth arnold <sarnold>
Assignee: Bill Nottingham <notting>
QA Contact: Brock Organ <borgan>
CC: ali, ed, jkeating, mitr, rvokal
Doc Type: Bug Fix
Last Closed: 2002-09-04 03:21:14 UTC

Description seth arnold 2002-09-03 23:39:53 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827

Description of problem:
Greetings; I've been surprised by the behavior of some applications (sort and
bash come to mind immediately), for example:
$ touch a A b
$ ls [a-z]
A  a  b

I can understand the desire to be POSIX compliant with respect to honoring
LC_COLLATE. It even makes good sense.

However, I think that many people are expecting the traditional UNIX ASCII-based
character range handling. (It has certainly been traditional ASCII-based for the
eight years I've been using UNIX and UNIX-like systems; I understand it has been
this way for twenty years or longer.)

I suppose the question comes down to "What is Red Hat?" If it is "a Windows
competitor", then I suppose the current behavior is "correct", albeit
surprising. If it is "a UNIX competitor", then I think behaviors such as this
are going to alienate many users -- with the possibility of vast and dire
consequences for people who are moving their software from other UNIX systems to
Red Hat Linux. (Imagine a script running rm [a-z]*, "knowing" that files with
leading lowercase characters are temporary, while leading uppercase characters
are for keeping.)

Searching Bugzilla for the number of bugs related to LC_COLLATE handling will
demonstrate that I am not a lone crackpot -- there are at least a few other
crackpots out there who agree. I'm willing to wager that most of us crackpots
don't mind the applications following LC_COLLATE -- we just think
LC_COLLATE=en_US (or other language-specific setting) by default will surprise
more system administrators than it will please. Setting LC_COLLATE=C in
/etc/sysconfig/i18n will placate all of us who are familiar with "the old way",
and will leave the newfangled fashion available for those who specifically
desire that behavior instead.

Just how many third-party scripts do you think there are that have been using
[a-z] for years? How many versions of scripts will people need to write so that
they can maintain portability to other platforms? (For platforms where [:upper:]
and [:lower:] aren't supported.)
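
A minimal sketch of the character-class form mentioned above, assuming a shell whose globbing supports POSIX bracket classes (which, as noted, not every platform does). The helper name `lower_names` and the scratch directory are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the names in directory $1 that begin with a
# lowercase letter, matched by character class rather than by range.
lower_names() {
    (
        cd "$1" || exit 1
        set -- [[:lower:]]*
        # An unmatched pattern stays literal; print nothing in that case.
        [ -e "$1" ] || exit 0
        echo "$*"
    )
}

# Demonstration with the three files from the report, in a scratch directory.
scratch=$(mktemp -d)
touch "$scratch/a" "$scratch/A" "$scratch/b"
class_result=$(lower_names "$scratch")
echo "$class_result"
rm -rf "$scratch"
```

The class is matched by character type rather than by position in the collation order, so it does not depend on where uppercase letters sort.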

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. touch a A b
2. ls [a-z]

Actual Results:  A  a  b


Expected Results:  a  b

Additional info:

Comment 1 Bill Nottingham 2002-09-04 03:52:27 UTC
No. The sorting order in the locales is the standard order used by people in
that locale for non-computer sorting for much longer than computers have been
around.


Comment 2 Ali-Reza Anghaie 2002-09-04 04:38:48 UTC
I'd consider it an RFE.

I don't know how people were doing the sorts before computers were around but
I'm pretty sure sysadmins weren't writing scripts before computers were around
(or not long before)  ;-) ..

I, for one, would expect "C" or "POSIX" to be the default. And I've seen plenty
of things written expecting as much. Even in RH 7.3 /etc/init.d/innd:

  # INN uses too many un-checked shell scripts
  unset LANG
  unset LC_COLLATE

It's my opinion that people who ~want~ it to behave as LC_COLLATE=en_US (or
whatever locale) can set that explicitly. Like the xinetd startup script does.
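
A small sketch of setting the locale explicitly for one command only, assuming a POSIX shell and the standard sort utility. One caveat worth stating: unsetting LANG and LC_COLLATE has no effect if LC_ALL is set, because LC_ALL takes precedence over both, so the sketch sets LC_ALL instead. The variable name `sorted` is hypothetical.

```shell
#!/bin/sh
# Pin a single command to the C locale. The assignment prefix applies only
# to this invocation of sort; the rest of the shell keeps its own locale.
sorted=$(printf '%s\n' b A a | LC_ALL=C sort)

# In the C locale, sort compares byte values, so uppercase ASCII letters
# come before lowercase ones.
echo "$sorted"
```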

~I~ didn't find a place on the Open Group site which specifies/recommends
LC_COLLATE settings. Hrmm. I just found lots of docs on how locales should
behave but not defaults. { There were some bits about POSIX being the fall-back.. }

In any case, I think Seth has a valid concern. I'm guessing more people will run
into trouble this way (en_US or what-not) and, in some cases, run into trouble
that backups will have to get them out of. I'm also guessing more people will be
unhappy about this when it happens.

One more note... right now Red Hat is targeting the data center of the
enterprise, right? A lot of these places are moving from Solaris, AIX, IRIX,
Hockey-PUX, etc.

At least at Pratt & Whitney, where I currently work, a quick survey of non-Red
Hat boxen shows "C" and "POSIX" more often than not. Even the Cobalt boxes (based
on RH 6.x) are "POSIX"..

I hope Red Hat will re-consider the default behavior. Thanks much, -Ali


Comment 3 Scott R. Godin 2002-09-04 18:52:51 UTC
I agree with the original poster. This is the behaviour I've come to expect from
my work with Perl regexes and other unixen.

If I *wanted* [[:ascii:]] or [[:alpha:]] I'd say so explicitly. I do NOT expect
[a-z] to arbitrarily include [A-Z] unless I ask *explicitly* for it. 

Setting the default LC_COLLATE="C" in /etc/sysconfig/i18n only makes sense.
Those who WANT the non-standard-unix behaviour can easily set this otherwise,
but for those of us who have come to expect a unix-like OS to behave like one,
having this not be the default (and not being warned about it) is a considerable
annoyance.

I, too, hope that Red Hat will reconsider the default behaviour.

Comment 4 seth arnold 2002-09-04 18:59:18 UTC
Bill, that is the problem: using "non-computer sorting" on a computer is highly
surprising.

Thanks.

Comment 5 Jesse Keating 2002-09-04 21:42:01 UTC
Wow, I would be really upset if rm [a-z]* deleted stuff that started w/ an
uppercase letter.  This would really ruin the expected results.  What is the
technical reasoning for having [a-z] include [A-Z] as well?

Comment 6 Miloslav Trmac 2002-09-05 01:01:17 UTC
Go read the standards. Range expressions (such as [a-z]) are explicitly
undefined outside the C (POSIX) locale (SUSv3, XBD6 (Base Definitions), section
9.3.5 RE Bracket Expression, paragraph 7). This discussion has already been
beaten to death at the Austin working group and elsewhere.

If you are a human, LC_COLLATE=en_US makes sense. If you are a script, set
LC_ALL=C at the beginning and be happy.
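
A minimal sketch of that advice, assuming a POSIX shell. The helper name `list_lowercase` and the scratch directory are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Pin the locale before any globbing, so that range expressions such as
# [a-z] use the byte-order (ASCII) semantics of the C locale.
LC_ALL=C
export LC_ALL

# Hypothetical helper: print the names in directory $1 that start with a
# lowercase ASCII letter, separated by spaces.
list_lowercase() {
    (
        cd "$1" || exit 1
        set -- [a-z]*
        # An unmatched pattern stays literal; print nothing in that case.
        [ -e "$1" ] || exit 0
        echo "$*"
    )
}

# Demonstration with the three files from the report, in a scratch directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/a" "$demo_dir/A" "$demo_dir/b"
demo_result=$(list_lowercase "$demo_dir")
echo "$demo_result"
rm -rf "$demo_dir"
```

Exporting LC_ALL also passes the setting to child processes such as sort or ls, which is usually what a script wants.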

Comment 7 Jesse Keating 2002-09-05 01:47:59 UTC
I've always considered bash to be a constant script, with me feeding it line by
line.  Since the bash prompt doesn't take commands such as: "Please go and find
every html file and convert the permissions so that the world can read, group
can read, and the owner can read and write.", I don't expect other aspects of
the bash prompt to be 'human' as well.  As I stated above, I, along with most of
the development community I am associated with, expect things to be sorted as
the C or POSIX standards state.  [a-z] does _not_ include A-Z.  With all the work
that Red Hat does to preserve standards, I really feel that the ball was dropped
on this one.  It just adds one more step that one has to do to make Red Hat a
sane and useable OS again.