Bug 120933

Summary:	tr not multibyte aware at all
Product:	[Fedora] Fedora	Reporter:	Victor Ashik <victor>
Component:	coreutils	Assignee:	Tim Waugh <twaugh>
Status:	CLOSED WONTFIX	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	rawhide	CC:	gajownik, mfabian, mitr, nscheibl
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2005-05-09 12:19:04 UTC	Type:	---

Description Victor Ashik 2004-04-15 13:01:03 UTC

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Victor Ashik 2004-04-15 13:09:18 UTC

Description of problem:

Some of the locale definitions are possibly incorrect: tr cannot
transliterate from [:lower:] to [:upper:].

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:
1. export LANG=ru_RU.UTF-8
2. date | tr '[:lower:]' '[:upper:]'
3. export LANG=uk_UA.UTF-8
4. date | tr '[:lower:]' '[:upper:]'

Actual results:

Non-Latin letters remain lowercase.

Expected results:

Non-Latin letters are converted to uppercase.

Additional info:

Sorry for pressing commit by mistake ;-)

Comment 2 Victor Ashik 2004-04-23 16:53:55 UTC

It seems that this is not a problem in the glibc locale definitions; it is a
problem in tr.

I found an explanation here:

http://mail.nl.linux.org/linux-utf8/2003-08/msg00224.html

That's why I am trying to change the component for this bug.

Comment 3 Victor Ashik 2004-04-23 17:01:26 UTC

Here are the problematic lines in tr.c from coreutils-4.5.3 (RHEL3):

    if (class_s1 == UL_LOWER && class_s2 == UL_UPPER)
      {
        for (i = 0; i < N_CHARS; i++)
          if (ISLOWER (i))
            xlate[i] = toupper (i);
      }
    else if (class_s1 == UL_UPPER && class_s2 == UL_LOWER)
      {
        for (i = 0; i < N_CHARS; i++)
          if (ISUPPER (i))
            xlate[i] = tolower (i);
      }

The problem is the use of tolower() and toupper().
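
To illustrate why a per-byte table like this cannot upper-case Cyrillic text, here is a minimal sketch. It is not the coreutils source: build_upper_table and translate_bytes are names made up for the example, N_CHARS is assumed to be UCHAR_MAX + 1 as in tr.c, and nothing here calls setlocale(), so the code runs in the "C" locale. In UTF-8 every non-ASCII letter is a sequence of bytes >= 0x80, and none of those bytes is a lowercase letter on its own, so the table leaves them alone.

```c
#include <ctype.h>
#include <limits.h>
#include <stddef.h>

#define N_CHARS (UCHAR_MAX + 1)   /* assumed to match tr.c: one slot per byte */

/* Hypothetical helper: fill xlate[] the same way the quoted loop does. */
static void build_upper_table(unsigned char xlate[N_CHARS])
{
    for (int i = 0; i < N_CHARS; i++)
        xlate[i] = (unsigned char) (islower(i) ? toupper(i) : i);
}

/* Hypothetical helper: translate a buffer in place, one byte at a time. */
static void translate_bytes(unsigned char *buf, size_t len,
                            const unsigned char xlate[N_CHARS])
{
    for (size_t i = 0; i < len; i++)
        buf[i] = xlate[buf[i]];
}
```

Run over the ASCII string "date" this produces "DATE"; run over the UTF-8 bytes of a Cyrillic word the buffer comes back unchanged, which matches the "Actual results" reported above.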

Comment 4 Tim Waugh 2004-04-30 10:44:00 UTC

Thanks.  Actually there are other utilities in coreutils (like sort)
that are still using toupper/tolower for things as well.

Comment 5 Tim Waugh 2004-12-15 15:40:42 UTC

Actually sort seems to be fine because even though it mistakenly uses toupper()
on a wchar instead of towupper, it so happens that collation usually (always?)
ignores case in non-C locales.

Anyway, I've fixed the small issue in sort.

tr is trickier, since it knows nothing about non-ASCII encodings.

Comment 6 Miloslav Trmač 2005-04-26 21:09:07 UTC

I have no idea how to write Unicode-aware tr "well".

Consider
        tr '!"-@' '"-@!'
where @ is U+E007F.  The range has 917599 code points,
among those roughly 26238 (probably fewer) are currently defined
characters.  That's an awfully large table.  If you sacrifice the
O(1) conversions and decide to do that by parsing the translation
tables for each input character, you still have to know how many
characters there are in the range (and do code points reserved
for surrogates count as characters?).
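
The arithmetic behind these numbers can be checked with a few lines of C. This sketch assumes the set runs from '!' (U+0021) through U+E007F and that a flat table would store one 32-bit entry per code point. The helper names and the entry size are choices made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define SURROGATE_FIRST 0xD800u
#define SURROGATE_LAST  0xDFFFu

/* Number of code points in the inclusive range [first, last].
   Assumes first <= last. */
static uint32_t range_size(uint32_t first, uint32_t last)
{
    return last - first + 1;
}

/* Same count, minus any part of U+D800..U+DFFF that the range covers. */
static uint32_t range_size_no_surrogates(uint32_t first, uint32_t last)
{
    uint32_t total = range_size(first, last);
    uint32_t lo = first > SURROGATE_FIRST ? first : SURROGATE_FIRST;
    uint32_t hi = last < SURROGATE_LAST ? last : SURROGATE_LAST;
    if (lo <= hi)
        total -= hi - lo + 1;
    return total;
}

/* Bytes needed for a flat table with one 32-bit entry per code point
   (the entry size is an assumption). */
static size_t flat_table_bytes(uint32_t entries)
{
    return (size_t) entries * sizeof(uint32_t);
}
```

range_size(0x21, 0xE007F) gives 917599, and a flat table with 4-byte entries for that many code points would take 3670396 bytes, about 3.5 MiB. Leaving out the 2048 surrogate code points reduces the count to 915551.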

Now consider non-Unicode multibyte encodings (CJK...).
The set of characters in that range is smaller, but there
doesn't seem to be any way to enumerate them in "encoding-lexicographic"
order (that's what tr currently does for single-byte encodings) without
knowing the structure of the encoding, other than attempting to convert
every Unicode character to that encoding, sorting the results lexicographically
and using a subsequence as the range.

Similar issues arise with character classes.

The only "reasonable" way to define the behavior I can see is
requiring that each range/character class has its counterpart
in the other set, using the same number of characters.
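
One way to read that rule is as a shape check done before any translation table is built. The sketch below assumes that "counterpart" means the element at the same position in the other set and that the two elements must be of the same kind. The types and the function name are hypothetical, and the caller is assumed to know how many characters each range or class expands to.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one element of a tr set. */
enum elem_kind { ELEM_CHAR, ELEM_RANGE, ELEM_CLASS };

struct set_elem {
    enum elem_kind kind;
    unsigned long count;   /* number of characters the element expands to */
};

/* Return true if every element of set 1 has a counterpart of the same
   kind and the same size at the same position in set 2. */
static bool sets_pair_up(const struct set_elem *s1, size_t n1,
                         const struct set_elem *s2, size_t n2)
{
    if (n1 != n2)
        return false;
    for (size_t i = 0; i < n1; i++) {
        if (s1[i].kind != s2[i].kind)
            return false;
        if (s1[i].count != s2[i].count)
            return false;
    }
    return true;
}
```

Under these assumptions a pair of equally sized classes such as '[:lower:]' and '[:upper:]' passes, while the rotated example above is rejected, because a single character in one set lines up with a range in the other.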

Comment 7 Tim Waugh 2005-05-09 12:19:04 UTC

Hmm, complicated. :-(

Perhaps best to leave this to upstream then.

Comment 8 Tim Waugh 2006-02-28 11:34:20 UTC

*** Bug 183332 has been marked as a duplicate of this bug. ***