120933 – tr not multibyte aware at all


Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	coreutils
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Tim Waugh
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	183332
Depends On:
Blocks:

Reported:	2004-04-15 13:01 UTC by Victor Ashik
Modified:	2015-05-10 23:40 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2005-05-09 12:19:04 UTC
Type:	---
Embargoed:
Dependent Products:

Description Victor Ashik 2004-04-15 13:01:03 UTC

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Victor Ashik 2004-04-15 13:09:18 UTC

Description of problem:

Some of the locale definitions are possibly incorrect: tr cannot
transliterate from [:lower:] to [:upper:].

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:
1. export LANG=ru_RU.UTF-8
2. date | tr '[:lower:]' '[:upper:]'
3. export LANG=uk_UA.UTF-8
4. date | tr '[:lower:]' '[:upper:]'

Actual results:

Non-Latin letters remain lower case.

Expected results:

Non-Latin letters converted to upper case.

Additional info:

Sorry for pressing commit by mistake ;-)

Comment 2 Victor Ashik 2004-04-23 16:53:55 UTC

It seems that this is not a problem in the glibc locale data; it is a
problem in tr.

I found an explanation here:

http://mail.nl.linux.org/linux-utf8/2003-08/msg00224.html

That's why I am trying to change the component of this bug.

Comment 3 Victor Ashik 2004-04-23 17:01:26 UTC

Here are the problematic lines in tr.c from coreutils-4.5.3 (RHEL3):

              if (class_s1 == UL_LOWER && class_s2 == UL_UPPER)
                {
                  for (i = 0; i < N_CHARS; i++)
                    if (ISLOWER (i))
                      xlate[i] = toupper (i);
                }
              else if (class_s1 == UL_UPPER && class_s2 == UL_LOWER)
                {
                  for (i = 0; i < N_CHARS; i++)
                    if (ISUPPER (i))
                      xlate[i] = tolower (i);
                }

The problem is in the usage of tolower() and toupper().
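
To illustrate why the byte-indexed table in the tr.c excerpt cannot upper-case Cyrillic text in UTF-8, here is a minimal sketch that rebuilds the same kind of table and applies it byte by byte. The function names build_lower_to_upper and translate_bytes are made up for this illustration and are not from coreutils. The sketch never calls setlocale(), so it runs in the "C" locale; as far as I can tell, the bytes of a multibyte UTF-8 sequence are not classified as lower-case letters by islower() in a UTF-8 locale either, so the outcome should be the same there.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define N_CHARS 256  /* one table slot per byte value, as in the excerpt */

/* Hypothetical helper: fill xlate the way the quoted loop does.
   Start from the identity mapping, then store toupper() for every
   byte value that islower() accepts. */
static void build_lower_to_upper(unsigned char xlate[N_CHARS])
{
    for (int i = 0; i < N_CHARS; i++)
        xlate[i] = (unsigned char) i;
    for (int i = 0; i < N_CHARS; i++)
        if (islower(i))
            xlate[i] = (unsigned char) toupper(i);
}

/* Hypothetical helper: translate a buffer in place, one byte at a time.
   A multibyte character is never seen as a unit here. */
static void translate_bytes(unsigned char *buf, size_t len,
                            const unsigned char xlate[N_CHARS])
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        buf[i] = xlate[buf[i]];
}
```

Feeding it the bytes 'a', 0xD1, 0x8F, 'z' (0xD1 0x8F is the UTF-8 encoding of U+044F, CYRILLIC SMALL LETTER YA) upper-cases the two ASCII letters and leaves the two Cyrillic bytes untouched, which matches the behaviour reported above. A fix would have to decode whole characters first (for example with mbrtowc() and towupper()) instead of indexing a 256-entry table.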

Comment 4 Tim Waugh 2004-04-30 10:44:00 UTC

Thanks.  Actually there are other utilities in coreutils (like sort)
that are still using toupper/tolower for things as well.

Comment 5 Tim Waugh 2004-12-15 15:40:42 UTC

Actually sort seems to be fine because even though it mistakenly uses toupper()
on a wchar instead of towupper, it so happens that collation usually (always?)
ignores case in non-C locales.

Anyway, I've fixed the small issue in sort.

tr is trickier, since it knows nothing about non-ASCII encodings.

Comment 6 Miloslav Trmač 2005-04-26 21:09:07 UTC

I have no idea how to write Unicode-aware tr "well".

Consider
        tr '!"-@' '"-@!'
where @ is U+E007F.  Each set covers 917599 code points,
among those roughly 26238 (probably fewer) are currently defined
characters.  That's an awfully large table.  If you sacrifice the
O(1) conversions and decide to do that by parsing the translation
tables for each input character, you still have to know how many
characters there are in the range (and do code points reserved
for surrogates count as characters?).

Now consider non-Unicode multibyte encodings (CJK...).
The set of characters in that range is smaller, but there
doesn't seem to be any way to enumerate them in "encoding-lexicographic"
order (that's what tr currently does for single-byte encodings) without
knowing the structure of the encoding, other than attempting to convert
every Unicode character to that encoding, sorting the results lexicographically
and using a subsequence as the range.

Similar issues arise with character classes.

The only "reasonable" way to define the behavior I can see is
requiring that each range/character class has its counterpart
in the other set, using the same number of characters.
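
As a rough sketch of the "no big table" alternative mentioned above: keep each set as a short list of code point ranges and, for every input character, compute its position in the first set and then look up the code point at the same position in the second set. The names cp_range, set_size and translate_cp are invented for this illustration. The sketch assumes both sets have the same number of elements (the restriction suggested above), and it counts surrogate code points like any others, which is one of the open questions rather than a decision.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: an inclusive run of code points,
   e.g. {0x22, 0xE007F} for the range "-@ in the example. */
struct cp_range { uint32_t lo, hi; };

/* Total number of code points covered by a set of ranges.
   Assumption: surrogates are counted like ordinary code points. */
static uint64_t set_size(const struct cp_range *set, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t) set[i].hi - set[i].lo + 1;
    return total;
}

/* Map c from set1 to the code point at the same position in set2.
   Returns c unchanged if c is not in set1 or if set2 is too short.
   Cost is linear in the number of ranges, not O(1) per character. */
static uint32_t translate_cp(uint32_t c,
                             const struct cp_range *set1, size_t n1,
                             const struct cp_range *set2, size_t n2)
{
    assert(set1 != NULL && set2 != NULL);
    uint64_t index = 0;
    size_t i;
    for (i = 0; i < n1; i++) {
        if (c >= set1[i].lo && c <= set1[i].hi) {
            index += c - set1[i].lo;   /* position of c inside set1 */
            break;
        }
        index += (uint64_t) set1[i].hi - set1[i].lo + 1;
    }
    if (i == n1)
        return c;                      /* c is not in set1 */
    for (size_t j = 0; j < n2; j++) {
        uint64_t len = (uint64_t) set2[j].hi - set2[j].lo + 1;
        if (index < len)
            return set2[j].lo + (uint32_t) index;
        index -= len;
    }
    return c;                          /* set2 shorter than set1 */
}
```

With set1 = '!' followed by U+0022..U+E007F and set2 = U+0022..U+E007F followed by '!', set_size() reports 917599 for both sets and the two ranges per set take a few bytes of memory instead of a table with one entry per code point. This does not address character classes or non-Unicode multibyte encodings, where the order of the characters inside a range is the hard part.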

Comment 7 Tim Waugh 2005-05-09 12:19:04 UTC

Hmm, complicated. :-(

Perhaps best to leave this to upstream then.

Comment 8 Tim Waugh 2006-02-28 11:34:20 UTC

*** Bug 183332 has been marked as a duplicate of this bug. ***
