Bug 120933
Summary: | tr not multibyte aware at all | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Victor Ashik <victor> |
Component: | coreutils | Assignee: | Tim Waugh <twaugh> |
Status: | CLOSED WONTFIX | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | rawhide | CC: | gajownik, mfabian, mitr, nscheibl |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2005-05-09 12:19:04 UTC | Type: | --- |
Description
Victor Ashik
2004-04-15 13:01:03 UTC
Description of problem:
Some of the locale definitions are possibly incorrect: cannot transliterate from [:lower:] to [:upper:].

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. export LANG=ru_RU.UTF-8
2. date | tr '[:lower:]' '[:upper:]'
3. export LANG=uk_UA.UTF-8
4. date | tr '[:lower:]' '[:upper:]'

Actual results:
Non-Latin letters remain in lower case.

Expected results:
Non-Latin letters converted to upper case.

Additional info:

Sorry for pressing commit by mistake ;-)

It seems that this is not a problem in the glibc locale; it is a problem in tr. I found an explanation here:
http://mail.nl.linux.org/linux-utf8/2003-08/msg00224.html
That's why I am trying to change the component in this bug info.

Here are the problematic code lines in tr.c from coreutils-4.5.3 (RHEL3), lines 1969-1980:

    if (class_s1 == UL_LOWER && class_s2 == UL_UPPER)
      {
        for (i = 0; i < N_CHARS; i++)
          if (ISLOWER (i))
            xlate[i] = toupper (i);
      }
    else if (class_s1 == UL_UPPER && class_s2 == UL_LOWER)
      {
        for (i = 0; i < N_CHARS; i++)
          if (ISUPPER (i))
            xlate[i] = tolower (i);
      }

The problem is in the usage of tolower() and toupper().

Thanks. Actually there are other utilities in coreutils (like sort) that are still using toupper/tolower for things as well.

Actually sort seems to be fine: even though it mistakenly uses toupper() on a wchar instead of towupper(), it so happens that collation usually (always?) ignores case in non-C locales. Anyway, I've fixed the small issue in sort. tr is trickier, since it knows nothing about non-ASCII encodings.

I have no idea how to write a Unicode-aware tr "well". Consider tr '!"-@' '"-@!' where @ is U+E007F. The range has 917599 code points; among those, roughly 26238 (probably fewer) are currently defined characters. That's an awfully large table.
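
For a sense of scale, here is a minimal sketch of the arithmetic behind that figure. It assumes that set1 spans '!' (U+0021) through U+E007F and that each table entry stores a 4-byte code point. Both helper names are hypothetical and do not come from tr.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of code points in the inclusive range [first, last]. */
static size_t range_count(uint32_t first, uint32_t last)
{
    return (size_t)(last - first) + 1;
}

/* Bytes needed for a dense lookup table with one entry per code point
   in [first, last].  An entry_size of 4 assumes each entry holds a
   full 32-bit code point (an assumption, not something tr does). */
static size_t dense_table_bytes(uint32_t first, uint32_t last,
                                size_t entry_size)
{
    return range_count(first, last) * entry_size;
}
```

Under these assumptions, range_count(0x21, 0xE007F) is 917599 and dense_table_bytes(0x21, 0xE007F, 4) is 3670396 bytes, about 3.5 MiB, compared with the N_CHARS-entry xlate array in the code quoted above.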
If you sacrifice the O(1) conversions and decide to do that by parsing the translation tables for each input character, you still have to know how many characters there are in the range (and do code points reserved for surrogates count as characters?).

Now consider non-Unicode multibyte encodings (CJK...). The set of characters in that range is smaller, but there doesn't seem to be any way to enumerate them in "encoding-lexicographic" order (that's what tr currently does for single-byte encodings) without knowing the structure of the encoding, other than attempting to convert every Unicode character to that encoding, sorting the results lexicographically and using a subsequence as the range. Similar issues arise with character classes.

The only "reasonable" way to define the behavior I can see is requiring that each range/character class has its counterpart in the other set, using the same number of characters.

Hmm, complicated. :-( Perhaps best to leave this to upstream then.

*** Bug 183332 has been marked as a duplicate of this bug. ***
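
The per-character alternative discussed above (giving up the O(1) table and consulting the set definitions for every input character) could look roughly like the sketch below. It is only an illustration under stated assumptions: input is already decoded to Unicode code points, surrogate code points are counted like any other, and the names range_map, translate_cp and rotate_example are hypothetical rather than taken from coreutils.

```c
#include <stddef.h>
#include <stdint.h>

/* One contiguous run of set1 mapped onto an equally long run of set2. */
struct range_map {
    uint32_t from_first;  /* first code point of the source run */
    uint32_t from_last;   /* last code point of the source run (inclusive) */
    uint32_t to_first;    /* code point that from_first maps to */
};

/* The example from this report, tr '!"-@' '"-@!' with @ = U+E007F,
   expressed as two runs (assumes surrogates are counted). */
static const struct range_map rotate_example[2] = {
    { 0x21,    0xE007E, 0x22 },  /* U+0021..U+E007E each shift up by one */
    { 0xE007F, 0xE007F, 0x21 },  /* the last character wraps around to '!' */
};

/* Translate one code point by scanning the runs: O(number of runs)
   per character instead of a single table lookup. */
static uint32_t translate_cp(const struct range_map *maps, size_t n,
                             uint32_t cp)
{
    for (size_t i = 0; i < n; i++) {
        if (cp >= maps[i].from_first && cp <= maps[i].from_last)
            return maps[i].to_first + (cp - maps[i].from_first);
    }
    return cp;  /* not in set1: pass through unchanged */
}
```

In this sketch the whole 917599-character mapping collapses to two runs, so translate_cp(rotate_example, 2, 0x41) yields 0x42 and code points outside set1 pass through unchanged. It does not address the harder questions raised in the thread, such as enumeration order in non-Unicode multibyte encodings or character classes.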