Bug 88868 - less using latin1 charset doesn't display binary files correctly
Summary: less using latin1 charset doesn't display binary files correctly
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: less
Version: 9
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Karsten Hopp
QA Contact: Mike McLean
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2003-04-15 03:32 UTC by System Detection Staff
Modified: 2007-04-18 16:53 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-03-29 09:09:03 UTC
Embargoed:



Description System Detection Staff 2003-04-15 03:32:51 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

Description of problem:
When less is configured to use the latin1 character set via the LESSCHARSET
environment variable (or -Klatin1), several high-bit characters are displayed
as "&" instead of something more useful.  I accept the need for foreign language
support, but latin1 seems pretty standard, and regardless it should display
*something* different for different characters instead of substituting one
constant character.

While unsetting LESSCHARSET looks like a potential workaround, it prevents
the display of files containing the European character set.

Substituting iso8859 for latin1 gives, as you would expect, exactly the same
result.

This was tried both running in an xterm and running on the console.

Note that less 381 (from GNU) and 358+iso247 (RH 7.1) both do the right thing.

Version-Release number of selected component (if applicable):
less-378-7

How reproducible:
Always

Steps to Reproduce:
1. echo -e 'A\201\nA\253\nA&' | /usr/bin/less -Klatin1


Actual Results:  The second and third lines appear the same "A&"


Expected Results:  All three lines should appear differently, either like:

----
A<81>
A<AB>
A&
----
or
----
A<81>
A«
A&
----
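
For illustration only, the second expected rendering (control byte 0x81 shown
as hex, byte 0xAB shown as the Latin-1 character «) can be sketched in Python.
This is not how less implements it: the helper name render_latin1 and the rule
"anything outside the printable ISO-8859-1 ranges becomes <XX>" are assumptions
made for this sketch.

```python
def render_latin1(data: bytes) -> str:
    """Render bytes the way the report expects under a latin1 charset.

    Hypothetical helper, not code from less: newline is kept, printable
    ISO-8859-1 bytes (0x20-0x7E and 0xA0-0xFF) are shown as characters,
    and every other byte is shown as <XX> in uppercase hex.
    """
    out = []
    for b in data:
        if b == 0x0A or 0x20 <= b <= 0x7E or 0xA0 <= b <= 0xFF:
            out.append(bytes([b]).decode("latin-1"))
        else:
            out.append(f"<{b:02X}>")
    return "".join(out)


# Same bytes as in the steps to reproduce: octal 201 is 0x81, octal 253 is 0xAB.
sample = b"A\201\nA\253\nA&\n"
lines = render_latin1(sample).splitlines()

# ascii() keeps the printout safe on terminals that cannot show «.
print(ascii(lines))
```

The point of the sketch is only that the three input lines map to three
distinct output lines, which is what the report asks for.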

Additional info:

Comment 1 David Nečas 2003-04-19 22:27:49 UTC
I wouldn't say some 8bit characters display as &, but all 8bit characters
display as &. I have LC_CTYPE=cs_CZ, Latin2 fonts in terminal,
LESSCHARSET=iso8859. Everything displays Latin2 characters correctly (except less).

echo -e $(echo \\{2{4,5,6,7},3{0,1,2,3,4,5,6,7}}{0,1,2,3,4,5,6,7} | tr -d ' ') 

prints a list of Latin2 characters (probably not reasonably pasteable here),

echo -e $(echo \\{2{4,5,6,7},3{0,1,2,3,4,5,6,7}}{0,1,2,3,4,5,6,7} | tr -d ' ') |
less

prints a list of 96 &.

Unsetting LESSCHARSET has an interesting effect. It makes *some* characters
display properly, apparently those that are also present in Latin1;
Latin2-specific characters are still shown as &.
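
As a rough cross-check of the observation above, the same 96 byte values can be
built in Python and compared between the two character sets. This is only a
sketch using Python's standard codecs; the variable names are made up for the
example, and it says nothing about how less itself decides what to display.

```python
# Octal 240..377 is 0xA0..0xFF, the 96 byte values the brace expansion produces.
high_bytes = bytes(range(0o240, 0o400))

# What a terminal with Latin2 fonts is expected to show for those bytes.
latin2_text = high_bytes.decode("iso-8859-2")

# Byte values that map to the same character in Latin1 and Latin2.
shared = [
    b for b in high_bytes
    if bytes([b]).decode("iso-8859-1") == bytes([b]).decode("iso-8859-2")
]

print(len(high_bytes), "bytes,", len(shared), "of them identical in Latin1 and Latin2")
```

For example, 0xD3 and 0xF3 (Ó and ó) sit at the same position in both sets,
while 0xB1 is ± in Latin1 but ą in Latin2. That is consistent with only the
shared characters surviving when LESSCHARSET is unset, though it does not by
itself explain the behaviour.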


Comment 2 Tomasz Kepczynski 2003-04-30 06:28:10 UTC
I have the same problem with the Polish locale pl_PL as well. All Polish
diacritical marks except Ó and ó are displayed incorrectly.

Comment 3 Dale Stimson 2003-05-04 19:46:59 UTC
I'm running with LANG="en_US.UTF-8".
I see this issue all the time with the man command (since less is the default
pager) for characters that could translate as enclosing single quotes.  An
example is "man modprobe", for which with the default setting of 
LESSCHARSET=latin1 one sees:
      -k, --autoclean
              Set  &&&autoclean&&&  on loaded modules.
and with LESSCHARSET undefined:
       -k, --autoclean
              Set  âautocleanâ  on loaded modules.

The actual characters given to less for the relevant part of this (per "od -c") are:

S   e   t         342 200 231   a   u   t   o   c   l   e   a   n 342 200 231  
        o   n 
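
The three octal bytes in that dump can be checked quickly in Python. This is
just a sketch with made-up variable names: it decodes the same bytes once as
UTF-8 and once as Latin-1 to show why a stray "â" appears when the stream is
treated as Latin-1.

```python
# The byte sequence reported by "od -c" around the word "autoclean".
raw = bytes([0o342, 0o200, 0o231])

# Read as UTF-8, the three bytes form a single character,
# U+2019 RIGHT SINGLE QUOTATION MARK.
as_utf8 = raw.decode("utf-8")

# Read as Latin-1, each byte is its own character: "â" (0xE2) followed by
# two C1 control characters (0x80 and 0x99) that have no visible glyph.
as_latin1 = raw.decode("latin-1")

print(ascii(as_utf8), ascii(as_latin1))
```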

 


Comment 4 Victor Ashik 2003-05-06 15:27:37 UTC
The bug is introduced by the patches less-378+iso247-20030108.diff and/or
less-378-multibyte.patch; if less-378 is compiled without those patches,
everything looks OK.

Comment 5 Jungshik Shin 2003-09-01 06:48:24 UTC
Re comment #3, if your locale is en_US.UTF-8 (as opposed to en_US.ISO8859-1),
you're not supposed to set LESSCHARSET to latin1. As your 'od -c' output shows,
you're feeding a UTF-8 stream to less while 'lying' to it, by setting
LESSCHARSET to latin1, that the stream is ISO-8859-1.

If you need to view files that are genuinely in ISO-8859-1, you have to filter
them through 'iconv' first (i.e. 'iconv -f ISO-8859-1 -t UTF-8 | less').
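
What that iconv filter does to the data can be sketched in Python. The function
name latin1_to_utf8 is hypothetical and the sketch uses Python's built-in
codecs rather than iconv itself; it only illustrates the byte-level effect of
the conversion.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 (hypothetical helper).

    Every byte is a valid ISO-8859-1 character, so decoding cannot fail;
    bytes above 0x7F become two-byte UTF-8 sequences.
    """
    return data.decode("iso-8859-1").encode("utf-8")


# 0xAB is "«" in ISO-8859-1; in UTF-8 it is the two bytes 0xC2 0xAB.
converted = latin1_to_utf8(b"A\xab\n")
print(converted)
```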

This is not to say that less (in RH 9.0) is not 'buggy'. It is confusing to
end-users because its behavior is controlled by multiple factors: LESSCHARSET,
the current locale setting, and who knows what else. IMO, it should IGNORE
LESSCHARSET/JLESSCHARSET on POSIX-compliant systems (Linux with glibc 2.2 or
later is one of them) where nl_langinfo(CODESET) is supported.

Moreover, the last time I checked, its way of determining the current codeset
was buggy. Priority should be given in the following order, but it gave a
higher priority to LANG, IIRC.

  1. nl_langinfo(CODESET)
  2. LC_ALL
  3. LC_CTYPE
  4. LANG 
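
The order listed above could look roughly like this in Python. This is a
sketch of the proposed priority, not the logic less actually uses: the function
name pick_codeset, passing the nl_langinfo result in as a parameter, and the
rule that the first non-empty variable wins even when it names no codeset are
all assumptions of the example.

```python
def pick_codeset(env, langinfo_codeset=None):
    """Choose a codeset using the priority order proposed in this comment.

    Hypothetical helper. `env` is a mapping such as os.environ, and
    `langinfo_codeset` stands in for the result of nl_langinfo(CODESET)
    when the platform provides it.
    """
    # 1. nl_langinfo(CODESET), if available, wins outright.
    if langinfo_codeset:
        return langinfo_codeset

    # 2-4. Otherwise the first non-empty variable decides.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var, "")
        if value:
            # Locale names look like language_TERRITORY.CODESET@modifier.
            if "." not in value:
                return None  # e.g. "C" or "cs_CZ": no explicit codeset
            return value.split(".", 1)[1].split("@", 1)[0]
    return None


print(pick_codeset({"LANG": "en_US.UTF-8", "LC_CTYPE": "cs_CZ.ISO-8859-2"}))
```

On a POSIX system the second argument could come from
locale.nl_langinfo(locale.CODESET) after calling
locale.setlocale(locale.LC_CTYPE, ""); it is left as a plain parameter here so
the sketch does not depend on the locales installed on the machine.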





> but latin1 seems pretty standard 

  This is rather a Western-Euro-centric point of view :-) What less needs is
a 'raw' or 'binary' view option.





Comment 6 Stepan Kasal 2003-11-04 09:05:59 UTC
Re comment #4, the bug is produced by less-378-multibyte.patch.
When I recompiled less without it, the problem disappeared.
When I recompiled less from rawhide, with all the patches, the character `&' changed to `6'; I gather it's a value from a random spot in memory.

Comment 7 Karsten Hopp 2004-03-29 09:09:03 UTC
I've updated to less 381 and dropped the multibyte patch, so this 
report can be closed 

