155382 – unexpand: non-ASCII data corruption

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	coreutils
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Tim Waugh
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-04-19 18:45 UTC by Ville Skyttä
Modified:	2007-11-30 22:11 UTC (History)
CC List:	0 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2005-04-21 12:29:26 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Test file (38 bytes, application/octet-stream) 2005-04-20 16:36 UTC, Ville Skyttä

Description Ville Skyttä 2005-04-19 18:45:40 UTC

In coreutils-5.2.1-44 (and earlier versions), unexpand corrupts non-ASCII
characters when fed e.g. ISO-8859-1 text on a UTF-8 system.

  $ echo $LANG
  en_US.UTF-8
  $ echo Skyttä \
  | iconv -f utf-8 -t iso-8859-1 \
  | unexpand \
  | iconv -f iso-8859-1 -t utf-8
  SkyttS

expand(1) does not seem to suffer from this.

Comment 1 Tim Waugh 2005-04-20 08:25:33 UTC

Why are you converting charset in the first place?  Doesn't expand work in
UTF-8?  If so, that's the real bug.

Comment 2 Ville Skyttä 2005-04-20 16:36:08 UTC

Created attachment 113419
Test file

The charset conversion here is only there to provide a one-liner that easily
reproduces the bug.

The full practical reproducer is:

- I have some source code that contains non-ASCII ISO-8859-1 accented 
  characters.  ISO-8859-1 is fine for these files, and must not be changed.
- I want to convert spaces to tabs in those files using unexpand.
- When I run unexpand on these files, the spaces get converted to tabs as 
  expected, but as a side effect, all the ISO-8859-1 accented chars turn into
  gibberish.

Please also note that it's unexpand(1) that has the bug, not expand(1), which
seems to be doing the right thing.

Try it out with the attached file:

  gunzip 155382.txt.gz
  unexpand 155382.txt > test.out
  # compare 155382.txt and test.out, they should be equivalent, but are not
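
One way to make that comparison, as a rough sketch: since unexpand is only
expected to change spaces and tabs, removing both from each file and comparing
what is left byte by byte should show no difference. The helper name
same_except_blanks is made up for illustration; LC_ALL=C is set only so that
tr operates on raw bytes regardless of the current locale.

```shell
# same_except_blanks FILE1 FILE2
# Hypothetical helper: succeeds if the two files are byte-for-byte
# identical once every space and tab has been removed.
same_except_blanks() {
    tmp=$(mktemp) || return 2
    # LC_ALL=C makes tr work on bytes, so non-UTF-8 input is left alone.
    LC_ALL=C tr -d ' \t' < "$1" > "$tmp"
    LC_ALL=C tr -d ' \t' < "$2" | cmp -s "$tmp" -
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Usage with the file names from this comment, if they are present.
if [ -f 155382.txt ] && [ -f test.out ]; then
    if same_except_blanks 155382.txt test.out; then
        echo "only spaces/tabs changed"
    else
        echo "non-whitespace bytes differ"
    fi
fi
```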

Comment 3 Tim Waugh 2005-04-21 10:25:50 UTC

Works fine here, when the locale is set correctly.  Your test case is incorrect,
and should be using 'LC_ALL=en_US unexpand' as in:

  $ echo $LANG
  en_US.UTF-8
  $ echo Skyttä \
  | iconv -f utf-8 -t iso-8859-1 \
  | LC_ALL=en_US unexpand \
  | iconv -f iso-8859-1 -t utf-8

Otherwise unexpand will be expecting UTF-8 input.
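
To illustrate the mismatch: in ISO-8859-1, "ä" is the single byte 0xE4, which
in UTF-8 can only start a multi-byte sequence and is invalid on its own. The
sketch below assumes an iconv that rejects malformed input (glibc's does);
is_valid_utf8 is a name made up for illustration. It shows only the encoding
mismatch, not what unexpand does internally.

```shell
# is_valid_utf8: succeeds if standard input decodes cleanly as UTF-8.
# Hypothetical helper, not part of coreutils.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1
}

# "Skyttä" encoded as ISO-8859-1: the last letter is byte 0xE4 (octal 344).
if printf 'Skytt\344\n' | is_valid_utf8; then
    echo "ISO-8859-1 bytes: valid UTF-8"
else
    echo "ISO-8859-1 bytes: not valid UTF-8"
fi

# The same name encoded as UTF-8: "ä" is the two bytes 0xC3 0xA4.
if printf 'Skytt\303\244\n' | is_valid_utf8; then
    echo "UTF-8 bytes: valid UTF-8"
else
    echo "UTF-8 bytes: not valid UTF-8"
fi
```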

Comment 4 Ville Skyttä 2005-04-21 12:20:02 UTC

Well, I still think this is a bug: unlike unexpand, expand does the right
thing in this case, changing nothing but the spaces or tabs, without any LC_*
environment fiddling.

Reopening based on that.  If you disagree, I'd like to hear why expand and
unexpand are expected to behave differently regarding these accented characters
with the same environment settings and same input.

Comment 5 Tim Waugh 2005-04-21 12:29:26 UTC

No, you *must* set the encoding correctly, otherwise all bets are off.

You can't expect expand or unexpand to be able to do their job if they cannot
make sense of the input.  Garbage in, garbage out.
