182400 – p.e. sed 's/\d160//' doesn't work as expected

Summary: p.e. sed 's/\d160//' doesn't work as expected

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	sed
Sub Component:
Version:	5
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Petr Machata
QA Contact:	Ben Levenson
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-02-22 11:50 UTC by Karsten Hopp
Modified:	2015-05-05 01:32 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2006-03-03 09:15:32 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
file with ASCII 160 characters (56 bytes, text/plain) 2006-02-22 12:56 UTC, Karsten Hopp

Description Karsten Hopp 2006-02-22 11:50:54 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; de; rv:1.8.0.1) Gecko/20060202 Fedora/1.5.0.1-2 Firefox/1.5.0.1

Description of problem:
Replacing a character matched by its decimal ASCII value doesn't work

Version-Release number of selected component (if applicable):
sed-4.1.5-1

How reproducible:
Always

Steps to Reproduce:
1. edit a text file and add some weird characters (I chose the one with ASCII 160)
   With vim, you can enter those characters with CTRL-V160 while in insert mode
2. save file
3. try to delete these characters with sed:
   cat file | sed 's/\d160//' > newfile
4. open newfile (with vim) and watch how character 160 has been replaced with character 196 (move cursor to character and press 'ga')
  

Actual Results:  ASCII character 160 replaced with ASCII character 196

Expected Results:  ASCII character 160 should have been removed

Additional info:

Comment 1 Ignacio Vazquez-Abrams 2006-02-22 12:12:58 UTC

I'm seeing something very similar in FC4 with 4.1.4-1. It looks like sed doesn't
work properly with UTF-8. Please post the result of od -x against both files.

Comment 2 Karsten Hopp 2006-02-22 12:55:34 UTC

First a correction: character 160 has been replaced by character 194, not 196

[tmp] >cat file
Just  a very s hort test
file with  odd characters.

[tmp] >od -x file
0000000 754a 7473 a0c2 6120 7620 7265 2079 c273
0000020 68a0 726f 2074 6574 7473 0a20 6966 656c
0000040 7720 7469 2068 a0c2 646f 2064 6863 7261
0000060 6361 6574 7372 0a2e
0000070
[tmp] >od -x newfile
0000000 754a 7473 20c2 2061 6576 7972 7320 a0c2
0000020 6f68 7472 7420 7365 2074 660a 6c69 2065
0000040 6977 6874 c220 646f 2064 6863 7261 6361
0000060 6574 7372 0a2e
0000066

Comment 3 Karsten Hopp 2006-02-22 12:56:35 UTC

Created attachment 125023
file with ASCII 160 characters

Comment 4 Ignacio Vazquez-Abrams 2006-02-22 13:27:38 UTC

Yes, it definitely looks like it's only matching by byte instead of character.
It just so happens that the UTF-8 representation of \xa0 (which is \xc2\xa0)
ends in the byte \xa0 and this is throwing everything off.
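The byte-versus-character effect can be reproduced without vim. The following is a minimal sketch, assuming GNU sed (the \d160 escape is a GNU extension) and forcing byte-wise matching with LC_ALL=C; the helper names hexbytes and delete_byte_160 are made up for illustration.

```shell
# Print stdin as space-separated hex bytes on one line (fine for short inputs).
hexbytes() {
    od -An -v -t x1 | xargs echo
}

# Delete only the first byte with decimal value 160 (0xa0) on each line,
# as in the reported command. LC_ALL=C makes sed work on single bytes.
delete_byte_160() {
    LC_ALL=C sed 's/\d160//'
}

# "a", then U+00A0 encoded as UTF-8 (octal 302 240 = hex c2 a0), then "b".
printf 'a\302\240b\n' | hexbytes                    # 61 c2 a0 62 0a
# Only the trailing byte a0 is removed; the lead byte c2 (194) is left behind.
printf 'a\302\240b\n' | delete_byte_160 | hexbytes  # 61 c2 62 0a
```

The leftover c2 is the "character 194" seen in newfile.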

Comment 5 Petr Machata 2006-03-03 09:15:32 UTC

Following your recipe, I created the file with character #160. Vim encodes it
as UTF-8, creating the two-byte sequence 194,160. Now I pipe that file through
`sed s/\d160//`, which gives me a file containing just the byte 194, because
160 got erased. I think this is correct behavior.

Let's see how it works on UTF-8 characters:
$ echo "<A with acute, Á>" | od -t x1
0000000 c3 81 0a
$ echo "<O with acute, Ó>" | od -t x1
0000000 c3 93 0a
$ echo "Á" | sed 's/Á/Ó/' | od -t x1
0000000 c3 93 0a
result: ok, UTF-8-only text handling works as intended.

Let's see how it works on binary files:
$ echo "Á" | sed 's/\x81/\x82/' | od -t x1
0000000 c3 82 0a
$ echo "Á" | sed 's/\xc3/\xc2/' | od -t x1
0000000 c2 81 0a
result: ok, binary handling works as intended.

For me, it just works. I think per-byte handling is a feature of sed: if you work
with UTF-8 files, use full characters or their full codes, and the work gets
done. On binary files, just refer to the sequences by their byte values
explicitly, and the work gets done, too.
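Applied to the reporter's file, "full codes" would mean matching both bytes of the UTF-8 sequence for character 160. A minimal sketch, assuming GNU sed (for the \xHH escapes) and a hypothetical helper name strip_nbsp; the g flag is an addition so that every occurrence on a line is removed:

```shell
# Remove every occurrence of U+00A0 by matching its complete UTF-8
# encoding, the two bytes c2 a0. LC_ALL=C keeps the match byte-wise.
strip_nbsp() {
    LC_ALL=C sed 's/\xc2\xa0//g'
}

# "a", U+00A0 (octal 302 240), "b": both bytes of the sequence disappear,
# so od shows only the bytes for "a", "b" and the newline.
printf 'a\302\240b\n' | strip_nbsp | od -An -t x1
```

On the attached file this would be run as `strip_nbsp < file > newfile`.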

UTF-8 is defined so that encodings do not overlap.  This means that no
character is represented by a byte sequence that is itself contained in the
byte sequence representing another character.  So, for example, when looking
for the character 'd', \x64, you are guaranteed that \x64 is not part of any
other code point. So for me, sed gets the job done as intended.
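The \x64 guarantee follows from every byte of a multi-byte UTF-8 sequence having its high bit set (value 0x80 or above). A rough check of that property, using a hypothetical helper count_ascii_bytes and the characters already used in this report:

```shell
# Count the bytes on stdin that fall in the ASCII range (below 0x80).
# tr first deletes every byte in the octal range 200-377 (0x80-0xff).
count_ascii_bytes() {
    LC_ALL=C tr -d '\200-\377' | wc -c | tr -d ' '
}

# Á (c3 81), Ó (c3 93) and U+00A0 (c2 a0): none of their bytes is ASCII.
printf '\303\201\303\223\302\240' | count_ascii_bytes    # 0
# The same characters followed by a literal "d" (0x64): exactly one ASCII byte.
printf '\303\201\303\223\302\240d' | count_ascii_bytes   # 1
```

The same does not hold for a lone byte such as \xa0, which is not a complete UTF-8 character but the tail of a two-byte sequence.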

I'm closing it as notabug.
