Bug 182400 - p.e. sed 's/\d160//' doesn't work as expected
Status: CLOSED NOTABUG
Product: Fedora
Classification: Fedora
Component: sed
Version: 5
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Petr Machata
QA Contact: Ben Levenson
Reported: 2006-02-22 06:50 EST by Karsten Hopp
Modified: 2015-05-04 21:32 EDT (History)
Last Closed: 2006-03-03 04:15:32 EST


Attachments
file with ASCII 160 characters (56 bytes, text/plain)
2006-02-22 07:56 EST, Karsten Hopp

Description Karsten Hopp 2006-02-22 06:50:54 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; de; rv:1.8.0.1) Gecko/20060202 Fedora/1.5.0.1-2 Firefox/1.5.0.1

Description of problem:
Replacing a character matched by its decimal ASCII value doesn't work

Version-Release number of selected component (if applicable):
sed-4.1.5-1

How reproducible:
Always

Steps to Reproduce:
1. edit a text file and add some weird characters (I chose the one with ASCII 160)
   With vim, you can enter those characters with CTRL-V160 while in insert mode
2. save file
3. try to delete these characters with sed:
   cat file | sed 's/\d160//' > newfile
4. open newfile (with vim) and see that character 160 has been replaced with character 196 (move the cursor to the character and press 'ga')
  

Actual Results:  ASCII character 160 replaced with ASCII character 196

Expected Results:  ASCII character 160 should have been removed

Additional info:
Comment 1 Ignacio Vazquez-Abrams 2006-02-22 07:12:58 EST
I'm seeing something very similar in FC4 with 4.1.4-1. It looks like sed doesn't
work properly with UTF-8. Please post the result of od -x against both files.
Comment 2 Karsten Hopp 2006-02-22 07:55:34 EST
First a correction: character 160 has been replaced by character 194, not 196

[tmp] >cat file
Just  a very s hort test
file with  odd characters.

[tmp] >od -x file
0000000 754a 7473 a0c2 6120 7620 7265 2079 c273
0000020 68a0 726f 2074 6574 7473 0a20 6966 656c
0000040 7720 7469 2068 a0c2 646f 2064 6863 7261
0000060 6361 6574 7372 0a2e
0000070
[tmp] >od -x newfile
0000000 754a 7473 20c2 2061 6576 7972 7320 a0c2
0000020 6f68 7472 7420 7365 2074 660a 6c69 2065
0000040 6977 6874 c220 646f 2064 6863 7261 6361
0000060 6574 7372 0a2e
0000066
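
A note on reading these dumps: `od -x` prints two-byte words in host byte order, so on a little-endian machine (which these dumps suggest) the byte pair 0xC2 0xA0 appears swapped as `a0c2`. A minimal sketch of a byte-by-byte dump that avoids the swap, assuming a POSIX `od` and `printf`; the helper name `hex_bytes` is hypothetical:

```shell
# Hypothetical helper: dump stdin one byte at a time, as hex, without offsets.
hex_bytes() {
    od -An -t x1
}

# "Just", then U+00A0 encoded as UTF-8 (octal 302 240 = 0xC2 0xA0), then " a".
# These are the first eight bytes of the attached file.
printf 'Just\302\240 a' | hex_bytes
```

In this form the sequence reads in file order: 4a 75 73 74 c2 a0 20 61.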
Comment 3 Karsten Hopp 2006-02-22 07:56:35 EST
Created attachment 125023 [details]
file with ASCII 160 characters
Comment 4 Ignacio Vazquez-Abrams 2006-02-22 08:27:38 EST
Yes, it definitely looks like it's only matching by byte instead of character.
It just so happens that the UTF-8 representation of \xa0 (which is \xc2\xa0)
ends in the byte \xa0 and this is throwing everything off.
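
A minimal sketch of that effect, assuming a POSIX shell with `printf`, `sed`, and `od`. The helper name `strip_byte_a0` is hypothetical. It forces the C locale so that sed works on single bytes, and it builds the byte 0xA0 with an octal escape instead of relying on GNU sed's `\d160`:

```shell
# Hypothetical helper: delete the first 0xA0 byte on each line, byte-wise.
strip_byte_a0() {
    # printf '\240' emits the single byte 0xA0 (octal 240 = decimal 160).
    LC_ALL=C sed "s/$(printf '\240')//"
}

# Input: "a", U+00A0 encoded as UTF-8 (0xC2 0xA0), "b", newline.
printf 'a\302\240b\n' | strip_byte_a0 | od -An -t x1
```

The dump should show 61 c2 62 0a: the trailing byte 0xA0 is gone and the lead byte 0xC2 (decimal 194) is left behind, which matches the newfile dump in comment 2.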
Comment 5 Petr Machata 2006-03-03 04:15:32 EST
Following your recipe, I created a file containing character #160. Vim encodes it
as UTF-8, producing the two-byte sequence 194,160. Piping that file through
`sed s/\d160//` gives me a file containing just the byte 194, because
160 got erased. I think this is correct behavior.

Let's see how it works on UTF-8 characters:
$ echo "Á" | od -t x1    # A with acute
0000000 c3 81 0a
$ echo "Ó" | od -t x1    # O with acute
0000000 c3 93 0a
$ echo "Á" | sed 's/Á/Ó/' | od -t x1
0000000 c3 93 0a
result: ok, UTF-8-only text handling works as intended.

Let's see how it works on binary files:
$ echo "Á" | sed 's/\x81/\x82/' | od -t x1
0000000 c3 82 0a
$ echo "Á" | sed 's/\xc3/\xc2/' | od -t x1
0000000 c2 81 0a
result: ok, binary handling works as intended.

For me, it just works. I think per-byte handling is a feature of sed: if you work
with UTF-8 files, use full characters or their full byte sequences, and the work
gets done. On binary files, just name the bytes by their values explicitly, and
the work gets done, too.
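
Applied to the original report, this means matching the complete two-byte sequence rather than the byte 160 alone. A minimal sketch, assuming a POSIX shell and a file that is UTF-8 encoded; the helper name `strip_nbsp` is hypothetical, and the C locale is forced so that the match is byte-wise regardless of the sed version:

```shell
# Hypothetical helper: delete every UTF-8 encoded U+00A0 (bytes 0xC2 0xA0).
strip_nbsp() {
    # printf '\302\240' emits the two bytes 0xC2 0xA0 (decimal 194, 160).
    LC_ALL=C sed "s/$(printf '\302\240')//g"
}

# Input: "a", U+00A0 as UTF-8, "b", newline.
printf 'a\302\240b\n' | strip_nbsp | od -An -t x1
```

The dump should show 61 62 0a, with no stray byte 194 left over. With the file from the report, the call would be `strip_nbsp < file > newfile`.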

UTF-8 is defined so that encodings do not overlap: the byte sequence that
represents one character never occurs inside a byte sequence that represents
other characters. So, for example, when looking for the character 'd', \x64,
you are guaranteed that \x64 is not part of any other code point's encoding.
So for me, sed gets the job done as intended.
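
A small illustration of the 'd' example, assuming a POSIX shell; the helper name `replace_d` is hypothetical. Even with sed forced to work byte by byte, replacing the byte 0x64 leaves the multi-byte characters around it untouched, because none of their bytes can be 0x64:

```shell
# Hypothetical helper: replace every ASCII 'd' (0x64) byte with 'X' (0x58).
replace_d() {
    LC_ALL=C sed 's/d/X/g'
}

# Input: 'd', A with acute (0xC3 0x81), 'd', U+00A0 (0xC2 0xA0), newline.
printf 'd\303\201d\302\240\n' | replace_d | od -An -t x1
```

The dump should show 58 c3 81 58 c2 a0 0a: both 'd' bytes are replaced and the two multi-byte sequences come through intact.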

I'm closing it as notabug.
