From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; de; rv:1.8.0.1) Gecko/20060202 Fedora/1.5.0.1-2 Firefox/1.5.0.1

Description of problem:
Replacing a character matched by its decimal ASCII value doesn't work.

Version-Release number of selected component (if applicable):
sed-4.1.5-1

How reproducible:
Always

Steps to Reproduce:
1. Edit a text file and add some weird characters (I chose the one with ASCII 160). With vim, you can enter those characters with CTRL-V160 while in insert mode.
2. Save the file.
3. Try to delete these characters with sed: cat file | sed 's/\d160//' > newfile
4. Open newfile (with vim) and watch how character 160 has been replaced with character 196 (move the cursor to the character and press 'ga').

Actual Results:
ASCII character 160 replaced with ASCII character 196

Expected Results:
ASCII character 160 should have been removed

Additional info:
I'm seeing something very similar in FC4 with 4.1.4-1. It looks like sed doesn't work properly with UTF-8. Please post the result of od -x against both files.
First, a correction: character 160 has been replaced by character 194, not 196.

[tmp] >cat file
Just a very s hort test
file with odd characters.
[tmp] >od -x file
0000000 754a 7473 a0c2 6120 7620 7265 2079 c273
0000020 68a0 726f 2074 6574 7473 0a20 6966 656c
0000040 7720 7469 2068 a0c2 646f 2064 6863 7261
0000060 6361 6574 7372 0a2e
0000070
[tmp] >od -x newfile
0000000 754a 7473 20c2 2061 6576 7972 7320 a0c2
0000020 6f68 7472 7420 7365 2074 660a 6c69 2065
0000040 6977 6874 c220 646f 2064 6863 7261 6361
0000060 6574 7372 0a2e
0000066
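A note on reading these dumps: `od -x` prints 16-bit words in host byte order, so on a little-endian machine such as i686 the word `a0c2` stands for the bytes c2 a0 in file order. The following is a minimal sketch, assuming a POSIX shell and `od`; the sample string and the helper name `dump_bytes` are made up for illustration and are not part of the report.

```shell
# Hypothetical helper: print stdin one byte at a time, in file order.
dump_bytes() {
  od -An -t x1
}

# "Just", then U+00A0 encoded as UTF-8 (octal \302\240 = hex c2 a0), then " a".
printf 'Just\302\240 a\n' | dump_bytes

# The same input as 16-bit words. On a little-endian host the byte pair
# c2 a0 is printed as the single word a0c2.
printf 'Just\302\240 a\n' | od -x
```

Dumping with `-t x1` avoids the byte swapping and makes the c2 a0 pair easier to spot.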
Created attachment 125023
file with ASCII 160 characters
Yes, it definitely looks like sed is matching by byte instead of by character. It just so happens that the UTF-8 representation of character 160 (U+00A0), which is \xc2\xa0, ends in the byte \xa0. sed removes only that byte and leaves \xc2 (194) behind, and this is throwing everything off.
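The byte-level effect can be reproduced without vim. This is only a sketch, assuming a POSIX shell: forcing `LC_ALL=C` and handing sed the raw byte are choices made here to get byte semantics on any sed, whereas the report used `\d160` with sed 4.1.5. The function name `strip_a0_byte` is hypothetical.

```shell
# Hypothetical helper: delete the first raw byte 0xa0 (octal 240) on each
# line, i.e. a byte-wise match rather than a character-wise one.
strip_a0_byte() {
  a0=$(printf '\240')
  LC_ALL=C sed "s/${a0}//"
}

# U+00A0 is c2 a0 in UTF-8. Only the a0 half matches, so the lone
# byte c2 (decimal 194) is left behind between 'a' and 'b'.
printf 'a\302\240b\n' | strip_a0_byte | od -An -t x1
```

Because the substitution has no `g` flag, only the first a0 byte on each line is removed, which matches the dump of newfile, where the second occurrence on the first line survives intact.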
Following your recipe, I created the file with character 160. Vim encodes it as UTF-8, creating the two-byte sequence 194, 160. Now I pipe that file through `sed s/\d160//`, which gives me a file containing just the character 194, because 160 got erased. I think this is correct behavior.

Let's see how it works on UTF-8 characters (Á is A with acute, Ó is O with acute):

$ echo "Á" | od -t x1
0000000 c3 81 0a
$ echo "Ó" | od -t x1
0000000 c3 93 0a
$ echo "Á" | sed 's/Á/Ó/' | od -t x1
0000000 c3 93 0a

Result: ok, UTF-8-only text handling works as intended.

Let's see how it works on binary files:

$ echo "Á" | sed 's/\x81/\x82/' | od -t x1
0000000 c3 82 0a
$ echo "Á" | sed 's/\xc3/\xc2/' | od -t x1
0000000 c2 81 0a

Result: ok, binary handling works as intended.

For me, it just works. I think per-byte handling is a feature of sed: if you work with UTF-8 files, use full characters or their full codes, and the work gets done. On binary files, just refer to the sequences by their value explicitly, and the work gets done, too.

UTF-8 is defined with a non-overlap principle: the byte sequence that represents one character is never contained in the byte sequence that represents another character. So, for example, when looking for the character 'd', \x64, you are guaranteed that \x64 is not part of any other code point.

So for me, sed gets the job done as intended. I'm closing it as notabug.
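To illustrate the advice to use the full code of a character, the no-break space can be removed by matching both of its UTF-8 bytes. This is a sketch rather than a definitive recipe, assuming a POSIX shell, with `LC_ALL=C` forced so that sed treats the pattern as plain bytes; the function name `strip_nbsp` is hypothetical, and the `g` flag is an addition that the original command did not have.

```shell
# Hypothetical helper: delete every UTF-8 encoded U+00A0 (bytes c2 a0).
strip_nbsp() {
  nbsp=$(printf '\302\240')
  LC_ALL=C sed "s/${nbsp}//g"
}

# Both no-break spaces disappear and no stray c2 byte is left behind.
printf 'a\302\240b\302\240c\n' | strip_nbsp | od -An -t x1

# U+00E0 (a with grave) is c3 a0 in UTF-8. It shares the trailing a0 byte
# but not the full sequence, so it passes through untouched.
printf 'a\302\240\303\240\n' | strip_nbsp | od -An -t x1
```

The second call shows the non-overlap property in practice: matching the complete sequence c2 a0 cannot damage another character that merely ends in a0.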