Bug 182400
| Field | Value |
|---|---|
| Summary | p.e. sed 's/\d160//' doesn't work as expected |
| Product | [Fedora] Fedora |
| Component | sed |
| Version | 5 |
| Status | CLOSED NOTABUG |
| Severity | medium |
| Priority | medium |
| Hardware | All |
| OS | Linux |
| Reporter | Karsten Hopp <karsten> |
| Assignee | Petr Machata <pmachata> |
| QA Contact | Ben Levenson <benl> |
| CC | ivazqueznet, mnewsome |
| Doc Type | Bug Fix |
| Last Closed | 2006-03-03 09:15:32 UTC |

Description
Karsten Hopp
2006-02-22 11:50:54 UTC
I'm seeing something very similar in FC4 with 4.1.4-1. It looks like sed doesn't work properly with UTF-8.

Please post the result of od -x against both files.

First a correction: character 160 has been replaced by character 194, not 196.

    [tmp] >cat file
    Just a very s hort test
    file with odd characters.
    [tmp] >od -x file
    0000000 754a 7473 a0c2 6120 7620 7265 2079 c273
    0000020 68a0 726f 2074 6574 7473 0a20 6966 656c
    0000040 7720 7469 2068 a0c2 646f 2064 6863 7261
    0000060 6361 6574 7372 0a2e
    0000070
    [tmp] >od -x newfile
    0000000 754a 7473 20c2 2061 6576 7972 7320 a0c2
    0000020 6f68 7472 7420 7365 2074 660a 6c69 2065
    0000040 6977 6874 c220 646f 2064 6863 7261 6361
    0000060 6574 7372 0a2e
    0000066

Created attachment 125023: file with ASCII 160 characters
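
The `od -x` dumps above are easy to misread, because `od -x` prints 16-bit words in host byte order, so on a little-endian machine each pair of bytes appears swapped (`a0c2` is the byte sequence `c2 a0`). The following is a minimal Python sketch, not part of the original report, that undoes the swap for the first dump line of each file. The helper name `od_x_words_to_bytes` is hypothetical, and a little-endian host is an assumption (it is consistent with `754a` decoding to "Ju").

```python
def od_x_words_to_bytes(words: str) -> bytes:
    """Turn whitespace-separated 16-bit hex words from `od -x` into bytes.

    Assumes the dump was produced on a little-endian host, so the low
    byte of each word comes first in the file.
    """
    return b"".join(int(word, 16).to_bytes(2, "little") for word in words.split())


# First line of `od -x file` (before sed) and `od -x newfile` (after sed).
first_line_before = od_x_words_to_bytes("754a 7473 a0c2 6120 7620 7265 2079 c273")
first_line_after = od_x_words_to_bytes("754a 7473 20c2 2061 6576 7972 7320 a0c2")

print(first_line_before)  # b'Just\xc2\xa0 a very s\xc2'
print(first_line_after)   # b'Just\xc2 a very s\xc2\xa0'
```

Read this way, the first `c2 a0` pair in the input has lost its `a0` byte in the output, leaving a lone `c2` (decimal 194) in front of the space.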
Yes, it definitely looks like it's only matching by byte instead of by character. It just so happens that the UTF-8 representation of \xa0 (which is \xc2\xa0) ends in the byte \xa0, and this is throwing everything off.

Following your recipe, I created the file with a character #160. Vim encodes it to UTF-8, creating the two-byte sequence 194,160. Now I pipe that file through `sed s/\d160//`, which gives me a file containing just the character 194, because 160 got erased. I think this is correct behavior.

Let's see how it works on UTF-8 characters:

    $ echo "<A with acute, Á>" | od -t x1
    0000000 c3 81 0a
    $ echo "<O with acute, Ó>" | od -t x1
    0000000 c3 93 0a
    $ echo "Á" | sed 's/Á/Ó/' | od -t x1
    0000000 c3 93 0a

Result: ok, UTF-8-only text handling works as intended.

Let's see how it works on binary files:

    $ echo "Á" | sed 's/\x81/\x82/' | od -t x1
    0000000 c3 82 0a
    $ echo "Á" | sed 's/\xc3/\xc2/' | od -t x1
    0000000 c2 81 0a

Result: ok, binary handling works as intended.

For me, it just works. I think per-byte handling is a feature of sed: if you work with UTF-8 files, use full characters or their full codes, and the work gets done. On binary files, just refer to the sequences explicitly by their byte values, and the work gets done, too.

UTF-8 defines the principle of non-overlap: no character is represented by a byte sequence that is itself contained in the byte sequence representing other characters. So, for example, when looking for the character 'd', \x64, you are guaranteed that \x64 is not part of any other code point.

So for me, sed gets the job done as intended. I'm closing it as NOTABUG.
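
The distinction drawn above between deleting a byte and deleting a character can be modeled outside sed. The following Python sketch is an illustration under stated assumptions, not a reproduction of sed's internals: `bytes.replace` stands in for a byte-level match of `\d160`, `str.replace` stands in for a character-level match, and the helper names are hypothetical. It also spot-checks the non-overlap property for the Basic Multilingual Plane only.

```python
NBSP = "\u00a0"  # character 160, encoded in UTF-8 as the two bytes c2 a0

text = f"Just{NBSP} a very s{NBSP}hort test"
raw = text.encode("utf-8")

# Byte-level model: delete the first byte 0xA0, leaving its lead byte 0xC2 behind.
bytewise = raw.replace(b"\xa0", b"", 1)

# Character-level model: delete the first U+00A0 as a whole character.
charwise = text.replace(NBSP, "", 1).encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def multibyte_sequences_are_disjoint_from_ascii() -> bool:
    """Check, for U+0080..U+FFFF, that every multi-byte encoding starts with a
    lead byte >= 0xC0 and continues only with bytes in 0x80..0xBF, so an ASCII
    byte such as 0x64 ('d') can never occur inside another character."""
    for code_point in range(0x80, 0x10000):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        encoded = chr(code_point).encode("utf-8")
        if encoded[0] < 0xC0:
            return False
        if any(not 0x80 <= b <= 0xBF for b in encoded[1:]):
            return False
    return True


print(bytewise, is_valid_utf8(bytewise))
print(charwise, is_valid_utf8(charwise))
print(multibyte_sequences_are_disjoint_from_ascii())
```

In this model the byte-level deletion leaves a stray `c2` that no longer decodes as UTF-8, while the character-level deletion stays valid. Note that a lone `a0` is only a continuation byte, not the full code of any character, so the non-overlap guarantee applies to complete sequences such as `c2 a0`, not to that single byte.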