Red Hat Bugzilla – Bug 440419
sed fails to properly process lines with non-ASCII characters
Last modified: 2015-05-04 21:33:52 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:18.104.22.168) Gecko/20080311 Firefox/22.214.171.124
Description of problem:
Sed apparently fails to apply search/replace with regular expressions when input lines contain non-ASCII characters.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Download the attached text file. Its third line contains a non-ASCII character with byte value 0xe8.
2. Feed the text file to sed with the following line:
$ sed -e "s/\".*$//g" text-with-non-ASCII-char.txt
define("CDEF","Questo ? un bug
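NOTE: the '?' stands for the byte 0xe8. The expected output for this line is 'define(', since the expression removes everything from the first double quote onward.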
I verified the sed package with "rpm -V sed" and found it ok.
sed appears to behave correctly on ASCII and UTF-8 input.
I tried the same file with cygwin sed (same 4.1.5 version) and it worked properly.
Created attachment 300253
This file contains a non-ASCII character on the third line.
You can use this file to reproduce the bug.
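The attachment itself is not reproduced in this report. As a rough stand-in, a file with the same kind of content can be generated locally. In the sketch below, the text of all three lines is an assumption; only the 0xe8 byte on the third line comes from the description above.

```shell
# Hypothetical stand-in for the attachment: three lines, with a single
# 0xe8 byte (octal 350) on the third one. The wording of the lines is assumed.
f=$(mktemp)
printf 'first line\nsecond line\ndefine("CDEF","Questo \350 un bug");\n' > "$f"

# Count the bytes outside the ASCII range. LC_ALL=C makes tr work on raw bytes.
non_ascii=$(LC_ALL=C tr -d '\000-\177' < "$f" | wc -c | tr -d ' ')
echo "non-ASCII bytes: $non_ascii"   # prints "non-ASCII bytes: 1"

rm -f "$f"
```

printf with an octal escape is used instead of echo -e because it behaves the same across shells.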
sed uses re_search to find the next matching sequence to replace.
re_search is a libc function and respects your current locale. I guess you
have a UTF-8 locale on your system, as is the default in Fedora. In UTF-8, a
lone 0xe8 byte is not a valid character, so '.' does not match it. When
processing such files, you have to either choose the right iso-8859-x locale
or simply drop down to the C locale:
$ echo -e 'a\xe8a' | LANG=en_US.utf-8 ./sed/sed 's/./!/g'
$ echo -e 'a\xe8a' | LANG=en_US.iso-8859-1 ./sed/sed 's/./!/g'
$ echo -e 'a\xe8a' | LANG=C ./sed/sed 's/./!/g'
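The C-locale workaround can also be applied to the expression from the original report. The following is a minimal sketch, assuming a system-installed sed on PATH rather than the ./sed/sed build-tree binary used above. The sample input line is a hypothetical reconstruction, not the attachment's exact content. LC_ALL is set instead of LANG because it takes precedence over every other locale variable.

```shell
# In the C locale every byte is a single character, so '.' matches 0xe8 too.
all_replaced=$(printf 'a\350a\n' | LC_ALL=C sed 's/./!/g')
echo "$all_replaced"   # prints "!!!"

# The reporter's expression applied to a hypothetical line with a 0xe8 byte.
stripped=$(printf 'define("CDEF","Questo \350 un bug");\n' | LC_ALL=C sed -e 's/".*$//g')
echo "$stripped"       # prints "define("
```

Setting LC_ALL on the command line only affects that single sed invocation, so the rest of the session keeps its UTF-8 locale.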