Bug 440419 - sed fails to properly process lines with non-ASCII characters
Product: Fedora
Classification: Fedora
Component: sed
Hardware: i686 Linux
Priority: low, Severity: medium
Assigned To: Petr Machata
QA Contact: Fedora Extras Quality Assurance
Reported: 2008-04-03 10:08 EDT by Massimiliano Pagani
Modified: 2015-05-04 21:33 EDT (History)
Doc Type: Bug Fix
Last Closed: 2008-06-25 10:50:57 EDT

Attachments
This file contains a non-ASCII character on the third line. (120 bytes, application/octet-stream)
2008-04-03 10:10 EDT, Massimiliano Pagani

Description Massimiliano Pagani 2008-04-03 10:08:20 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv: Gecko/20080311 Firefox/

Description of problem:
sed apparently fails to apply a regular-expression search and replace when input lines contain non-ASCII characters.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Download the attached text file. The third line contains a non-ASCII character with value 0xe8.
2. Feed the text file to sed with the following command:

$ sed  -e "s/\".*$//g" text-with-non-ASCII-char.txt

Actual Results:
define("CDEF","Questo ? un bug

NOTE: the '?' is actually the 0xe8 character code.
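
One way to confirm which byte is behind the '?' is to dump the text in hexadecimal. The sketch below is only an illustration and not part of the original report: hex_bytes is a hypothetical helper name, and the sample string merely imitates the fragment shown above, with 0xe8 written as the octal escape \350.

```shell
# hex_bytes is a hypothetical helper (not from the report): it prints each
# input byte as two hex digits using the POSIX od utility.
hex_bytes() {
    od -An -tx1
}

# Sample text imitating the fragment above; \350 is octal for byte 0xe8.
printf 'Questo \350 un bug\n' | hex_bytes
```

If the lone byte is present, e8 appears in the dump; a UTF-8 encoded è would instead appear as the pair c3 a8. Running the same helper on the real attachment (hex_bytes < text-with-non-ASCII-char.txt) would show its actual bytes.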

Expected Results:

Additional info:
I verified the sed package with "rpm -V sed" and found it ok.

sed's behaviour appears to be fine with ASCII and UTF-8 input.

I tried the same file with Cygwin sed (the same version, 4.1.5) and it worked properly.
Comment 1 Massimiliano Pagani 2008-04-03 10:10:32 EDT
Created attachment 300253
This file contains a non-ASCII character on the third line.

You can use this file to reproduce the bug.
Comment 2 Petr Machata 2008-06-25 10:50:57 EDT
sed uses re_search to find the next matching sequence that should be replaced.
re_search is a libc function and respects your current locale.  I guess you
have a UTF-8 locale on your system, as is the default in Fedora.  In UTF-8, a
lone 0xe8 byte is not a valid character, so "." does not match it.  When
processing such files, you have to either choose the right iso-8859-x locale
or simply drop down to the C locale:

$ echo -e 'a\xe8a' | LANG=en_US.utf-8 ./sed/sed 's/./!/g'
$ echo -e 'a\xe8a' | LANG=en_US.iso-8859-1 ./sed/sed 's/./!/g'
$ echo -e 'a\xe8a' | LANG=C ./sed/sed 's/./!/g'
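
The same locale workaround can be applied to the substitution from the original report. This is a minimal sketch under stated assumptions: strip_from_first_quote is a hypothetical helper name, the sample line only guesses at the shape of the attachment's third line, and LC_ALL is used instead of LANG because LC_ALL takes precedence when it or LC_CTYPE is already set in the environment.

```shell
# strip_from_first_quote is a hypothetical helper (not part of sed or the
# report): it runs the reporter's substitution with the C locale forced,
# so every byte, including 0xe8, is treated as a single character.
strip_from_first_quote() {
    LC_ALL=C sed -e 's/".*$//g'
}

# Assumed sample line shaped like the one in the report;
# \350 is octal for byte 0xe8.
printf 'define("CDEF","Questo \350 un bug");\n' | strip_from_first_quote
```

With the C locale forced, the pattern should match from the first double quote to the end of the line even when 0xe8 is present, leaving only the text before that quote.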
