Bug 440419

Summary: sed fails to properly process lines with non-ASCII characters
Product: [Fedora] Fedora
Reporter: Massimiliano Pagani <maxpag>
Component: sed
Assignee: Petr Machata <pmachata>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 8
CC: mnewsome
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2008-06-25 14:50:57 UTC
Attachments:
This file contains a non-ASCII character on the third line. (Flags: none)

Description Massimiliano Pagani 2008-04-03 14:08:20 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13

Description of problem:
Sed apparently fails to apply search/replace with regular expressions when input lines contain non-ASCII characters.

Version-Release number of selected component (if applicable):
sed-4.1.5-9.fc8

How reproducible:
Always


Steps to Reproduce:
1. Download the attached text file. The third line contains a non-ASCII character with value 0xe8.
2. Feed the text file to sed with the following line:

$ sed  -e "s/\".*$//g" text-with-non-ASCII-char.txt


Actual Results:
define(
define(
define("CDEF","Questo ? un bug
define(

NOTE: the '?' is actually the 0xe8 character code.

Expected Results:
define(
define(
define(
define(


Additional info:
I verified the sed package with "rpm -V sed" and found it ok.

Apparently sed behaves correctly on ASCII and UTF-8 input.

I tried the same file with cygwin sed (same 4.1.5 version) and it worked properly.

Comment 1 Massimiliano Pagani 2008-04-03 14:10:32 UTC
Created attachment 300253
This file contains a non-ASCII character on the third line.

You can use this file to reproduce the bug.

Comment 2 Petr Machata 2008-06-25 14:50:57 UTC
sed uses re_search to find the next matching sequence that should be replaced.
re_search is a libc function, and respects your current locale.  I guess you
have a UTF-8 locale on your system, as is the default in Fedora.  In UTF-8, a
lone 0xe8 byte is not a valid character, so "." does not match it.  When
processing such files, you have to either choose the right iso-8859-x
encoding, or simply drop down to the C locale:

$ echo -e 'a\xe8a' | LANG=en_US.utf-8 ./sed/sed 's/./!/g'
!è!
$ echo -e 'a\xe8a' | LANG=en_US.iso-8859-1 ./sed/sed 's/./!/g'
!!!
$ echo -e 'a\xe8a' | LANG=C ./sed/sed 's/./!/g'
!!!
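
The same workaround can be applied to the command from the original report. The following is only a minimal sketch under stated assumptions: the temporary input file and the contents of its first, second and fourth lines are made up for illustration (the real attachment is not reproduced here); only the third line follows the report. It uses LC_ALL rather than LANG, because LC_ALL overrides LANG and LC_CTYPE if either is already set in the environment.

```shell
# Build a small stand-in for the attached file (hypothetical contents).
# Line 3 contains the single byte 0xe8 (octal 350), which is "è" in
# ISO-8859-1 but not a valid sequence on its own in UTF-8.
input=$(mktemp)
printf 'define("ABCD","first");\n'            >  "$input"
printf 'define("BCDE","second");\n'           >> "$input"
printf 'define("CDEF","Questo \350 un bug");\n' >> "$input"
printf 'define("DEFG","fourth");\n'           >> "$input"

# In the C locale every byte counts as one character, so ".*" also
# consumes the 0xe8 byte and the match reaches the end of the line.
result=$(LC_ALL=C sed -e 's/".*$//' "$input")
printf '%s\n' "$result"

rm -f "$input"
```

With the C locale all four lines should be reduced to "define(", which is the expected result from the description. If the file is known to be ISO-8859-1, converting it first (for example with iconv -f ISO-8859-1 -t UTF-8) and keeping the UTF-8 locale is another option.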