Bug 440419 - sed fails to properly process lines with non-ASCII characters
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: sed
Version: 8
Hardware: i686
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Petr Machata
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-04-03 14:08 UTC by Massimiliano Pagani
Modified: 2015-05-05 01:33 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-06-25 14:50:57 UTC
Type: ---
Embargoed:


Attachments
This file contains a non-ASCII character on the third line. (120 bytes, application/octet-stream)
2008-04-03 14:10 UTC, Massimiliano Pagani

Description Massimiliano Pagani 2008-04-03 14:08:20 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13

Description of problem:
Sed apparently fails to apply search/replace with regular expressions when input lines contain non-ASCII characters.

Version-Release number of selected component (if applicable):
sed-4.1.5-9.fc8

How reproducible:
Always


Steps to Reproduce:
1. Download the attached text file. The third line contains a non-ASCII character with value 0xe8.
2. Feed the text file to sed with the following line:

$ sed  -e "s/\".*$//g" text-with-non-ASCII-char.txt


Actual Results:
define(
define(
define("CDEF","Questo ? un bug
define(

NOTE: the '?' is actually the 0xe8 character code.

Expected Results:
define(
define(
define(
define(


Additional info:
I verified the sed package with "rpm -V sed" and found it ok.

Apparently sed behaviour is fine on ASCII and utf-8 character sets.

I tried the same file with cygwin sed (same 4.1.5 version) and it worked properly.

Comment 1 Massimiliano Pagani 2008-04-03 14:10:32 UTC
Created attachment 300253
This file contains a non-ASCII character on the third line.

You can use this file to reproduce the bug.

Comment 2 Petr Machata 2008-06-25 14:50:57 UTC
sed uses re_search to find the next matching sequence that should be replaced.
re_search is a libc function, and respects your current locale.  I guess you
have a UTF-8 locale on your system, as is the default in Fedora.  In UTF-8, a
lone 0xe8 byte is not a valid character, so "." does not match it.  When
processing such files, you have to either choose the right iso-8859-x
encoding, or simply drop down to the C locale:

$ echo -e 'a\xe8a' | LANG=en_US.utf-8 ./sed/sed 's/./!/g'
!è!
$ echo -e 'a\xe8a' | LANG=en_US.iso-8859-1 ./sed/sed 's/./!/g'
!!!
$ echo -e 'a\xe8a' | LANG=C ./sed/sed 's/./!/g'
!!!
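
Applied to the command from the original report, the same idea might look like the sketch below. The sample line is an assumption reconstructed from the reported output (the attachment itself is not reproduced here), and the function names are hypothetical. LC_ALL is used instead of LANG because it takes precedence over both LANG and LC_CTYPE if those happen to be set.

```shell
#!/bin/sh
# Hypothetical stand-in for the third line of the attachment:
# ISO-8859-1 text containing a raw 0xe8 byte (octal 350).
sample=$(mktemp)
printf 'define("CDEF","Questo \350 un bug");\n' > "$sample"

# Workaround 1: run sed in the C locale, where every byte is one character.
strip_in_c_locale() {
    LC_ALL=C sed -e 's/".*$//g' "$1"
}

# Workaround 2: convert the input to UTF-8 first, so that it is valid
# in a UTF-8 locale, then run sed as usual.
strip_after_conversion() {
    iconv -f ISO-8859-1 -t UTF-8 "$1" | sed -e 's/".*$//g'
}

strip_in_c_locale "$sample"
strip_after_conversion "$sample"
```

Both calls should print the truncated line from the "Expected Results" section. The iconv variant assumes the file really is ISO-8859-1; the C locale variant makes no assumption about the encoding, but treats multibyte characters as separate bytes.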

