440419 – sed fails to properly process lines with non-ASCII characters

Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	sed
Version:	8
Hardware:	i686
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Petr Machata
QA Contact:	Fedora Extras Quality Assurance

Reported:	2008-04-03 14:08 UTC by Massimiliano Pagani
Modified:	2015-05-05 01:33 UTC
CC List:	1 user
Last Closed:	2008-06-25 14:50:57 UTC
Type:	---

Attachments
This file contains a non-ASCII character on the third line. (120 bytes, application/octet-stream) 2008-04-03 14:10 UTC, Massimiliano Pagani

Description Massimiliano Pagani 2008-04-03 14:08:20 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13

Description of problem:
Sed apparently fails to apply search/replace with regular expressions when input lines contain non-ASCII characters.

Version-Release number of selected component (if applicable):
sed-4.1.5-9.fc8

How reproducible:
Always


Steps to Reproduce:
1. Download the attached text file. The third line contains a non-ASCII character with byte value 0xe8.
2. Feed the text file to sed with the following line:

$ sed  -e "s/\".*$//g" text-with-non-ASCII-char.txt


Actual Results:
define(
define(
define("CDEF","Questo ? un bug
define(

NOTE: the '?' is actually the 0xe8 character code.

Expected Results:
define(
define(
define(
define(


Additional info:
I verified the sed package with "rpm -V sed" and found it ok.

Apparently sed behaves correctly when the input is plain ASCII or UTF-8.

I tried the same file with cygwin sed (same 4.1.5 version) and it worked properly.

Comment 1 Massimiliano Pagani 2008-04-03 14:10:32 UTC

Created attachment 300253
This file contains a non-ASCII character on the third line.

You can use this file to reproduce the bug.

Comment 2 Petr Machata 2008-06-25 14:50:57 UTC

sed uses re_search to find the next matching sequence that should be replaced.
re_search is a libc function, and respects your current locale.  I guess you
have a UTF-8 locale on your system, as is the default in Fedora.  In UTF-8, a
lone 0xe8 byte is not a valid character, so "." does not match it.  When
processing such files, you have to either choose the right iso-8859-x
encoding, or simply drop down to the C locale:

$ echo -e 'a\xe8a' | LANG=en_US.utf-8 ./sed/sed 's/./!/g'
!è!
$ echo -e 'a\xe8a' | LANG=en_US.iso-8859-1 ./sed/sed 's/./!/g'
!!!
$ echo -e 'a\xe8a' | LANG=C ./sed/sed 's/./!/g'
!!!
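
Applied to the command from the report, the C-locale workaround might look
like the sketch below.  The input line is a hypothetical stand-in for the
third line of the attachment (the real file is not reproduced here); \350 is
the octal escape for byte 0xe8.  LC_ALL is used instead of LANG as an
assumption about the caller's environment: it overrides any LC_CTYPE or LANG
value that is already set.

```shell
# Hypothetical stand-in for the attachment's third line; \350 is byte 0xe8.
# In the C locale every byte counts as one character, so ".*" can run across
# 0xe8 and the match extends from the first double quote to the end of line.
result=$(printf 'define("CDEF","Questo \350 un bug");\n' \
  | LC_ALL=C sed -e 's/".*$//g')

printf '%s\n' "$result"   # prints: define(
```

For the reporter's file, the same idea would be
LC_ALL=C sed -e "s/\".*$//g" text-with-non-ASCII-char.txt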
