Bug 440419

Summary:

sed fails to properly process lines with non-ASCII characters

Product:

[Fedora] Fedora

Reporter:

Massimiliano Pagani <maxpag>

Component:

sed

Assignee:

Petr Machata <pmachata>

Status:

CLOSED NOTABUG

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

medium

Docs Contact:

Priority:

low

Version:

CC:

mnewsome

Target Milestone:

---

Target Release:

---

Hardware:

i686

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Last Closed:

2008-06-25 14:50:57 UTC

Attachments:

Description	Flags
This file contains a non-ASCII character on the third line.	none

Description Massimiliano Pagani 2008-04-03 14:08:20 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13

Description of problem:
Sed apparently fails to apply search/replace with regular expressions when input lines contain non-ASCII characters.

Version-Release number of selected component (if applicable):
sed-4.1.5-9.fc8

How reproducible:
Always


Steps to Reproduce:
1. Download the attached text file. The third line contains a non-ASCII character with value 0xe8.
2. Feed the text file to sed with the following line:

$ sed  -e "s/\".*$//g" text-with-non-ASCII-char.txt


Actual Results:
define(
define(
define("CDEF","Questo ? un bug
define(

NOTE: the '?' is actually the 0xe8 character code.

Expected Results:
define(
define(
define(
define(


Additional info:
I verified the sed package with "rpm -V sed" and found it ok.

Apparently sed behaviour is fine on ASCII and utf-8 character sets.

I tried the same file with cygwin sed (same 4.1.5 version) and it worked properly.

Comment 1 Massimiliano Pagani 2008-04-03 14:10:32 UTC

Created attachment 300253
This file contains a non-ASCII character on the third line.

You can use this file to reproduce the bug.

Comment 2 Petr Machata 2008-06-25 14:50:57 UTC

sed uses re_search to find the next matching sequence that should be replaced.
re_search is a libc function, and respects your current locale.  I guess you
have a UTF-8 locale on your system, as is the default in Fedora.  In UTF-8, a
lone 0xe8 byte is not a valid character, so "." does not match it and ".*"
stops there.  When processing such files, you have to either choose the right
iso-8859-x encoding, or simply drop down to the C locale:

$ echo -e 'a\xe8a' | LANG=en_US.utf-8 ./sed/sed 's/./!/g'
!è!
$ echo -e 'a\xe8a' | LANG=en_US.iso-8859-1 ./sed/sed 's/./!/g'
!!!
$ echo -e 'a\xe8a' | LANG=C ./sed/sed 's/./!/g'
!!!
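
Applied to the command from the original report, the same workaround might look like the sketch below. The temporary file and its contents are made up for illustration (they are not the real attachment); only the 0xe8 byte on the third line and the sed expression come from the report. LC_ALL is used here instead of LANG on the assumption that it should win over any other locale variable already set in the environment.

```shell
# Build a stand-in for the attachment: the third line carries a raw 0xe8 byte
# (octal 350), which is "è" in ISO-8859-1 but not valid UTF-8 on its own.
# File name and line contents are hypothetical.
f=$(mktemp)
printf 'define("A","one");\n'                   >  "$f"
printf 'define("B","two");\n'                   >> "$f"
printf 'define("CDEF","Questo \350 un bug");\n' >> "$f"
printf 'define("D","four");\n'                  >> "$f"

# In the C locale every byte is a character, so ".*" runs to end of line.
result=$(LC_ALL=C sed -e 's/".*$//g' "$f")
printf '%s\n' "$result"

rm -f "$f"
```

With the C locale forced, all four lines should be cut down to "define(", which is the output the reporter expected.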