Bug 527427

Summary: sed reports syntax errors with some multibyte characters
Product: Red Hat Enterprise Linux 5
Reporter: john.haxby <john.haxby>
Component: sed
Assignee: Vojtech Vitek <vvitek>
Status: CLOSED ERRATA
QA Contact: BaseOS QE <qe-baseos-auto>
Severity: medium
Priority: low
Version: 5.4
CC: hripps, ovasik, pbonzini, pmuller, rvokal
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2011-03-30 07:55:41 UTC
Attachments:
Description | Flags
Patch to fix this issue | none
Fix multibyte characters syntax error and possible infinite loop | none
Fix multibyte characters syntax error and possible infinite loop | none

Description john.haxby@oracle.com 2009-10-06 11:40:34 UTC
Created attachment 363816
Patch to fix this issue

Description of problem:

Using a multibyte character that ends with 0x5c (backslash) can cause sed to report syntax errors.


Version-Release number of selected component (if applicable): sed-4.1.5-5


How reproducible:  Always


Steps to Reproduce:
1.  Start with your shell in a UTF-8 locale, e.g. en_US.UTF-8 (you can probably do this in a different locale, but it definitely works if you start in a UTF-8 locale).

2. Run the following commands to construct a shell script that invokes sed:

   U2010=$(echo -ne '\x20\x10' | iconv -f ucs-2be)
   echo "echo '$U2010' | sed 's/$U2010/hyphen/g'" | iconv -t gbk > /tmp/script

3. Run the shell script in a locale that uses the gbk character set:

   LC_ALL=zh_CN.gbk sh /tmp/script 2>&1 | iconv -f gbk
  
Actual results:
   The script reports an error:

    sed:-e 表达式 #1,字符 13:unterminated `s' command

Expected results:

   The single word "hyphen"


Additional info:

The error arises because the character U+2010 (HYPHEN) is encoded as \xa9\x5c in the gbk encoding.  Sed sees the "\x5c" as a backslash escaping the following character which, in this case, is the "/" that we hope is going to terminate the pattern; it doesn't and so we get a syntax error.
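
The failure mode can be sketched outside sed. The following is a minimal illustration in Python, not sed's actual C code. It assumes Python's built-in `gbk` codec agrees with the system's GBK tables for this character, and the helper names `find_delim_bytewise` and `find_delim_multibyte` are hypothetical, chosen only to imitate a parser looking for the closing "/".

```python
# Illustrative only: imitates a parser scanning for an unescaped delimiter.
# The function names are hypothetical and are not taken from sed.

HYPHEN_GBK = "\u2010".encode("gbk")  # U+2010 HYPHEN encoded in GBK


def find_delim_bytewise(data: bytes, delim: bytes = b"/") -> int:
    """Scan byte by byte, treating every 0x5c byte as an escape character."""
    i = 0
    while i < len(data):
        if data[i] == 0x5C:  # looks like a backslash: skip the next byte
            i += 2
            continue
        if data[i:i + 1] == delim:
            return i
        i += 1
    return -1  # delimiter not found


def find_delim_multibyte(data: bytes, encoding: str = "gbk", delim: str = "/") -> int:
    """Decode first, so a trailing 0x5c byte stays part of its character."""
    text = data.decode(encoding)
    i = 0
    while i < len(text):
        if text[i] == "\\":  # a real backslash character
            i += 2
            continue
        if text[i] == delim:
            return i
        i += 1
    return -1


# What follows "s/" in the command s/<U+2010>/hyphen/g
pattern = HYPHEN_GBK + b"/hyphen/g"

print(HYPHEN_GBK.hex())               # bytes of U+2010 in GBK
print(find_delim_bytewise(pattern))   # byte offset where the pattern appears to end
print(find_delim_multibyte(pattern))  # character offset where the pattern really ends
```

With the byte-wise scan, the first "/" is swallowed as an escaped character and the second "/" is taken for the end of the pattern, which leaves the `s` command without a closing delimiter.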

Of course, this is just one character in one encoding; there are likely to be many others.  I have another example for SJIS (U+8868), but SJIS isn't a good encoding to use for reporting bugs :-).
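
To get a rough idea of how many characters are affected, the sketch below lists the BMP code points whose multibyte encoding ends in the byte 0x5c. It assumes Python's built-in `gbk` and `shift_jis` codecs, which may differ in detail from the glibc tables that sed sees, and `trailing_backslash_chars` is a hypothetical helper name.

```python
def trailing_backslash_chars(encoding: str) -> list:
    """Return BMP characters whose multibyte encoding ends in byte 0x5c."""
    hits = []
    for cp in range(0x80, 0x10000):
        if 0xD800 <= cp <= 0xDFFF:  # surrogates are not characters
            continue
        try:
            raw = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            continue  # not representable in this encoding
        if len(raw) > 1 and raw[-1] == 0x5C:
            hits.append(chr(cp))
    return hits


for enc in ("gbk", "shift_jis"):
    chars = trailing_backslash_chars(enc)
    # Print the count only; the lists themselves can be long.
    print(enc, len(chars))
```

Both U+2010 (in GBK) and U+8868 (in SJIS) should appear in the respective lists if the codecs match the encodings described here.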

This bug is alluded to in bug 524837 but it doesn't need a rebase to 4.2 to fix it.  The attached patch is taken from sed 4.2; it does in fact fix this problem and can be applied without needing to rebase.

Comment 1 Paolo Bonzini 2009-10-12 12:35:08 UTC
I confirm that this is the same bug that is fixed by the attached upstream patch (i.e. not a coincidence :-).

Comment 2 john.haxby@oracle.com 2009-11-09 09:38:44 UTC
As bug 524837 has been declined for the current release, can we have this one fixed please?

Comment 5 Paolo Bonzini 2011-02-25 11:32:21 UTC
Upstream patch 20f68fb1abe862a98bc0378e5bb54d94bb98b8fe is also related to multibyte characters in sed.  If it applies to RHEL5 sed, it should be brought in as it can cause an infinite loop.

Comment 6 Vojtech Vitek 2011-03-09 14:54:23 UTC
Created attachment 483233
Fix multibyte characters syntax error and possible infinite loop

Comment 8 john.haxby@oracle.com 2011-03-09 15:16:13 UTC
Does that attachment work?  I haven't tried to compile sed with it, but it contains this:

+#define MBSINIT(s) \
+  (mb_cur_max == 1 ? 1 : mbsinit ((s))
+

There is a missing ")" there
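
A quick count confirms the imbalance. This is only a sketch: `paren_balance` is a hypothetical helper, and the corrected form shown is an assumption (one ")" appended at the end of the macro body), not a quote from the fixed patch.

```python
def paren_balance(text: str) -> int:
    """Return the number of '(' minus the number of ')'; 0 means the counts match."""
    return text.count("(") - text.count(")")


# Macro as quoted from the attachment.
macro_from_patch = "#define MBSINIT(s) \\\n  (mb_cur_max == 1 ? 1 : mbsinit ((s))"

# Assumed correction: one closing parenthesis added at the end.
macro_corrected = "#define MBSINIT(s) \\\n  (mb_cur_max == 1 ? 1 : mbsinit ((s)))"

print(paren_balance(macro_from_patch))
print(paren_balance(macro_corrected))
```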

Comment 9 Paolo Bonzini 2011-03-09 15:20:43 UTC
It might be that the other side of the #if is being used.  This would be a configuration problem for sed.

Comment 10 john.haxby@oracle.com 2011-03-09 15:33:51 UTC
I noticed the missing ")" when I was comparing my original patch to the updated one.  Even if it's not causing actual compilation problems it would be nice if the patch _looked_ right :-)

Comment 12 Vojtech Vitek 2011-03-09 15:42:40 UTC
(In reply to comment #8)
> Does that attachment work?  I haven't tried to compile sed with it, but it
> contains this:
> 
> +#define MBSINIT(s) \
> +  (mb_cur_max == 1 ? 1 : mbsinit ((s))
> +
> 
> There is a missing ")" there

Yes, you are right. It was just a typo in the original commit.
I actually applied a fixed version of the patch to the RHEL CVS tree, because running `make' failed with a syntax error.

Anyway, thanks for the notification.

Comment 13 Vojtech Vitek 2011-03-09 15:45:18 UTC
Created attachment 483248
Fix multibyte characters syntax error and possible infinite loop

Comment 15 errata-xmlrpc 2011-03-30 07:55:41 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0397.html