Bug 527427 - sed reports syntax errors with some multibyte characters
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: sed
Version: 5.4
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Vojtech Vitek
QA Contact: BaseOS QE
Depends On:
Blocks:
Reported: 2009-10-06 07:40 EDT by john.haxby@oracle.com
Modified: 2015-03-04 18:56 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-03-30 03:55:41 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments
Patch to fix this issue (1.48 KB, patch)
2009-10-06 07:40 EDT, john.haxby@oracle.com
Fix multibyte characters syntax error and possible infinite loop (3.29 KB, patch)
2011-03-09 09:54 EST, Vojtech Vitek
Fix multibyte characters syntax error and possible infinite loop (3.29 KB, patch)
2011-03-09 10:45 EST, Vojtech Vitek

Description john.haxby@oracle.com 2009-10-06 07:40:34 EDT
Created attachment 363816
Patch to fix this issue

Description of problem:

Using a multibyte character that ends with 0x5c (backslash) can cause sed to report syntax errors.


Version-Release number of selected component (if applicable): sed-4.1.5-5


How reproducible:  Always


Steps to Reproduce:
1.  Start with your shell in a UTF-8 locale, e.g. en_US.UTF-8 (you can probably do this in a different locale, but it definitely works if you start in a UTF-8 locale).

2. Run the following commands to construct a sed script:

   U2010=$(echo -ne '\x20\x10' | iconv -f ucs-2be)
   echo "echo '$U2010' | sed 's/$U2010/hyphen/g'" | iconv -t gbk > /tmp/script

3. Run the shell script in a locale that uses the gbk character set:

   LC_ALL=zh_CN.gbk sh /tmp/script 2>&1 | iconv -f gbk
  
Actual results:
   The script reports an error:

    sed:-e 表达式 #1,字符 13:unterminated `s' command

Expected results:

   The single word "hyphen"


Additional info:

The error arises because the character U+2010 (HYPHEN) is encoded as \xa9\x5c in the gbk encoding.  Sed sees the "\x5c" as a backslash escaping the following character which, in this case, is the "/" that we hope is going to terminate the pattern; it doesn't and so we get a syntax error.
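The failure mode can be sketched in a few lines of Python. This is an illustration only, not sed's actual parser: the function names `find_delim_bytewise` and `find_delim_multibyte` are hypothetical, and the sketch assumes Python's built-in `gbk` codec agrees with the encoding used in the report. The first function scans raw bytes and treats every 0x5c as an escape; the second decodes the script first, so a 0x5c trail byte is never seen as a backslash.

```python
# Illustrative sketch only; this is NOT how sed is implemented.

def find_delim_bytewise(script: bytes, start: int, delim: int = ord("/")) -> int:
    """Scan byte by byte, treating every 0x5c as a backslash escape.

    This mimics the buggy behaviour: the trail byte of a multibyte
    character is mistaken for an escape and the next byte is skipped.
    """
    i = start
    while i < len(script):
        if script[i] == 0x5C:       # looks like a backslash: skip next byte
            i += 2
            continue
        if script[i] == delim:
            return i
        i += 1
    return -1


def find_delim_multibyte(script: bytes, start: int, encoding: str,
                         delim: str = "/") -> int:
    """Decode first, then scan characters, and return a byte offset."""
    text = script[start:].decode(encoding)
    i = 0
    while i < len(text):
        if text[i] == "\\":         # a real backslash character
            i += 2
            continue
        if text[i] == delim:
            # Convert the character index back into a byte offset.
            return start + len(text[:i].encode(encoding))
        i += 1
    return -1


hyphen = "\u2010".encode("gbk")              # the two bytes 0xa9 0x5c
script = b"s/" + hyphen + b"/hyphen/g"       # the sed expression from the report

naive = find_delim_bytewise(script, 2)       # skips the intended delimiter
aware = find_delim_multibyte(script, 2, "gbk")
print(hyphen.hex(), naive, aware)
```

In the byte-wise scan the pattern appears to end at the slash after "hyphen", which leaves only "g" as the replacement with no closing delimiter. That is consistent with the "unterminated `s' command" error shown above.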

Of course, this is just one character in one encoding, and there are likely to be many others.  I have another example for SJIS (U+8868), but SJIS isn't a good encoding to use for reporting bugs :-).
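As a rough way to see how widespread the problem is, the following Python sketch lists the characters whose multibyte encoding ends in 0x5c. The helper name `chars_with_backslash_trail` is hypothetical, and the results depend on Python's `gbk` and `shift_jis` codec tables, which may differ in detail from the iconv tables on a given system.

```python
def chars_with_backslash_trail(encoding: str, codepoints=range(0x80, 0x10000)):
    """Return characters whose multibyte encoding ends in byte 0x5c."""
    hits = []
    for cp in codepoints:
        ch = chr(cp)
        try:
            raw = ch.encode(encoding)
        except UnicodeEncodeError:
            continue                 # not representable in this encoding
        if len(raw) > 1 and raw[-1] == 0x5C:
            hits.append(ch)
    return hits


gbk_hits = chars_with_backslash_trail("gbk")
sjis_hits = chars_with_backslash_trail("shift_jis")

# Both characters mentioned in this report should show up.
print("U+2010 affected in GBK: ", "\u2010" in gbk_hits)
print("U+8868 affected in SJIS:", "\u8868" in sjis_hits)
print(len(gbk_hits), "GBK characters,", len(sjis_hits), "Shift_JIS characters")
```

Any character in these lists would trigger the same misparse if it appeared directly before a delimiter in a sed script.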

This bug is alluded to in bug 524837 but it doesn't need a rebase to 4.2 to fix it.  The attached patch is taken from sed 4.2 and does in fact fix this problem and can be applied without needing to rebase.
Comment 1 Paolo Bonzini 2009-10-12 08:35:08 EDT
I confirm that this is the same bug that is fixed by the attached upstream patch (i.e. not a coincidence :-).
Comment 2 john.haxby@oracle.com 2009-11-09 04:38:44 EST
As bug 524837 has been declined for the current release, can we have this one fixed please?
Comment 5 Paolo Bonzini 2011-02-25 06:32:21 EST
Upstream patch 20f68fb1abe862a98bc0378e5bb54d94bb98b8fe is also related to multibyte characters in sed.  If it applies to RHEL5 sed, it should be brought in as it can cause an infinite loop.
Comment 6 Vojtech Vitek 2011-03-09 09:54:23 EST
Created attachment 483233
Fix multibyte characters syntax error and possible infinite loop
Comment 8 john.haxby@oracle.com 2011-03-09 10:16:13 EST
Does that attachment work?  I haven't tried to compile sed with it, but it contains this:

+#define MBSINIT(s) \
+  (mb_cur_max == 1 ? 1 : mbsinit ((s))
+

There is a missing ")" there
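A quick way to confirm the imbalance without compiling is to count parentheses in the macro body. The Python sketch below does that; the helper name `paren_depth` is hypothetical, and the "fixed" string is only an assumption about what the corrected macro looks like (one extra closing parenthesis at the end), not text taken from the patch.

```python
def paren_depth(text: str) -> int:
    """Return the number of unclosed '(' (negative if there are extra ')')."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
    return depth


# Macro body exactly as quoted from the attachment.
quoted = "(mb_cur_max == 1 ? 1 : mbsinit ((s))"
# Assumed correction: one more ')' appended. Not taken from the patch itself.
assumed_fix = quoted + ")"

print(paren_depth(quoted), paren_depth(assumed_fix))
```

A non-zero result for the quoted text matches the observation that one ")" is missing.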
Comment 9 Paolo Bonzini 2011-03-09 10:20:43 EST
It might be that the other side of the #if is being used.  This would be a configuration problem for sed.
Comment 10 john.haxby@oracle.com 2011-03-09 10:33:51 EST
I noticed the missing ")" when I was comparing my original patch to the updated one.  Even if it's not causing actual compilation problems it would be nice if the patch _looked_ right :-)
Comment 12 Vojtech Vitek 2011-03-09 10:42:40 EST
(In reply to comment #8)
> Does that attachment work?  I haven't tried to compile sed with it, but it
> contains this:
> 
> +#define MBSINIT(s) \
> +  (mb_cur_max == 1 ? 1 : mbsinit ((s))
> +
> 
> There is a missing ")" there

Yes, you are right. It was just a typo in the original commit.
I have actually applied a fixed version of the patch to the RHEL CVS tree, because running `make' failed with a syntax error.

Anyway, thanks for the notification.
Comment 13 Vojtech Vitek 2011-03-09 10:45:18 EST
Created attachment 483248
Fix multibyte characters syntax error and possible infinite loop
Comment 15 errata-xmlrpc 2011-03-30 03:55:41 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0397.html
