Bug 683519 - gawk regular expressions when using '-' in alpha range will match out of the given range
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: gawk
Version: 5.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: rc
Assignee: Vojtech Vitek
QA Contact: BaseOS QE - Apps
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-03-09 16:10 UTC by Mel
Modified: 2015-03-04 23:57 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-04-28 08:21:16 UTC
Target Upstream Version:



Description Mel 2011-03-09 16:10:17 UTC
Description of problem:
In gawk regular expressions, an alphabetic range written with '-' matches characters outside the given range. For example, [A-Z] matches A-Z and also b-z.

Version-Release number of selected component (if applicable):
gawk-3.1.5-14
bug also exists in gawk-3.1.1-9

How reproducible:
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'|
awk '{gsub(/[A-Z]/,".");print $0}'
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'|
awk '{gsub(/[a-z]/,".");print $0}'
cat << MEL >foo
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
MEL
awk '/[A-Z]/{print "UPPER: " $0}' foo
awk '/[a-z]/{print "lower: " $0}' foo

Steps to Reproduce:
1. Use a regular expression range with '-' in gawk to match only uppercase or only lowercase letters (e.g. '[A-Z]', '[a-z]', or '[b-d]').
Actual results:
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'|
awk '{gsub(/[A-Z]/,".");print $0}'
..........................a.........................
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'|
awk '{gsub(/[a-z]/,".");print $0}'
.........................Z..........................


Expected results:
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'|
awk '{gsub(/[A-Z]/,".");print $0}'
..........................abcdefghijklmnopqrstuvwxyz
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'|
awk '{gsub(/[a-z]/,".");print $0}'
ABCDEFGHIJKLMNOPQRSTUVWXYZ..........................

Additional info:
'[A-Z]' acts as if it is '[A-Zb-z]'
'[a-z]' acts as if it is '[A-Ya-z]'
'[BC]' works, but '[B-C]' acts as if it is '[B-Cc]'
This has been broken for a while. I tested gawk-3.1.1-9, and it has the same problem.
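
One way to see exactly which letters a given range picks up is to test each letter on its own. The following is a minimal sketch, assuming a POSIX shell and any awk on the PATH; matched_letters is a hypothetical helper name, not part of gawk. Only the C-locale result is fixed; the other call depends on the locale in effect.

```shell
# Print the ASCII letters that match the bracket expression given as $1,
# using whatever locale is in effect when the function is called.
matched_letters() {
    printf '%s\n' 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' |
    awk -v re="$1" '{
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (c ~ re) out = out c     # keep letters the range accepts
        }
        print out
    }'
}

# Current locale: the result depends on its collation rules.
matched_letters '[B-C]'

# C locale, set only inside the subshell: prints BC
( export LC_ALL=C; matched_letters '[B-C]' )
```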

Comment 1 Mel 2011-03-11 15:00:55 UTC
Fixed with
export LANG=C
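
Note that LANG has the lowest precedence of the locale variables: if LC_ALL or LC_COLLATE is already set, exporting LANG=C alone does not change the collation order. The sketch below mirrors that precedence (LC_ALL, then LC_COLLATE, then LANG). effective_collate is a hypothetical helper, and the fallback to "C" when nothing is set is an assumption, since the default locale is implementation-defined.

```shell
# Report which setting decides the collation order, following the
# precedence LC_ALL > LC_COLLATE > LANG.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf '%s\n' C    # assumption: treat "nothing set" as the C locale
    fi
}

effective_collate
```

The locale command reports the values the system actually uses, so it is the better tool for a real check; the function only illustrates the lookup order.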

Comment 2 Vojtech Vitek 2011-04-28 08:21:16 UTC
This is not a bug but the expected behaviour in non-C locales, which use a different collation order.

(In reply to comment #0)
> Additional info:
> '[A-Z]' acts as if it is '[A-Zb-z]'
> '[a-z]' acts as if it is '[A-Ya-z]'
> '[BC]' works, but '[B-C]' acts as if it is '[B-Cc]'
Note that a locale's collation order can be something like "aAbBcCdD..", so the range [A-Z] can expand to [AbBcCdD..zZ] (every letter except "a") and [B-C] can expand to [BcC], which matches the results reported above.
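
The collation order of a locale can be inspected with sort, which orders lines according to LC_COLLATE. This is a minimal sketch, assuming a POSIX shell; the letters helper is illustrative only. It gives a hint about how the affected gawk versions order the endpoints of a range, not a guarantee. Only the C-locale result is fixed.

```shell
# Emit a few mixed-case letters, one per line.
letters() { printf '%s\n' B a C b A c; }

# C locale: plain byte order, so uppercase letters precede lowercase ones.
c_order=$(letters | LC_ALL=C sort | tr -d '\n')
echo "C locale:       $c_order"     # ABCabc

# Current locale: the order depends on its collation rules.
cur_order=$(letters | sort | tr -d '\n')
echo "current locale: $cur_order"
```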


This behaviour was coincidentally discussed upstream a few days ago; see:
http://lists.gnu.org/archive/html/bug-gnu-utils/2011-04/msg00021.html

Quoting Aharon Robbins (Wed, 27 Apr 2011 21:48:41 +0300):
> I do agree that the behavior is surprising, disconcerting, undesirable,
> and so on.  For this reason, the upcoming version of gawk translates
> ranges of the form [d-h] into '[defgh]' before compiling the regular
> expression.
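
The idea of rewriting a range as an explicit list can be sketched in shell. This is only an illustration of the technique, not gawk's actual implementation: expand_range is a hypothetical helper that assumes single-byte ASCII endpoints and enumerates them by character code under the C locale.

```shell
# Expand an ASCII range into an explicit bracket expression.
# $1 = first character, $2 = last character.
expand_range() {
    LC_ALL=C awk -v lo="$1" -v hi="$2" 'BEGIN {
        # Build a character -> code table for 7-bit ASCII.
        for (i = 1; i < 128; i++) ord[sprintf("%c", i)] = i
        out = ""
        for (i = ord[lo]; i <= ord[hi]; i++) out = out sprintf("%c", i)
        print "[" out "]"
    }'
}

expand_range d h    # prints [defgh]

# The explicit list no longer depends on the collation order.
class=$(expand_range d h)
echo 'abcdefghijDEFGH' | awk -v re="$class" '{ gsub(re, "."); print }'
# prints abc.....ijDEFGH
```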


Advice:
To avoid depending on the locale's collation order, you can use the character classes [[:upper:]] or [[:lower:]] for whole-alphabet ranges, list the characters explicitly, as in [CDEFG], or set LC_ALL=C as a quick workaround.
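
The three options can be compared on the input line from the original report. This is a sketch, assuming a POSIX shell and an awk that supports POSIX character classes (gawk does); the variable names are illustrative.

```shell
# Input line used in the original report.
s='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

# 1. Force the C locale for this one command only.
by_locale=$(printf '%s\n' "$s" | LC_ALL=C awk '{ gsub(/[A-Z]/, "."); print }')

# 2. Use a POSIX character class instead of a range.
by_class=$(printf '%s\n' "$s" | awk '{ gsub(/[[:upper:]]/, "."); print }')

# 3. List the characters explicitly.
by_list=$(printf '%s\n' "$s" |
    awk '{ gsub(/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/, "."); print }')

# Each line should show 26 dots followed by the lowercase alphabet.
printf '%s\n' "$by_locale" "$by_class" "$by_list"
```

Setting LC_ALL=C on the single command, rather than exporting it, leaves the locale of the rest of the session untouched.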


Closing as NOTABUG.

