Bug 683519

Summary: gawk regular expressions when using '-' in alpha range will match out of the given range
Product: Red Hat Enterprise Linux 5
Component: gawk
Version: 5.5
Status: CLOSED NOTABUG
Severity: low
Priority: unspecified
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Reporter: Mel <mel>
Assignee: Vojtech Vitek <vvitek>
QA Contact: BaseOS QE - Apps <qe-baseos-apps>
CC: hripps
Doc Type: Bug Fix
Last Closed: 2011-04-28 08:21:16 UTC

Description Mel 2011-03-09 16:10:17 UTC
Description of problem:
When a gawk regular expression uses '-' in an alphabetic range, it matches characters outside the given range. For example, [A-Z] matches both [A-Z] and [b-z].

Version-Release number of selected component (if applicable):
gawk-3.1.5-14
bug also exists in gawk-3.1.1-9

How reproducible:
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'|
awk '{gsub(/[A-Z]/,".");print $0}'
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'|
awk '{gsub(/[a-z]/,".");print $0}'
cat << MEL >foo
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
MEL
awk '/[A-Z]/{print "UPPER: " $0}' foo
awk '/[a-z]/{print "lower: " $0}' foo

Steps to Reproduce:
1. Use a regular expression range with '-' in gawk to match only upper-case or only lower-case letters (e.g. '[A-Z]', '[a-z]', or '[b-d]').
Actual results:
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'|
awk '{gsub(/[A-Z]/,".");print $0}'
..........................a.........................
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'|
awk '{gsub(/[a-z]/,".");print $0}'
.........................Z..........................


Expected results:
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'|
awk '{gsub(/[A-Z]/,".");print $0}'
..........................abcdefghijklmnopqrstuvwxyz
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'|
awk '{gsub(/[a-z]/,".");print $0}'
ABCDEFGHIJKLMNOPQRSTUVWXYZ..........................

Additional info:
'[A-Z]' acts as if it is '[A-Zb-z]'
'[a-z]' acts as if it is '[A-Ya-z]'
'[BC]' works, but '[B-C]' acts as if it is '[B-Cc]'
This has been broken for a while. I tested gawk-3.1.1-9, and it has the same problem.

Comment 1 Mel 2011-03-11 15:00:55 UTC
Fixed by setting the C locale:
export LANG=C

Comment 2 Vojtech Vitek 2011-04-28 08:21:16 UTC
This is not a bug but the expected behaviour in non-C locales, which use a different collation order.

(In reply to comment #0)
> Additional info:
> '[A-Z]' acts as if it is '[A-Zb-z]'
> '[a-z]' acts as if it is '[A-Ya-z]'
> '[BC]' works, but '[B-C]' acts as if it is '[B-Cc]'
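Note that a locale's collation order can interleave upper and lower case, for example "aAbBcC..zZ". In that order the range [A-Z] covers every letter except 'a' (that is, [A-Zb-z]), and [B-C] covers 'B', 'c' and 'C', which matches the results reported above.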
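
One way to see which order a locale uses is to sort a few mixed-case letters, since sort(1) also follows LC_COLLATE. This is only a rough sketch: the sample letters are arbitrary, and it assumes the range handling in the regex code follows the same collation order as sort, which may not hold for every version.

```shell
#!/bin/sh
letters='B a C A c b'

# Order under whatever locale is currently in effect (output varies by system).
printf '%s\n' $letters | sort | tr -d '\n'; echo

# Order under the C locale, which compares by byte value.
c_order=$(printf '%s\n' $letters | LC_ALL=C sort | tr -d '\n')
echo "$c_order"
```

If the first line shows the cases interleaved while the second shows all upper-case letters before all lower-case ones, ranges such as [A-Z] can be expected to behave differently in the two locales.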


This behaviour was coincidentally discussed upstream a few days ago, see:
http://lists.gnu.org/archive/html/bug-gnu-utils/2011-04/msg00021.html

Quoting Aharon Robbins (Wed, 27 Apr 2011 21:48:41 +0300):
> I do agree that the behavior is suprising, disconcerting, undesirable,
> and so on.  For this reason, the upcoming version of gawk translates
> ranges of the form [d-h] into '[defgh]' before compiling the regular
> expression.
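
The idea of turning a range into an explicit list can be illustrated with a small helper. This is not gawk's implementation, only a sketch of the concept; expand_range is a hypothetical name, and it assumes both endpoints are single printable ASCII characters.

```shell
#!/bin/sh
# expand_range FIRST LAST: print a bracket expression listing every
# character from FIRST to LAST in ASCII order (hypothetical helper).
expand_range() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN {
        # Build a character-to-code table for printable ASCII.
        for (i = 32; i < 127; i++) ord[sprintf("%c", i)] = i
        s = ""
        for (i = ord[a]; i <= ord[b]; i++) s = s sprintf("%c", i)
        print "[" s "]"
    }'
}

expand_range d h
```

Because the resulting list names each character individually, it no longer depends on the locale's collation order.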


Advice:
To avoid locale-dependent ranges, use the character classes [[:upper:]] or [[:lower:]] for whole alphabets, list the characters explicitly (for example [CDEFG] instead of [C-G]), or set LC_ALL=C as a quick workaround.
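
The three options can be tried side by side on the input string from the report. This is a minimal sketch; the variable names are illustrative, and it assumes the installed awk supports POSIX character classes (gawk does).

```shell
#!/bin/sh
s='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

# 1. POSIX character class: upper-case letters as the locale defines them.
by_class=$(echo "$s" | awk '{gsub(/[[:upper:]]/, "."); print}')

# 2. Explicit list instead of the range [C-G].
by_list=$(echo "$s" | awk '{gsub(/[CDEFG]/, "."); print}')

# 3. C locale for this one command only; the rest of the shell is unaffected.
by_locale=$(echo "$s" | LC_ALL=C awk '{gsub(/[A-Z]/, "."); print}')

printf '%s\n' "$by_class" "$by_list" "$by_locale"
```

Setting LC_ALL on the command line limits the change to that single awk invocation, whereas an exported LANG or LC_ALL affects every later command in the session.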


Closing as NOTABUG.