Description of problem:
When a gawk regular expression uses '-' in an alphabetic range, it matches characters outside the given range. For example, [A-Z] matches both [A-Z] and [b-z].

Version-Release number of selected component (if applicable):
gawk-3.1.5-14 (the bug also exists in gawk-3.1.1-9)

How reproducible:
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'| awk '{gsub(/[A-Z]/,".");print $0}'
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'| awk '{gsub(/[a-z]/,".");print $0}'

cat << MEL >foo
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
MEL
awk '/[A-Z]/{print "UPPER: " $0}' foo
awk '/[a-z]/{print "lower: " $0}' foo

Steps to Reproduce:
1. Use a regular expression range with '-' in gawk to match only upper-case or only lower-case letters (e.g. '[A-Z]' or '[a-z]' or '[b-d]').

Actual results:
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'| awk '{gsub(/[A-Z]/,".");print $0}'
..........................a.........................
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'| awk '{gsub(/[a-z]/,".");print $0}'
.........................Z..........................

Expected results:
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'| awk '{gsub(/[A-Z]/,".");print $0}'
..........................abcdefghijklmnopqrstuvwxyz
echo 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'| awk '{gsub(/[a-z]/,".");print $0}'
ABCDEFGHIJKLMNOPQRSTUVWXYZ..........................

Additional info:
'[A-Z]' acts as if it is '[A-Zb-z]'
'[a-z]' acts as if it is '[A-Ya-z]'
'[BC]' works, but '[B-C]' acts as if it is '[B-Cc]'

This has been broken for a while. I tested gawk-3.1.1-9, and it has the same problem.
Worked around with: export LANG=C
This is not a bug but the expected behaviour of non-C locales, specifically their different collation order.

(In reply to comment #0)
> Additional info:
> '[A-Z]' acts as if it is '[A-Zb-z]'
> '[a-z]' acts as if it is '[A-Ya-z]'
> '[BC]' works, but '[B-C]' acts as if it is '[B-Cc]'

Note that a locale's collation order can be something like "AaBbCcDd..", so the range [A-Z] can expand to [AaBbCcDd..] and [B-C] can expand to [BbC]. The results quoted above are consistent with the order "aAbBcC..zZ": there [A-Z] covers every letter except 'a', [a-z] covers every letter except 'Z', and [B-C] covers 'B', 'c' and 'C'.

This behaviour was coincidentally discussed upstream a few days ago, see:
http://lists.gnu.org/archive/html/bug-gnu-utils/2011-04/msg00021.html

Citation of Aharon Robbins (Wed, 27 Apr 2011 21:48:41 +0300):
> I do agree that the behavior is suprising, disconcerting, undesirable,
> and so on. For this reason, the upcoming version of gawk translates
> ranges of the form [d-h] into '[defgh]' before compiling the regular
> expression.

Advice: To avoid locale-dependent ranges, you can use [[:upper:]] or [[:lower:]] for entire ranges, list the characters explicitly, such as [CDEFG], or set LC_ALL=C as a quick workaround.

Closing as NOTABUG.
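
The advice above can be sketched as a short shell session. This is only an illustration, not a definitive recipe: the variable names (s, upper_c, lower_c, upper_class) are made up for the example, and the [[:upper:]] variant assumes an awk that supports POSIX character classes, as gawk does.

```shell
#!/bin/sh
# Test string from comment #0: upper-case alphabet followed by lower-case.
s='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

# Workaround 1: force the C locale for this one command only, so that
# [A-Z] and [a-z] follow plain byte (ASCII) order.
upper_c=$(printf '%s\n' "$s" | LC_ALL=C awk '{ gsub(/[A-Z]/, "."); print }')
lower_c=$(printf '%s\n' "$s" | LC_ALL=C awk '{ gsub(/[a-z]/, "."); print }')

# Workaround 2: keep the current locale but use a character class
# instead of a range (assumes POSIX character class support in awk).
upper_class=$(printf '%s\n' "$s" | awk '{ gsub(/[[:upper:]]/, "."); print }')

printf '%s\n' "$upper_c" "$lower_c" "$upper_class"
```

Setting LC_ALL on the command itself, rather than exporting it, leaves the locale of the rest of the session untouched. LC_ALL also takes precedence over LANG and LC_COLLATE, so it still takes effect when those are set to something else.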