Bug 1250830

Summary: awk matches lowercase when searching for uppercase range
Product: Red Hat Enterprise Linux 6 Reporter: Scott Rochford <scott.rochford>
Component: gawk Assignee: David Kaspar // Dee'Kej <deekej>
Status: CLOSED CURRENTRELEASE QA Contact: Jan Kepler <jkejda>
Severity: high Docs Contact:
Priority: urgent    
Version: 6.6 CC: dkutalek, fkrska, jkejda, jkurik, kdudka, mkolaja, msaxena, ohudlick, praiskup, thozza, zpytela
Target Milestone: rc Keywords: Patch, Reproducer, ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: gawk-3.1.7-11.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1271701 (view as bug list) Environment:
Last Closed: 2016-06-17 13:31:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1254457, 1271443, 1271701    
Attachments:
Description Flags
proposed fix (backport) none
proposed fix (backport) v2 kdudka: review+
proposed fix (backport) v3 none

Description Scott Rochford 2015-08-06 06:18:56 UTC
Description of problem:

When specifying an uppercase range '[A-Z]', lowercase characters are matched.  It appears to use a character sequence of aAbBcC...zZ, since it matches 'b' but does not match 'a'.

I am aware of the '[[:upper:]]' character class workaround; however, the '[A-Z]' range is a convention that has been used and will continue to be used for quite some time.
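
For reference, a minimal sketch of the two alternatives that should not depend on the locale's collation order: the POSIX character class mentioned above, and forcing the C locale for the awk invocation. The helper names `has_upper` and `has_upper_c_locale` are hypothetical and exist only for this illustration.

```shell
# has_upper: hypothetical helper, prints its argument only if it contains
# an uppercase letter. [[:upper:]] is a POSIX character class, so it does
# not rely on how the locale orders the characters between 'A' and 'Z'.
has_upper() {
    printf '%s\n' "$1" | awk '/[[:upper:]]/'
}

# has_upper_c_locale: hypothetical helper, same check using the [A-Z] range.
# With LC_ALL=C the range is evaluated by byte value, so only the ASCII
# letters A through Z fall inside it.
has_upper_c_locale() {
    printf '%s\n' "$1" | LC_ALL=C awk '/[A-Z]/'
}

has_upper b             # should print nothing
has_upper_c_locale b    # should print nothing
has_upper_c_locale B    # should print B
```

Note that forcing the C locale also changes how non-ASCII input is handled, so it is not a drop-in replacement for scripts that process UTF-8 text.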

A similar problem was reported with 'grep': 

https://bugzilla.redhat.com/show_bug.cgi?id=1171806



Version-Release number of selected component (if applicable):

GNU Awk 3.1.7


How reproducible: 100%

Steps to Reproduce:
1. # export LANG=en_US.UTF-8
2. # echo b | awk '/[A-Z]/'


Actual results:

b


Expected results:

(nothing)
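
The reproducer above can be wrapped in a small check that runs the same pattern under more than one locale. This is only a sketch: `check_range` is a hypothetical helper name, it uses LC_ALL rather than LANG so that an LC_ALL already present in the environment cannot mask the setting, and it assumes the named locale is installed (if it is not, awk typically falls back to the C locale and reports no match).

```shell
# check_range: hypothetical helper. Reports whether /[A-Z]/ matches a
# lowercase 'b' under the locale passed as the first argument.
check_range() {
    if [ -n "$(echo b | LC_ALL="$1" awk '/[A-Z]/' 2>/dev/null)" ]; then
        echo "$1: lowercase matched (bug present)"
    else
        echo "$1: no match (expected)"
    fi
}

check_range C
check_range en_US.UTF-8
```

On an affected gawk build the second call would be expected to report the match; on a fixed build both calls should report no match.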

Additional info:

Comment 2 Scott Rochford 2015-08-06 06:36:23 UTC
Forgot release number: gawk-3.1.7-10.el6.x86_64

Comment 3 Zdenek Pytela 2015-09-21 06:46:39 UTC
Reproduced on:
RHEL 5: gawk-3.1.5-16.el5
RHEL 6: gawk-3.1.7-10.el6.x86_64

not RHEL 7 with gawk-4.0.2-4.el7.x86_64

Comment 4 David Kaspar // Dee'Kej 2015-09-21 10:08:58 UTC
I have also reproduced this behaviour on RHEL 6. RHEL 7 uses a newer version, which underwent a major source code rebase, and the issue is already fixed there.

I have identified an upstream commit that fixes the issue. I'm currently working on extracting the necessary patch from it and porting it back.

Once I have it, I will post it here for review.

Comment 9 David Kaspar // Dee'Kej 2015-09-24 12:18:10 UTC
Created attachment 1076500
proposed fix (backport)

Comment 11 Kamil Dudka 2015-09-24 17:33:19 UTC
Comment on attachment 1076500
proposed fix (backport)

Why are the newly introduced buf/buf_len variables static?  It makes no sense without the following changes:

+   if (buf == NULL) {
+       emalloc(buf, char *, len + 2, "make_regexp");
+       buflen = len;
+   } else if (len > buflen) {
+       erealloc(buf, char *, len + 2, "make_regexp");
+       buflen = len;
+   }
+   dest = buf;

[...]

-
-   free(temp);

Please consider also backporting the related parts of the following upstream commits that fix bugs in the code introduced by this patch:

http://git.savannah.gnu.org/cgit/gawk.git/commit/?id=00d79709
http://git.savannah.gnu.org/cgit/gawk.git/commit/?id=2c126c49

Comment 12 David Kaspar // Dee'Kej 2015-09-29 13:47:07 UTC
Created attachment 1078348
proposed fix (backport) v2

Kamil, thank you for your review and valuable info about the additional fixes.

Here is the (hopefully) final version of this backport. Could you please check that the changes are sane?

Thank you.

Comment 15 Kamil Dudka 2015-09-29 20:27:17 UTC
Comment on attachment 1078348
proposed fix (backport) v2

Thanks for the update!  It looks better now.  But please revert the unnecessary re-indentation.  The previous patch used tabs for indentation (as the upstream code does).  The current version of the patch uses spaces for indentation where the upstream patches use tabs for indentation.  This makes the changes difficult to follow.
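
One way to spot this kind of re-indentation before posting a patch is to list the added lines whose indentation starts with a space. The sketch below assumes a unified diff on standard input; `space_indented_additions` is a hypothetical helper name, not part of any existing tool.

```shell
# space_indented_additions: hypothetical helper. Reads a unified diff on
# stdin and prints, with line numbers, every added line ('+' prefix) whose
# content begins with a space instead of a tab. The '+++' file header is
# not reported because its second character is '+', not a space.
space_indented_additions() {
    grep -n '^[+] ' || true   # "|| true": finding no such lines is not an error
}

# Example input: the first added line is tab-indented, the second uses spaces.
printf '+\tif (x)\n+    y = 1;\n context\n' | space_indented_additions
```

Lines that are legitimately indented with spaces upstream would also be reported, so the output is a list of candidates to compare against the upstream patch rather than a list of errors.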

Comment 16 David Kaspar // Dee'Kej 2015-09-30 13:39:34 UTC
Created attachment 1078656
proposed fix (backport) v3

Ah, I didn't notice that. (It turned out I had some residual configuration in Git, which was transforming tabs to hardtabs, and in Vim, which was ignoring any indentation differences.) Thank you for your feedback.

Here's the same patch with just the indentation fixed to correspond more accurately with the upstream version.

Comment 17 Kamil Dudka 2015-10-01 08:36:20 UTC
Comment on attachment 1078656
proposed fix (backport) v3

No behavioral changes, keeping r+.  As a side note, this hunk is unnecessary:

> @@ -239,7 +266,7 @@ research(Regexp *rp, register char *str, int start,
>       if (rp->dfa && ! no_bol && ! need_start) {
>               char save;
>               int count = 0;
> -             /*
> +             /*
>                * dfa likes to stick a '\n' right after the matched
>                * text.  So we just save and restore the character.
>                */
> @@ -288,7 +315,7 @@ refree(Regexp *rp)
>               dfafree(&(rp->dfareg));
>       free(rp);
>  }
> - 
> +
>  /* dfaerror --- print an error message for the dfa routines */
>  
>  void

Comment 18 Kamil Dudka 2015-10-01 08:44:38 UTC
Note I was not able to actually set the review+ flag because of bug #1265524.

Comment 19 David Kaspar // Dee'Kej 2015-10-01 09:28:41 UTC
I'm setting this bug as a blocker for RHEL-6.8 so that it is evaluated by Product Management.