Bug 1250830 - awk matches lowercase when searching for uppercase range
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: gawk
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: David Kaspar [Dee'Kej]
QA Contact: Jan Kepler
Keywords: Patch, Reproducer, ZStream
Depends On:
Blocks: 1254457 1271443 1271701
Reported: 2015-08-06 02:18 EDT by Scott Rochford
Modified: 2016-06-17 09:31 EDT
CC List: 11 users

See Also:
Fixed In Version: gawk-3.1.7-11.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1271701
Last Closed: 2016-06-17 09:31:59 EDT
Type: Bug

Attachments
proposed fix (backport) (11.67 KB, patch)
2015-09-24 08:18 EDT, David Kaspar [Dee'Kej]
no flags
proposed fix (backport) v2 (16.81 KB, application/mbox)
2015-09-29 09:47 EDT, David Kaspar [Dee'Kej]
kdudka: review+
proposed fix (backport) v3 (16.98 KB, patch)
2015-09-30 09:39 EDT, David Kaspar [Dee'Kej]
no flags

Description Scott Rochford 2015-08-06 02:18:56 EDT
Description of problem:

When specifying an uppercase range '[A-Z]', lowercase characters are matched.  It appears to use a character sequence of aAbBcC...zZ, since it matches 'b' but does not match 'a'.
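
The suspected ordering can be illustrated with a small sketch. It does not query the real locale: the `ORDER` string and the `pos` and `in_range` helpers are hypothetical names that simply hard-code the aAbBcC...zZ sequence guessed at above, so the output shows what such an ordering would imply, not what gawk does internally.

```shell
#!/bin/sh
# ASSUMPTION: an interleaved collation order, hard-coded for illustration only.
ORDER='aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ'

# pos CHAR: print the 0-based position of CHAR in ORDER
# (prints the length of ORDER if CHAR is not present).
pos() {
    prefix=${ORDER%%"$1"*}
    echo "${#prefix}"
}

# in_range CHAR LOW HIGH: succeed if CHAR sorts between LOW and HIGH in ORDER.
in_range() {
    [ "$(pos "$1")" -ge "$(pos "$2")" ] && [ "$(pos "$1")" -le "$(pos "$3")" ]
}

for ch in a b z A Z; do
    if in_range "$ch" A Z; then
        echo "$ch: inside [A-Z]"
    else
        echo "$ch: outside [A-Z]"
    fi
done
```

Under this assumed ordering only 'a' falls before 'A', so it is the one lowercase letter left outside the range, which is consistent with the observation that 'b' matches and 'a' does not.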

I am aware of the '[[:upper:]]' character class workaround; however, '[A-Z]' is a convention that has been used, and will continue to be used, for quite some time.

A similar problem was reported with 'grep': 


Version-Release number of selected component (if applicable):

GNU Awk 3.1.7

How reproducible: 100%

Steps to Reproduce:
1. # export LANG=en_US.UTF-8
2. # echo b | awk '/[A-Z]/'

Actual results:


Expected results:


Additional info:
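
A minimal script along the lines of the steps above, for comparing the reported case with two ways of avoiding it. The `check` helper is a hypothetical name introduced here, and `LC_ALL` is used instead of `LANG` only because it overrides every other locale variable. It assumes the en_US.UTF-8 locale is installed; if it is not, awk falls back to the C locale and the problem will not show.

```shell
#!/bin/sh
# check LOCALE REGEX TEXT: report whether awk matches TEXT against REGEX
# when run under LOCALE. (Hypothetical helper, not part of gawk.)
check() {
    out=$(printf '%s\n' "$3" | LC_ALL="$1" awk "/$2/")
    if [ -n "$out" ]; then
        echo "LC_ALL=$1 /$2/ matches '$3'"
    else
        echo "LC_ALL=$1 /$2/ does not match '$3'"
    fi
}

# The reported case: affected builds (e.g. gawk-3.1.7-10.el6) match here.
check en_US.UTF-8 '[A-Z]' b

# C locale: ranges follow byte order, so only uppercase ASCII matches.
check C '[A-Z]' b
check C '[A-Z]' B

# Character class workaround; needs an awk with POSIX class support.
check en_US.UTF-8 '[[:upper:]]' b
```

On a fixed gawk the first call should report no match, the same as the C locale call.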
Comment 2 Scott Rochford 2015-08-06 02:36:23 EDT
Forgot release number: gawk-3.1.7-10.el6.x86_64
Comment 3 Zdenek Pytela 2015-09-21 02:46:39 EDT
Reproduced on:
RHEL 5: gawk-3.1.5-16.el5
RHEL 6: gawk-3.1.7-10.el6.x86_64

Not reproduced on RHEL 7 with gawk-4.0.2-4.el7.x86_64
Comment 4 David Kaspar [Dee'Kej] 2015-09-21 06:08:58 EDT
I have also reproduced this behaviour on RHEL 6. RHEL 7 ships a newer version, which went through a major source code rebase, and the issue is already fixed there.

I have identified the upstream commits that fix the issue. I'm currently working on extracting the necessary patch from them and backporting it.

Once I have it, I will post it here for review.
Comment 9 David Kaspar [Dee'Kej] 2015-09-24 08:18 EDT
Created attachment 1076500
proposed fix (backport)
Comment 11 Kamil Dudka 2015-09-24 13:33:19 EDT
Comment on attachment 1076500
proposed fix (backport)

Why are the newly introduced buf/buflen variables static?  It makes no sense without the following changes:

+   if (buf == NULL) {
+       emalloc(buf, char *, len + 2, "make_regexp");
+       buflen = len;
+   } else if (len > buflen) {
+       erealloc(buf, char *, len + 2, "make_regexp");
+       buflen = len;
+   }
+   dest = buf;


-   free(temp);

Please consider also backporting the related parts of the following upstream commits that fix bugs in the code introduced by this patch:

Comment 12 David Kaspar [Dee'Kej] 2015-09-29 09:47 EDT
Created attachment 1078348
proposed fix (backport) v2

Kamil, thank you for your review and valuable info about the additional fixes.

Here is the (hopefully) final version of this backport. Could you please check that the changes are sane?

Thank you.
Comment 15 Kamil Dudka 2015-09-29 16:27:17 EDT
Comment on attachment 1078348
proposed fix (backport) v2

Thanks for the update!  It looks better now.  But please revert the unnecessary re-indentation.  The previous patch used tabs for indentation (as the upstream code does), whereas the current version uses spaces where the upstream patches use tabs.  This makes the changes difficult to follow.
Comment 16 David Kaspar [Dee'Kej] 2015-09-30 09:39 EDT
Created attachment 1078656
proposed fix (backport) v3

Ah, I didn't notice that. (It turned out I had some residual configuration in Git, which was transforming tabs to hardtabs, and in Vim, which was ignoring any indentation differences.) Thank you for your feedback.

Here's the same patch with just the indentation fixed to correspond more accurately with the upstream version.
Comment 17 Kamil Dudka 2015-10-01 04:36:20 EDT
Comment on attachment 1078656
proposed fix (backport) v3

No behavioral changes, keeping r+.  As a side note, this hunk is unnecessary:

> @@ -239,7 +266,7 @@ research(Regexp *rp, register char *str, int start,
>       if (rp->dfa && ! no_bol && ! need_start) {
>               char save;
>               int count = 0;
> -             /*
> +             /*
>                * dfa likes to stick a '\n' right after the matched
>                * text.  So we just save and restore the character.
>                */
> @@ -288,7 +315,7 @@ refree(Regexp *rp)
>               dfafree(&(rp->dfareg));
>       free(rp);
>  }
> - 
> +
>  /* dfaerror --- print an error message for the dfa routines */
>  void
Comment 18 Kamil Dudka 2015-10-01 04:44:38 EDT
Note that I was not able to actually set the review+ flag because of bug #1265524.
Comment 19 David Kaspar [Dee'Kej] 2015-10-01 05:28:41 EDT
I'm setting this bug as a blocker for RHEL-6.8 so that it is evaluated by Product Management.
