1250830 – awk matches lowercase when searching for uppercase range

Summary: awk matches lowercase when searching for uppercase range

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	gawk
Sub Component:
Version:	6.6
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	David Kaspar // Dee'Kej
QA Contact:	Jan Kepler
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1254457 1271443 1271701

Reported:	2015-08-06 06:18 UTC by Scott Rochford
Modified:	2019-08-15 05:03 UTC
CC List:	11 users
Fixed In Version:	gawk-3.1.7-11.el6
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1271701
Environment:
Last Closed:	2016-06-17 13:31:59 UTC
Target Upstream Version:
Embargoed:

Attachments
proposed fix (backport) (11.67 KB, patch) 2015-09-24 12:18 UTC, David Kaspar // Dee'Kej	no flags
proposed fix (backport) v2 (16.81 KB, application/mbox) 2015-09-29 13:47 UTC, David Kaspar // Dee'Kej	kdudka: review+
proposed fix (backport) v3 (16.98 KB, patch) 2015-09-30 13:39 UTC, David Kaspar // Dee'Kej	no flags

Description Scott Rochford 2015-08-06 06:18:56 UTC

Description of problem:

When specifying an uppercase range '[A-Z]', lowercase characters are matched. It appears to use a character sequence of aAbBcC...zZ, since it matches 'b' but does not match 'a'.

I am aware of the '[[:upper:]]' character class workaround; however, the '[A-Z]' range is a convention that has been used, and will continue to be used, for quite some time.

A similar problem was reported with 'grep': 

https://bugzilla.redhat.com/show_bug.cgi?id=1171806



Version-Release number of selected component (if applicable):

GNU Awk 3.1.7


How reproducible: 100%

Steps to Reproduce:
1. # export LANG=en_US.UTF-8
2. # echo b | awk '/[A-Z]/'


Actual results:

b


Expected results:

(nothing)
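
For comparison, here is a minimal sketch of two ways to get the expected result on an affected system. The first uses the '[[:upper:]]' character class mentioned in the description. The second forces the C locale for the awk invocation; that one is an assumption on my part rather than something stated in this report, and it relies on bracket ranges being compared by byte value in the C locale. The variable names are only illustrative.

```shell
# Sample input: one lowercase and one uppercase letter, one per line.
input=$(printf 'b\nB')

# Workaround 1: the POSIX character class named in the report.
class_out=$(printf '%s\n' "$input" | awk '/[[:upper:]]/')

# Workaround 2 (assumption, not from the report): run awk in the C locale,
# where the range A-Z is evaluated by byte value instead of collation order.
c_locale_out=$(printf '%s\n' "$input" | LC_ALL=C awk '/[A-Z]/')

echo "character class: $class_out"
echo "C locale range:  $c_locale_out"
```

Both commands should print only the uppercase line. Neither changes the behaviour of a plain '[A-Z]' under en_US.UTF-8, which is what this bug is about.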

Additional info:

Comment 2 Scott Rochford 2015-08-06 06:36:23 UTC

Forgot release number: gawk-3.1.7-10.el6.x86_64

Comment 3 Zdenek Pytela 2015-09-21 06:46:39 UTC

Reproduced on:
RHEL 5: gawk-3.1.5-16.el5
RHEL 6: gawk-3.1.7-10.el6.x86_64

Not reproduced on RHEL 7 with gawk-4.0.2-4.el7.x86_64

Comment 4 David Kaspar // Dee'Kej 2015-09-21 10:08:58 UTC

I have also reproduced this behaviour on RHEL6. RHEL7 uses a newer version, which underwent a major source code rebase, and it is already fixed there.

I have identified an upstream commit that fixes the issue. I'm currently working on extracting the necessary patch from it and porting it back.

Once I have it, I will post it here for review.

Comment 9 David Kaspar // Dee'Kej 2015-09-24 12:18:10 UTC

Created attachment 1076500
proposed fix (backport)

Comment 11 Kamil Dudka 2015-09-24 17:33:19 UTC

Comment on attachment 1076500
proposed fix (backport)

Why are the newly introduced buf/buf_len variables static?  It makes no sense without the following changes:

+   if (buf == NULL) {
+       emalloc(buf, char *, len + 2, "make_regexp");
+       buflen = len;
+   } else if (len > buflen) {
+       erealloc(buf, char *, len + 2, "make_regexp");
+       buflen = len;
+   }
+   dest = buf;

[...]

-
-   free(temp);

Please consider also backporting the related parts of the following upstream commits that fix bugs in the code introduced by this patch:

http://git.savannah.gnu.org/cgit/gawk.git/commit/?id=00d79709
http://git.savannah.gnu.org/cgit/gawk.git/commit/?id=2c126c49

Comment 12 David Kaspar // Dee'Kej 2015-09-29 13:47:07 UTC

Created attachment 1078348
proposed fix (backport) v2

Kamil, thank you for your review and valuable info about the additional fixes.

Here is the (hopefully) final version of this backport. Could you please check that the changes are sane?

Thank you.

Comment 15 Kamil Dudka 2015-09-29 20:27:17 UTC

Comment on attachment 1078348
proposed fix (backport) v2

Thanks for the update!  It looks better now.  But please revert the unnecessary re-indentation.  The previous patch used tabs for indentation (as the upstream code does).  The current version of the patch uses spaces for indentation where the upstream patches use tabs for indentation.  This makes the changes difficult to follow.
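
One quick way to spot this kind of re-indentation before posting a patch is to count the added lines that begin with a space instead of a tab. The sketch below runs against a made-up two-line fragment, not the actual attachment, and the file and variable names are hypothetical.

```shell
# Hypothetical patch fragment: the first added line is indented with a tab,
# the second with spaces.
patch_file=$(mktemp)
printf '+\tif (buf == NULL) {\n+    dest = buf;\n' > "$patch_file"

# Count added lines ("+" in column one) whose indentation starts with a space.
# In a basic regular expression the "+" here is a literal character.
space_indented=$(grep -c '^+ ' "$patch_file")

echo "added lines indented with spaces: $space_indented"
rm -f "$patch_file"
```

A non-zero count is only a hint to look closer, since some added lines can legitimately start with a space.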

Comment 16 David Kaspar // Dee'Kej 2015-09-30 13:39:34 UTC

Created attachment 1078656
proposed fix (backport) v3

Ah, I didn't notice that. (Turned out I had some residual configuration in GIT, which was transforming tabs to hardtabs, and VIM, which was ignoring any indentation differences.) Thank you for your feedback.

Here's the same patch with just the indentation fixed to correspond more accurately with the upstream version.

Comment 17 Kamil Dudka 2015-10-01 08:36:20 UTC

Comment on attachment 1078656
proposed fix (backport) v3

No behavioral changes, keeping r+.  As a side note, this hunk is unnecessary:

> @@ -239,7 +266,7 @@ research(Regexp *rp, register char *str, int start,
>       if (rp->dfa && ! no_bol && ! need_start) {
>               char save;
>               int count = 0;
> -             /*
> +             /*
>                * dfa likes to stick a '\n' right after the matched
>                * text.  So we just save and restore the character.
>                */
> @@ -288,7 +315,7 @@ refree(Regexp *rp)
>               dfafree(&(rp->dfareg));
>       free(rp);
>  }
> - 
> +
>  /* dfaerror --- print an error message for the dfa routines */
>  
>  void

Comment 18 Kamil Dudka 2015-10-01 08:44:38 UTC

Note I was not able to actually set the review+ flag because of bug #1265524.

Comment 19 David Kaspar // Dee'Kej 2015-10-01 09:28:41 UTC

I'm setting this bug as a blocker for RHEL-6.8, so that it is evaluated by Product Management.
