Bug 1631472

Summary: Locale support in regular expression and range expression
Product: [Fedora] Fedora
Reporter: Jaroslav Rohel <jrohel>
Component: glibc
Assignee: Carlos O'Donell <codonell>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 29
CC: aoliva, arjun.is, codonell, dj, fweimer, law, mfabian, nige, pfrankli, rth, siddhesh
Hardware: Unspecified
OS: Unspecified
Last Closed: 2018-10-01 16:28:56 UTC
Type: Bug

Description Jaroslav Rohel 2018-09-20 16:14:27 UTC
Description of problem:
The evaluation of regular expressions has changed since Fedora 28.
It affects applications that use functions from regex.h.

I detected the problem with the character 'w' in the Swedish locale.

Version-Release number of selected component (if applicable):
glibc in Fedora 28 and newer.

How reproducible:
A range expression '[a-z]' matches the character 'w' in LANG=C:
export LANG=C; echo 'w' | grep '[a-z]'
In LANG=sv_SE.UTF8 it matched up to and including Fedora 27, but since
Fedora 28 (newer glibc?) it no longer does:
export LANG=sv_SE.UTF8; echo 'w' | grep '[a-z]'
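
The two commands above can be wrapped into a small script that compares both locales side by side. This is only a sketch: range_matches is a hypothetical helper name, and it sets LC_ALL rather than LANG so that other LC_* variables in the environment cannot override the locale under test. The Swedish check is skipped when the locale is not installed, because an unknown locale falls back to C behaviour and would give a misleading "match".

```shell
#!/bin/sh
# range_matches LOCALE PATTERN TEXT
# Hypothetical helper: prints "match" or "no match" depending on
# whether grep finds PATTERN in TEXT under the given locale.
range_matches() {
    if printf '%s\n' "$3" | LC_ALL="$1" grep -q -- "$2"; then
        echo "match"
    else
        echo "no match"
    fi
}

echo "C:           $(range_matches C '[a-z]' w)"

# Only test Swedish if the locale is actually installed.
if locale -a 2>/dev/null | grep -qi '^sv_SE\.utf-\{0,1\}8$'; then
    echo "sv_SE.UTF-8: $(range_matches sv_SE.UTF-8 '[a-z]' w)"
else
    echo "sv_SE.UTF-8: locale not installed, skipped"
fi
```

On an affected system (glibc 2.27 or newer, per this report) the Swedish line would be expected to print "no match".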

Actual results:
A range expression '[a-z]' does not match the character 'w' in LANG=sv_SE.UTF8.

Expected results:
A range expression '[a-z]' matches the character 'w' in LANG=sv_SE.UTF8.

The character 'w' has been a basic letter of the Swedish alphabet since 2006. More info in https://bugzilla.redhat.com/show_bug.cgi?id=1598336

Comment 1 Florian Weimer 2018-09-20 17:33:17 UTC
This is rather puzzling.  I can reproduce it even with glibc-2.28-6.fc29.x86_64 and grep-3.1-8.fc29.x86_64 on Fedora 29, which should have the related bug 1607286 fixed.

Comment 2 Florian Weimer 2018-09-20 17:33:44 UTC
Carlos, is this supposed to be fixed at all?

Comment 3 Carlos O'Donell 2018-10-01 15:49:42 UTC
(In reply to Florian Weimer from comment #2)
> Carlos, is this supposed to be fixed at all?

No, this is not supposed to be fixed in sv_SE, and will not be fixed until we implement rational ranges.

The reason is that 'w' changed collation order. The collation was corrected in glibc 2.27 (commit 15973854813), which harmonized our collation with CLDR. Since then we correctly place 'w' in the equivalence class of 'v' and sort it with the normal sorting rules.

~~~ localedata/locales/sv_SE ~~~
% The letter w is normally not present in the Swedish alphabet. It
% exists in some names in Swedish and foreign words, but is accounted
% for as a variant of 'v'.  Words and names with 'w' are in Swedish
% ordered alphabetically among the words and names with 'v'. If two
% words or names are only to be distinguished by 'v' or 'w', 'v' is
% placed before 'w'.

% &v<<<V<<w<<<W
<U0057> <S0076>;"<BASE><VRNT1>";"<CAP><MIN>";IGNORE % W
<U0077> <S0076>;"<BASE><VRNT1>";"<MIN><MIN>";IGNORE % w
So the order is now uUvVwW (today), instead of uUvwVW (previously).
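
One way to observe the collation order on a given system is to sort the four letters under each locale. The sketch below assumes only the standard sort, tr and locale utilities; collation_order is a hypothetical helper name. In the C locale the result is plain byte order, while under sv_SE the result depends on the installed glibc locale data.

```shell
#!/bin/sh
# collation_order LOCALE
# Hypothetical helper: sorts the letters W, w, V, v under the given
# locale and prints them joined on a single line.
collation_order() {
    printf '%s\n' W w V v | LC_ALL="$1" sort | tr -d '\n'
    echo
}

# Byte order in the C locale: uppercase before lowercase.
echo "C:     $(collation_order C)"

# Print the Swedish order only if that locale is installed.
if locale -a 2>/dev/null | grep -qi '^sv_SE\.utf-\{0,1\}8$'; then
    echo "sv_SE: $(collation_order sv_SE.UTF-8)"
fi
```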

However, since that change we no longer have the collation element ordering (CEO) required to support [a-z] range matching in Swedish. We are not required to support it, because POSIX says range matching has unspecified behaviour in any locale other than C.

This bug is really a duplicate request for rational range support. Once we have rational range support this will work as expected in sv_SE locale.

We could fix it today by doing the required surgery to sv_SE, but we inherit this from upstream.

Comment 4 Carlos O'Donell 2018-10-01 16:28:56 UTC
I'm going to mark this as CLOSED/WONTFIX since this issue has to get solved upstream first before we backport any solution. An upstream solution would land in Fedora at a maximum of 6 months later when a new Fedora is released (immediately in the case of Rawhide).

The upstream issue is this:

I don't see much value in tracking it in Fedora, unless we want to ensure that there is continued visibility and pressure to ensure a fix goes upstream.

The current solution is that you must use the C/POSIX locale to get range matching as required by the POSIX standard.
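
A minimal sketch of that workaround, assuming a POSIX shell and grep: the C locale is forced for a single command rather than for the whole session. c_grep is a hypothetical wrapper name. The second command shows an alternative that is an assumption of this sketch, not something proposed in this report: the [[:lower:]] character class is defined by the locale's LC_CTYPE category rather than by collation order, so it should not depend on where 'w' sorts.

```shell
#!/bin/sh
# Hypothetical wrapper: run grep in the C locale for this call only,
# leaving the caller's locale settings untouched.
c_grep() {
    LC_ALL=C grep "$@"
}

# Range expression evaluated in the C locale, as POSIX requires.
echo 'w' | c_grep '[a-z]'

# Alternative (assumption): a character class instead of a range.
echo 'w' | grep '[[:lower:]]'
```

Note that forcing LC_ALL=C also makes grep treat input as bytes, so this is not suitable for patterns that must match non-ASCII letters.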