Description of problem:
There is a change in the evaluation of regular expressions since Fedora 28!
It affects applications which use functions from regex.h.
I detected the problem with the character 'w' in the Swedish locale.
Version-Release number of selected component (if applicable):
glibc in Fedora 28 and newer.
A range expression '[a-z]' matches character 'w' in LANG=C.
export LANG=C; echo 'w' | grep '[a-z]'
But in LANG=sv_SE.UTF8 it matches only up to Fedora 27. Since Fedora 28
(newer glibc?) it does not!
export LANG=sv_SE.UTF8; echo 'w' | grep '[a-z]'
Actual results:
A range expression '[a-z]' does not match character 'w' in LANG=sv_SE.UTF8.
Expected results:
A range expression '[a-z]' matches character 'w' in LANG=sv_SE.UTF8.
The character 'w' has been a basic character of the Swedish alphabet since 2006. More info in https://bugzilla.redhat.com/show_bug.cgi?id=1598336
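Since the report concerns functions from regex.h, the grep commands above can be approximated directly in C. This is only a minimal sketch and not code from the reporter: range_matches() is a hypothetical helper name, and the sv_SE.UTF8 locale is assumed to be installed on the test system.

```c
/* Hypothetical reproducer; range_matches() is not part of glibc or grep. */
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if `pattern` matches `text` under `locale_name`, 0 if it does
   not, and -1 if the locale is unavailable or the pattern does not compile.
   Note: this changes the locale of the whole process and does not restore it. */
int range_matches(const char *locale_name, const char *pattern, const char *text)
{
    assert(locale_name != NULL && pattern != NULL && text != NULL);

    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;

    regex_t re;
    /* Basic regular expression, the same default syntax grep uses. */
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Going by the behaviour described in this report, range_matches("C", "[a-z]", "w") should return 1, while range_matches("sv_SE.UTF8", "[a-z]", "w") would be expected to return 0 on Fedora 28 and newer.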
This is rather puzzling. I can reproduce it even with glibc-2.28-6.fc29.x86_64 and grep-3.1-8.fc29.x86_64 on Fedora 29, which should have the related bug 1607286 fixed.
Carlos, is this supposed to be fixed at all?
(In reply to Florian Weimer from comment #2)
> Carlos, is this supposed to be fixed at all?
No, this is not supposed to be fixed in sv_SE, and will not be fixed until we implement rational ranges.
The reason is that 'w' changed collation order. This was fixed in glibc 2.27 (commit 15973854813), which harmonized our collation with CLDR. Since then we correctly place 'w' in the equivalence class of 'v' and sort it with the normal sorting rules.
~~~ localedata/locales/sv_SE ~~~
% The letter w is normally not present in the Swedish alphabet. It
% exists in some names in Swedish and foreign words, but is accounted
% for as a variant of 'v'. Words and names with 'w' are in Swedish
% ordered alphabetically among the words and names with 'v'. If two
% words or names are only to be distinguished by 'v' or % 'w', 'v' is
% placed before 'w'.
<U0057> <S0076>;"<BASE><VRNT1>";"<CAP><MIN>";IGNORE % W
<U0077> <S0076>;"<BASE><VRNT1>";"<MIN><MIN>";IGNORE % w
So the order is uUvVwW (today), instead of uUvwVW (previously).
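The ordering described above can be inspected by sorting the six letters with strcoll(). This is a rough sketch, not something taken from glibc: sort_by_collation() and cmp_collate() are hypothetical names, and the sv_SE.UTF8 locale is assumed to be installed if you want to compare against the order quoted above.

```c
/* Hypothetical helpers for inspecting a locale's collation order. */
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: compares two strings using the current LC_COLLATE. */
static int cmp_collate(const void *a, const void *b)
{
    return strcoll(*(const char *const *)a, *(const char *const *)b);
}

/* Sorts `items` in place using the collation rules of `locale_name`.
   Returns 0 on success, -1 if the locale is not available.
   Note: LC_COLLATE stays switched for the whole process afterwards. */
int sort_by_collation(const char *locale_name, const char **items, size_t n)
{
    assert(locale_name != NULL && items != NULL);

    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;

    qsort(items, n, sizeof items[0], cmp_collate);
    return 0;
}
```

Called with "C" on the letters u, U, v, V, w, W this gives plain byte order (U V W u v w). Called with "sv_SE.UTF8" on glibc 2.27 or later it should, according to the explanation above, give u U v V w W.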
However, since that change we no longer have the collation element ordering (CEO) required to support [a-z] range matching in Swedish. We are not required to support it, because POSIX says range matching has unspecified behaviour in any locale but C.
This bug is really a duplicate request for rational range support. Once we have rational range support this will work as expected in sv_SE locale.
We could fix it today by doing the required surgery to sv_SE, but we inherit this from upstream.
I'm going to mark this as CLOSED/WONTFIX since this issue has to get solved upstream first before we backport any solution. An upstream solution would land in Fedora at a maximum of 6 months later when a new Fedora is released (immediately in the case of Rawhide).
The upstream issue is this:
I don't see much value in tracking it in Fedora, unless we want to ensure that there is continued visibility and pressure to ensure a fix goes upstream.
The current workaround is to use the C/POSIX locale, which is the only locale where range matching behaves as required by the POSIX standard.
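For an application that calls regex.h directly, one way to apply this is to switch to the C locale only around the match. The following is a sketch under assumptions, not an official recommendation: match_in_c_locale() is a hypothetical helper, and because setlocale() affects the whole process it is not safe to use while other threads depend on the locale.

```c
/* Hypothetical helper; match_in_c_locale() is not a glibc function. */
#define _POSIX_C_SOURCE 200809L  /* for strdup() */
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Matches `text` against `pattern` with the C locale active, then restores
   the caller's locale. Returns 1 on match, 0 on no match, -1 on error. */
int match_in_c_locale(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);

    /* A later setlocale() call may overwrite the returned string, so copy it. */
    const char *current = setlocale(LC_ALL, NULL);
    char *saved = current ? strdup(current) : NULL;
    if (saved == NULL)
        return -1;

    int result = -1;
    if (setlocale(LC_ALL, "C") != NULL) {
        regex_t re;
        if (regcomp(&re, pattern, REG_NOSUB) == 0) {
            result = regexec(&re, text, 0, NULL, 0) == 0 ? 1 : 0;
            regfree(&re);
        }
    }

    /* Put the caller's locale back before returning. */
    setlocale(LC_ALL, saved);
    free(saved);
    return result;
}
```

With this, match_in_c_locale("[a-z]", "w") returns 1 regardless of the locale the process started in. The trade-off is that in the C locale the pattern and text are treated as bytes, so non-ASCII letters are not handled as characters.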