Bug 1631472 - Locale support in regular expression and range expression
Summary: Locale support in regular expression and range expression
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: 29
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Carlos O'Donell
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-09-20 16:14 UTC by Jaroslav Rohel
Modified: 2018-10-01 16:28 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-01 16:28:56 UTC
Type: Bug


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1607286 | 0 | high | CLOSED | glibc regex [a-z] and [A-Z] results changed for English locales after harmonization with Unicode/ISO 14651. | 2021-02-22 00:41:40 UTC
Sourceware | 23393 | 0 | None | None | None | 2018-10-01 16:28:55 UTC

Internal Links: 1607286

Description Jaroslav Rohel 2018-09-20 16:14:27 UTC
Description of problem:
The evaluation of regular expressions has changed since Fedora 28!
It affects applications that use functions from regex.h.

I detected the problem with the character 'w' in the Swedish locale.

Version-Release number of selected component (if applicable):
glibc in Fedora 28 and newer.

How reproducible:
A range expression '[a-z]' matches character 'w' in LANG=C.
export LANG=C; echo 'w' | grep '[a-z]'
 
But in LANG=sv_SE.UTF8 it matches only up to Fedora 27. Since Fedora 28
(newer glibc?) it does not!
export LANG=sv_SE.UTF8; echo 'w' | grep '[a-z]'

Actual results:
A range expression '[a-z]' does not match character 'w' in LANG=sv_SE.UTF8.

Expected results:
A range expression '[a-z]' matches character 'w' in LANG=sv_SE.UTF8.

The character 'w' has been a basic letter of the Swedish alphabet since 2006. More info in https://bugzilla.redhat.com/show_bug.cgi?id=1598336
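
The reproduction steps can also be wrapped in a small script so both locales are checked in one run. This is only a sketch: range_matches is a hypothetical helper name, and it assumes the sv_SE.UTF8 locale is installed. If it is not, grep falls back to the C locale and the difference will not show up.

```shell
#!/bin/sh
# range_matches is a hypothetical helper, not part of grep or glibc.
# $1 = locale name, $2 = single character to test against '[a-z]'.
range_matches() {
    # env sets LC_ALL for grep only, leaving the calling shell untouched.
    if printf '%s\n' "$2" | env LC_ALL="$1" grep -q '[a-z]'; then
        echo "match"
    else
        echo "no match"
    fi
}

# Compare the C locale with the Swedish locale from this report.
# Assumption: sv_SE.UTF8 is installed; otherwise grep runs in the C locale.
for loc in C sv_SE.UTF8; do
    printf '%s: %s\n' "$loc" "$(range_matches "$loc" w)"
done
```

On an affected system the two lines should differ; on Fedora 27 and older both should report a match.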

Comment 1 Florian Weimer 2018-09-20 17:33:17 UTC
This is rather puzzling.  I can reproduce it even with glibc-2.28-6.fc29.x86_64 and grep-3.1-8.fc29.x86_64 on Fedora 29, which should have the related bug 1607286 fixed.

Comment 2 Florian Weimer 2018-09-20 17:33:44 UTC
Carlos, is this supposed to be fixed at all?

Comment 3 Carlos O'Donell 2018-10-01 15:49:42 UTC
(In reply to Florian Weimer from comment #2)
> Carlos, is this supposed to be fixed at all?

No, this is not supposed to be fixed in sv_SE, and will not be fixed until we implement rational ranges.

The reason is that the collation order of 'w' changed. This was fixed in glibc 2.27 (commit 15973854813), which harmonized our collation with CLDR. Since then we correctly place 'w' in the equivalence class of 'v' and sort it with the normal sorting rules.

i.e.
~~~ localedata/locales/sv_SE ~~~
% The letter w is normally not present in the Swedish alphabet. It
% exists in some names in Swedish and foreign words, but is accounted
% for as a variant of 'v'.  Words and names with 'w' are in Swedish
% ordered alphabetically among the words and names with 'v'. If two
% words or names are only to be distinguished by 'v' or % 'w', 'v' is
% placed before 'w'.

% &v<<<V<<w<<<W
<U0057> <S0076>;"<BASE><VRNT1>";"<CAP><MIN>";IGNORE % W
<U0077> <S0076>;"<BASE><VRNT1>";"<MIN><MIN>";IGNORE % w
~~~
so uUvVwW (today), instead of uUvwVW (previously).
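
A rough way to look at the ordering on a given system is to sort the six letters under the locale in question. This is a sketch under assumptions: collation_order is a hypothetical helper name, sv_SE.UTF8 must be installed (otherwise sort falls back to the C locale), and sort reflects string collation, which is related to but not identical to what the regex range code consults.

```shell
#!/bin/sh
# collation_order is a hypothetical helper, not a glibc or coreutils tool.
# $1 = locale name, remaining arguments = characters to order.
collation_order() {
    loc=$1
    shift
    # One character per line, sorted with the locale's collation rules,
    # then joined back into a single line for easy comparison.
    printf '%s\n' "$@" | env LC_ALL="$loc" sort | tr -d '\n'
    echo
}

# Byte order in the C locale versus the Swedish collation rules.
collation_order C u U v V w W
collation_order sv_SE.UTF8 u U v V w W
```

The exact output for sv_SE depends on the installed glibc version, so no expected result is given for it here.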

However, since that point we no longer have the collation element ordering (CEO) required to support [a-z] range matching in Swedish. We are not required to provide it, because POSIX says range matching has unspecified behaviour in any locale but C.

This bug is really a duplicate request for rational range support. Once we have rational range support this will work as expected in sv_SE locale.

We could fix it today by doing the required surgery to sv_SE, but we inherit this from upstream.

Comment 4 Carlos O'Donell 2018-10-01 16:28:56 UTC
I'm going to mark this as CLOSED/WONTFIX since this issue has to get solved upstream first before we backport any solution. An upstream solution would land in Fedora at a maximum of 6 months later when a new Fedora is released (immediately in the case of Rawhide).

The upstream issue is this:
https://sourceware.org/bugzilla/show_bug.cgi?id=23393

I don't see much value in tracking it in Fedora, unless we want to ensure that there is continued visibility and pressure to ensure a fix goes upstream.

The current solution is that you must use the C/POSIX locale to get range matching as required by the POSIX standard.
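
As a sketch of that workaround, the C locale can be forced for just the one grep invocation, so the rest of the session keeps its Swedish settings. ascii_lower_lines is a hypothetical helper name. Note the assumption that the input is ASCII: in the C locale grep treats multibyte UTF-8 characters as individual bytes, so this only suits ranges like [a-z].

```shell
#!/bin/sh
# ascii_lower_lines is a hypothetical helper, not part of grep.
# Reads lines on stdin and prints those containing an ASCII lowercase letter.
ascii_lower_lines() {
    # LC_ALL=C applies to this grep only; the caller's locale is unchanged.
    env LC_ALL=C grep '[a-z]'
}

# Only the line with 'w' is in the range under the C locale.
printf '%s\n' w W 1 | ascii_lower_lines   # prints: w
```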