1631472 – Locale support in regular expression and range expression

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	29
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Carlos O'Donell
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-09-20 16:14 UTC by Jaroslav Rohel
Modified:	2023-12-19 12:43 UTC
CC List:	12 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2018-10-01 16:28:56 UTC
Type:	Bug
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1607286	0	high	CLOSED	glibc regex [a-z] and [A-Z] results changed for English locales after harmonization with Unicode/ISO 14651.	2021-02-22 00:41:40 UTC
Sourceware	23393	0	None	None	None	2018-10-01 16:28:55 UTC

Internal Links: 1607286

Description Jaroslav Rohel 2018-09-20 16:14:27 UTC

Description of problem:
The evaluation of regular expressions has changed since Fedora 28.
It affects applications that use functions from regex.h.

I detected the problem with the character 'w' in Swedish.

Version-Release number of selected component (if applicable):
glibc in Fedora 28 and newer.

How reproducible:
A range expression '[a-z]' matches the character 'w' in LANG=C:
export LANG=C; echo 'w' | grep '[a-z]'

But in LANG=sv_SE.UTF8 it matches only up to Fedora 27. Since Fedora 28
(newer glibc?) it no longer does:
export LANG=sv_SE.UTF8; echo 'w' | grep '[a-z]'

Actual results:
A range expression '[a-z]' does not match character 'w' in LANG=sv_SE.UTF8.

Expected results:
A range expression '[a-z]' matches character 'w' in LANG=sv_SE.UTF8.

The character 'w' has been a basic character of the Swedish alphabet since 2006. More info in https://bugzilla.redhat.com/show_bug.cgi?id=1598336
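
A minimal sketch of the reproduction above as a reusable check, assuming a POSIX shell and GNU grep. The helper name range_match is hypothetical. It reports grep's exit status as match, no match, or error, so that an outright failure (such as an invalid range) is not mistaken for a non-match. The sv_SE.UTF-8 result depends on the installed glibc and locale data; if that locale is not installed, grep typically falls back to the C locale.

```shell
# range_match LOCALE RANGE CHAR
# Hypothetical helper: runs grep under LOCALE and reports whether CHAR
# matches the bracket expression RANGE.
range_match() {
    status=0
    printf '%s\n' "$3" | LC_ALL="$1" grep -q -- "$2" 2>/dev/null || status=$?
    case $status in
        0) echo "match" ;;
        1) echo "no match" ;;
        *) echo "error" ;;   # e.g. grep rejected the range
    esac
}

range_match C '[a-z]' w             # C locale: 'w' lies between 'a' and 'z'
range_match sv_SE.UTF-8 '[a-z]' w   # depends on glibc version and locale data
```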

Comment 1 Florian Weimer 2018-09-20 17:33:17 UTC

This is rather puzzling.  I can reproduce it even with glibc-2.28-6.fc29.x86_64 and grep-3.1-8.fc29.x86_64 on Fedora 29, which should have the related bug 1607286 fixed.

Comment 2 Florian Weimer 2018-09-20 17:33:44 UTC

Carlos, is this supposed to be fixed at all?

Comment 3 Carlos O'Donell 2018-10-01 15:49:42 UTC

(In reply to Florian Weimer from comment #2)
> Carlos, is this supposed to be fixed at all?

No, this is not supposed to be fixed in sv_SE, and will not be fixed until we implement rational ranges.

The reason is that the collation order of 'w' changed in glibc 2.27 (commit 15973854813), which harmonized our collation with CLDR. Since then we correctly place 'w' in the equivalence class of 'v' and sort it with the normal sorting rules.

i.e.
~~~ localedata/locales/sv_SE ~~~
% The letter w is normally not present in the Swedish alphabet. It
% exists in some names in Swedish and foreign words, but is accounted
% for as a variant of 'v'.  Words and names with 'w' are in Swedish
% ordered alphabetically among the words and names with 'v'. If two
% words or names are only to be distinguished by 'v' or % 'w', 'v' is
% placed before 'w'.

% &v<<<V<<w<<<W
<U0057> <S0076>;"<BASE><VRNT1>";"<CAP><MIN>";IGNORE % W
<U0077> <S0076>;"<BASE><VRNT1>";"<MIN><MIN>";IGNORE % w
~~~
so uUvVwW (today), instead of uUvwVW (previously).

However, since that point we no longer have the collation element ordering (CEO) required to support [a-z] range matching in Swedish. We are not required to provide it, because POSIX says range matching has unspecified behaviour in any locale other than C.

This bug is really a duplicate request for rational range support. Once we have rational range support this will work as expected in sv_SE locale.

We could fix it today by doing the required surgery to sv_SE, but we inherit this from upstream.
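
The ordering described in this comment can be inspected with sort, which compares strings using the LC_COLLATE data of the active locale. This is only an illustrative sketch: collation_order is a hypothetical helper, the sv_SE.UTF-8 locale is assumed to be installed, and sort is not the same code path as regex range matching, although both draw on the same collation data.

```shell
# collation_order LOCALE ITEM...
# Hypothetical helper: sorts the items under LOCALE and prints them on one line.
collation_order() {
    loc=$1
    shift
    printf '%s\n' "$@" | LC_ALL="$loc" sort | paste -sd ' ' -
}

collation_order C u U v V w W             # byte order: U V W u v w
collation_order sv_SE.UTF-8 u U v V w W   # order depends on the installed locale data
```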

Comment 4 Carlos O'Donell 2018-10-01 16:28:56 UTC

I'm going to mark this as CLOSED/WONTFIX since this issue has to get solved upstream first before we backport any solution. An upstream solution would land in Fedora at a maximum of 6 months later when a new Fedora is released (immediately in the case of Rawhide).

The upstream issue is this:
https://sourceware.org/bugzilla/show_bug.cgi?id=23393

I don't see much value in tracking it in Fedora, unless we want to ensure that there is continued visibility and pressure to ensure a fix goes upstream.

The current solution is that you must use the C/POSIX locale to get range matching as required by the POSIX standard.
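
As a sketch of that workaround, the locale can be overridden for a single command while the rest of the session keeps its Swedish settings. The wrapper name c_grep is hypothetical. Note the trade-off: in the C locale, non-ASCII letters such as 'å' are treated as raw bytes and are not covered by [a-z].

```shell
# Hypothetical wrapper: run grep in the C locale for this invocation only.
c_grep() {
    LC_ALL=C grep "$@"
}

# The surrounding shell may still have LANG=sv_SE.UTF-8 exported;
# only this grep sees the C locale.
printf 'w\n' | c_grep '[a-z]'
```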

Comment 5 Per Lundberg 2023-12-19 12:43:08 UTC

For reference, in case this helps anyone: the original problem with w not being included in [a-z] ranges if sv_SE or sv_FI locales are being used *seems* to have been fixed in some recent (glibc?) update. I haven't tested this on any Red Hat or Fedora-related systems, but sharing my findings here regardless.

With Ubuntu 20.04, this is easily reproducible ("Invalid range end" below):

❯ docker run -it --rm ubuntu:20.04
root@02ef59a8f0e0:/# apt-get update && apt-get install locales && echo sv_SE.UTF-8 UTF-8 >> /etc/locale.gen && locale-gen
Get:1 http://archive.ubuntu.com/ubuntu focal InRelease [265 kB]
Get:2 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Get:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Get:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease [108 kB]
Get:5 http://security.ubuntu.com/ubuntu focal-security/restricted amd64 Packages [3130 kB]
Get:6 http://archive.ubuntu.com/ubuntu focal/restricted amd64 Packages [33.4 kB]
Get:7 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages [1275 kB]    
Get:8 http://archive.ubuntu.com/ubuntu focal/universe amd64 Packages [11.3 MB]
Get:9 http://security.ubuntu.com/ubuntu focal-security/multiverse amd64 Packages [29.3 kB] 
Get:10 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages [3283 kB]        
Get:11 http://security.ubuntu.com/ubuntu focal-security/universe amd64 Packages [1149 kB]   
Get:12 http://archive.ubuntu.com/ubuntu focal/multiverse amd64 Packages [177 kB]             
Get:13 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 Packages [1444 kB]
Get:14 http://archive.ubuntu.com/ubuntu focal-updates/multiverse amd64 Packages [32.0 kB]
Get:15 http://archive.ubuntu.com/ubuntu focal-updates/restricted amd64 Packages [3279 kB]
Get:16 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages [3761 kB]
Get:17 http://archive.ubuntu.com/ubuntu focal-backports/universe amd64 Packages [28.6 kB]
Get:18 http://archive.ubuntu.com/ubuntu focal-backports/main amd64 Packages [55.2 kB]
Fetched 29.6 MB in 2s (15.3 MB/s)                          
Reading package lists... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  locales
0 upgraded, 1 newly installed, 0 to remove and 9 not upgraded.
Need to get 3871 kB of archives.
After this operation, 17.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 locales all 2.31-0ubuntu9.14 [3871 kB]
Fetched 3871 kB in 1s (3415 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package locales.
(Reading database ... 4126 files and directories currently installed.)
Preparing to unpack .../locales_2.31-0ubuntu9.14_all.deb ...
Unpacking locales (2.31-0ubuntu9.14) ...
Setting up locales (2.31-0ubuntu9.14) ...
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install the Term::ReadLine module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.30.0 /usr/local/share/perl/5.30.0 /usr/lib/x86_64-linux-gnu/perl5/5.30 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.30 /usr/share/perl/5.30 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)
debconf: falling back to frontend: Teletype
Generating locales (this might take a while)...
Generation complete.
Generating locales (this might take a while)...
  sv_SE.UTF-8... done
Generation complete.
root@02ef59a8f0e0:/# export LC_ALL=sv_SE.UTF-8
root@02ef59a8f0e0:/# echo z | grep [a-w] ; echo $?
grep: Invalid range end
2

Trying with Ubuntu 22.04, this no longer seems to produce any error:

❯ docker run -it --rm ubuntu:22.04
root@e4241604a208:/# apt-get update && apt-get install locales && echo sv_SE.UTF-8 UTF-8 >> /etc/locale.gen && locale-gen
Get:1 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy/restricted amd64 Packages [164 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages [1792 kB]
Get:6 http://archive.ubuntu.com/ubuntu jammy/universe amd64 Packages [17.5 MB]
Get:7 http://archive.ubuntu.com/ubuntu jammy/multiverse amd64 Packages [266 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages [1602 kB]
Get:9 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]      
Get:10 http://archive.ubuntu.com/ubuntu jammy-updates/multiverse amd64 Packages [49.8 kB]
Get:11 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1305 kB]      
Get:12 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [1599 kB]          
Get:13 http://security.ubuntu.com/ubuntu jammy-security/multiverse amd64 Packages [44.0 kB]         
Get:14 http://archive.ubuntu.com/ubuntu jammy-backports/main amd64 Packages [50.4 kB]
Get:15 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [1326 kB]
Get:16 http://archive.ubuntu.com/ubuntu jammy-backports/universe amd64 Packages [28.1 kB]
Get:17 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [1572 kB]
Get:18 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1046 kB]
Fetched 28.9 MB in 3s (11.5 MB/s)                           
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  locales
0 upgraded, 1 newly installed, 0 to remove and 16 not upgraded.
Need to get 4245 kB of archives.
After this operation, 17.5 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 locales all 2.35-0ubuntu3.5 [4245 kB]
Fetched 4245 kB in 2s (1699 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package locales.
(Reading database ... 4395 files and directories currently installed.)
Preparing to unpack .../locales_2.35-0ubuntu3.5_all.deb ...
Unpacking locales (2.35-0ubuntu3.5) ...
Setting up locales (2.35-0ubuntu3.5) ...
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install the Term::ReadLine module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.34.0 /usr/local/share/perl/5.34.0 /usr/lib/x86_64-linux-gnu/perl5/5.34 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl-base /usr/lib/x86_64-linux-gnu/perl/5.34 /usr/share/perl/5.34 /usr/local/lib/site_perl) at /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)
debconf: falling back to frontend: Teletype
Generating locales (this might take a while)...
Generation complete.
Generating locales (this might take a while)...
  sv_SE.UTF-8... done
Generation complete.
root@e4241604a208:/# export LC_ALL=sv_SE.UTF-8   
root@e4241604a208:/# echo z | grep [a-w] ; echo $?
1

We also tested with "grep"ing for strings containing w on these versions: Ubuntu 20.04 doesn't match the w character when using [a-z] (as expected because of this bug), but 22.04 *does* match the w character in [a-z] even with sv_SE and sv_FI locales.
