Bug 1653745

Summary: glibc: Czech digraph 'ch' no longer in range [a-z]
Product: Red Hat Enterprise Linux 8
Component: glibc
Version: 8.0
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Reporter: Sergey Kolosov <skolosov>
Assignee: Carlos O'Donell <codonell>
QA Contact: qe-baseos-tools-bugs
CC: ashankar, codonell, dj, fweimer, mnewsome, ohudlick, pbonzini, pfrankli, schwab
Target Milestone: rc
Target Release: 8.0
Hardware: All
OS: Linux
Clone Of: 587360
Bug Depends On: 587360
Last Closed: 2018-12-11 16:35:20 UTC

Comment 1 Florian Weimer 2018-11-27 14:39:49 UTC

echo abcdefghchijklmnopqrstuvwxyz | LC_ALL=cs_CZ.UTF-8 sed 's/[^a-z]/-/g'

should print:

abcdefghchijklmnopqrstuvwxyz

and not:

abcdefgh-ijklmnopqrstuvwxyz

The latter is the result on glibc-2.28-35.el8.

This is not a regression; the regression test did not run in the correct environment before.

Comment 2 Carlos O'Donell 2018-12-01 03:00:51 UTC

The test bz587360-digraph-matching-differs-across-archs must be fixed to look for the same result across all arches, *not* a particular result across all arches.

Ranges are only defined for the C/POSIX locales, and are unspecified for any other locale, including cs_CZ.UTF-8.
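
For contrast, here is a minimal sketch of the one case where the range is well defined. It pins the C locale for the sed call, so [a-z] is exactly the 26 ASCII lowercase letters and 'ch' is just 'c' followed by 'h'. The function name mask_non_lower is made up for this illustration. Note that in the C locale sed works byte by byte, so a multibyte UTF-8 character such as č would be replaced by one dash per byte; the inputs below are ASCII only for that reason.

```shell
# mask_non_lower (hypothetical helper): replace every byte outside
# ASCII a-z with '-'. LC_ALL=C pins the locale in which the range
# expression [a-z] is actually specified.
mask_non_lower() {
    LC_ALL=C sed 's/[^a-z]/-/g'
}

# 'ch' is two ordinary characters here, so the line passes through unchanged.
printf '%s\n' 'abcdefghchijklmnopqrstuvwxyz' | mask_non_lower

# Uppercase letters and digits fall outside the range and are masked.
printf '%s\n' 'abcXYZ019' | mask_non_lower   # prints abc------
```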

The behaviour we see here is a result of upstream commit 159738548130d5ac4fe6178977e940ed5f8cfdc4 to support language collation across all languages after the ISO 14651 update (Unicode 9).

For example:
echo 'abcčČdefghchcHChCHijklmnopqrřŘsšŠtuvwxyzžŽ' | LC_ALL=cs_CZ.UTF-8 sed 's/[^a-z]/-/g'
abc--defgh----ijklmnopqr--s--tuvwxyz--

This is the expected behaviour in RHEL 8. Note that all 4 digraph variants (ch, cH, Ch, CH) are counted correctly: each is treated as a single collating element, so only 4 dashes are present for them.

glibc changes the collation element ordering (CEO) for <c-caron>, <ch-digraph>, <r-caron>, <s-caron>, and <z-caron>, and so these 5 symbols will no longer sort within [a-z]. Making them sort within [a-z] would require rewriting the cs_CZ locale so that the CEO includes the above symbols along with all the other expected symbols in the range. That would duplicate all symbols into the cs_CZ locale, making it significantly bigger and taking up more memory. This is not the direction we want to go.

In Bug 1601681 we fixed the most serious CEO issues by fixing the collation element ordering for the Latin symbol set, but this does not include digraphs. Again, we fixed Latin symbols because there was an expectation that [a-z] did not include any of A-Z, like it *might* in some collations.

Given that we are moving to rational ranges for *all* languages, I don't see that we want to fix this. Instead we want to leave it as-is, because this is the behaviour you will get with rational ranges: [a-z], [A-Z], and [0-9] will only include the symbols 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', and '0123456789' respectively.

In summary:
- Range expressions like [a-z] are unspecified for any locale but POSIX.
- We want to move towards rational ranges, and 'ch' will not match [a-z] with a rational range either.
- The test bz587360-digraph-matching-differs-across-archs tests for undefined behaviour. It should be adjusted to look for the same result across all arches, and not a particular result.
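
One possible shape for the adjusted check is sketched below. It is not the actual test: it assumes each architecture has already written its sed output to a file, and it only asks whether those files are identical. The helper name all_same and the result.* file names are hypothetical.

```shell
# all_same FILE... (hypothetical helper): succeed only if every file is
# byte-identical to the first one. No particular content is expected.
all_same() {
    first=$1
    shift
    for f in "$@"; do
        cmp -s "$first" "$f" || return 1
    done
    return 0
}

# Assumed usage, with one result file per architecture:
#   all_same result.x86_64 result.aarch64 result.ppc64le result.s390x \
#       || echo 'FAIL: results differ across arches'
```

Comparing results against each other rather than against a fixed string keeps the test valid whichever way the locale happens to order the digraph.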

Comment 3 Carlos O'Donell 2018-12-01 03:51:30 UTC

Notes:

If you need a-z *and* ch, then you need to use individual collation elements:

echo 'aAᴂbBcCčČdDeEfFgGhHchcHChCHiIjJkKlLmMnNoOpPqQrRřŘsSšŠtTuUvVwWxXyYzZžŽ' | LC_ALL=cs_CZ.UTF-8 sed 's/[^a-z[.ch.]A-Z]/-/g'
aAᴂbBcC--dDeEfFgGhHch---iIjJkKlLmMnNoOpPqQrR--sS--tTuUvVwWxXyYzZ--

I included <U1D02> to show that *other* special characters are still correctly sorted within a-z, it's just that the 5 symbols listed in comment #2 are no longer within that set. Again, I don't plan to fix this for RHEL 8.0 because we want glibc to support rational ranges, most likely in the form of code point ranges.
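
A script that relies on [.ch.] probably wants to check for the locale first, since the collating symbol only exists in locales that define it and sed is expected to reject the expression elsewhere. The following is a rough sketch under that assumption; has_locale is a hypothetical helper, and no output is claimed for the cs_CZ branch because it depends on the installed glibc.

```shell
# has_locale NAME (hypothetical helper): succeed if NAME appears in
# `locale -a`. Names are compared case-insensitively with hyphens
# removed, because glibc typically lists cs_CZ.UTF-8 as cs_CZ.utf8.
has_locale() {
    command -v locale >/dev/null 2>&1 || return 1
    want=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -d '-')
    locale -a 2>/dev/null | tr '[:upper:]' '[:lower:]' | tr -d '-' |
        grep -xF -- "$want" >/dev/null
}

if has_locale cs_CZ.UTF-8; then
    # List the digraph explicitly as a collating symbol next to a-z.
    printf '%s\n' 'abcdefghchijklmnopqrstuvwxyz' |
        LC_ALL=cs_CZ.UTF-8 sed 's/[^a-z[.ch.]]/-/g'
else
    echo 'cs_CZ.UTF-8 is not installed; skipping the [.ch.] example' >&2
fi
```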

One consideration is that br_FR is the only other locale to use a ch digraph, and so we might promote the ch digraphs to iso14651_t1_common, and fix this issue for cs_CZ (leaving it broken for br_FR).

echo 'aAᴂbBcCčČdDeEfFgGhHchcHChCHiIjJkKlLmMnNoOpPqQrRřŘsSšŠtTuUvVwWxXyYzZžŽ' | LC_ALL=br_FR.UTF-8 sed 's/[^a-zA-Z]/-/g'
aAᴂbBcCčČdDeEfFgGhH----iIjJkKlLmMnNoOpPqQrRřŘsSšŠtTuUvVwWxXyYzZ--

Again, this is more work, and we want to go in the other direction, towards rational ranges.