Bug 1653745

Summary: glibc: Czech digraph 'ch' no longer in range [a-z]
Product: Red Hat Enterprise Linux 8
Reporter: Sergey Kolosov <skolosov>
Component: glibc
Assignee: Carlos O'Donell <codonell>
Status: CLOSED NOTABUG
QA Contact: qe-baseos-tools-bugs
Severity: medium
Docs Contact:
Priority: medium
Version: 8.0
CC: ashankar, codonell, dj, fweimer, mnewsome, ohudlick, pbonzini, pfrankli, schwab
Target Milestone: rc
Target Release: 8.0
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 587360
Environment:
Last Closed: 2018-12-11 16:35:20 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 587360
Bug Blocks:

Comment 1 Florian Weimer 2018-11-27 14:39:49 UTC
echo abcdefghchijklmnopqrstuvwxyz | LC_ALL=cs_CZ.UTF-8 sed 's/[^a-z]/-/g'

should print:

abcdefghchijklmnopqrstuvwxyz

and not:

abcdefgh-ijklmnopqrstuvwxyz

The latter is the result on glibc-2.28-35.el8.

This is not a regression; the regression test did not run in the correct environment before.

Comment 2 Carlos O'Donell 2018-12-01 03:00:51 UTC
The test bz587360-digraph-matching-differs-across-archs must be fixed to check that the result is the same across all arches, *not* that it matches one particular result.

Ranges are only defined for the C/POSIX locales, and are unspecified for any other locales including cs_CZ.UTF-8.

The behaviour we see here is a result of upstream commit 159738548130d5ac4fe6178977e940ed5f8cfdc4 to support language collation across all languages after the ISO 14651 update (Unicode 9).

For example:
echo 'abcčČdefghchcHChCHijklmnopqrřŘsšŠtuvwxyzžŽ' | LC_ALL=cs_CZ.UTF-8 sed 's/[^a-z]/-/g'
abc--defgh----ijklmnopqr--s--tuvwxyz--

This is the expected behaviour in RHEL 8. Note that each of the 4 digraph forms (ch, cH, Ch, CH) is matched as a single collating element, so only 4 dashes appear in their place.

glibc changes the collation element ordering (CEO) for <c-caron>, <ch-digraph>, <r-caron>, <s-caron>, and <z-caron>, and so these 5 symbols will no longer sort within [a-z]. Making them sort within [a-z] would require rewriting the cs_CZ locale so that its CEO includes the above symbols along with all the other expected symbols in the range. That would duplicate all symbols into the cs_CZ locale, making it significantly bigger and taking up more memory. This is not the direction we want to go.

In Bug 1601681 we fixed the most serious CEO issues by fixing the collation element ordering for the Latin symbol set, but this does not include digraphs. Again, we fixed Latin symbols because there was an expectation that [a-z] did not include any of A-Z, as it *might* in some collations.

Given that we are moving to rational ranges for *all* languages, I don't think we want to fix this; instead we want to leave it as-is, because this is the behaviour you will get with rational ranges. For example, [a-z], [A-Z], and [0-9] will only include the symbols 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', and '0123456789', respectively.
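
As a rough illustration of what a code-point ("rational") range means, the following Python sketch may help. It is not glibc code: Python's re module simply treats [a-z] as the code points U+0061 to U+007A regardless of locale, which is the assumption made here about how rational ranges would behave. The function name mask_outside_ascii_lower is made up for this example.

```python
import re


def mask_outside_ascii_lower(text: str) -> str:
    """Replace every character outside the code-point range a-z with '-'.

    Python's re module compares code points and ignores the locale, so
    [a-z] here covers exactly the 26 ASCII lowercase letters.
    """
    return re.sub(r"[^a-z]", "-", text)


# Same sample string as in the sed example above.
sample = "abcčČdefghchcHChCHijklmnopqrřŘsšŠtuvwxyzžŽ"
print(mask_outside_ascii_lower(sample))
```

In this sketch the caron letters and all uppercase letters are replaced, while a lowercase "ch" survives only because 'c' and 'h' are each inside the range as separate characters. Python has no notion of a multi-character collating element, so how glibc would treat the <ch-digraph> element itself under rational ranges is not modelled here.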

In summary:
- Range expressions like [a-z] are unspecified for any locale but POSIX.
- We want to move towards rational ranges, and with rational ranges 'ch' will not match [a-z] either.
- The test bz587360-digraph-matching-differs-across-archs tests for unspecified behaviour; it should be adjusted to look for the same result across all arches, and not a particular result.
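
The adjusted check suggested in the last point could look roughly like the Python sketch below. The helper name, the architecture names, and the way the outputs are collected are all assumptions made for illustration; the real test harness is not shown in this bug. The output string is the one reported in comment 1.

```python
def same_result_everywhere(results_by_arch: dict) -> bool:
    """Return True when every architecture produced the same output.

    The check deliberately does not compare against a fixed expected
    string, because the result of [a-z] is unspecified outside the
    C/POSIX locales.
    """
    return len(set(results_by_arch.values())) <= 1


# Hypothetical outputs of the sed command, collected per architecture.
results = {
    "x86_64": "abcdefgh-ijklmnopqrstuvwxyz",
    "aarch64": "abcdefgh-ijklmnopqrstuvwxyz",
    "ppc64le": "abcdefgh-ijklmnopqrstuvwxyz",
    "s390x": "abcdefgh-ijklmnopqrstuvwxyz",
}
assert same_result_everywhere(results)
```

A test written this way keeps passing if the collation data changes again, as long as all arches change together.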

Comment 3 Carlos O'Donell 2018-12-01 03:51:30 UTC
Notes:

If you need a-z *and* ch, then you need to use individual collation elements:

echo 'aAᴂbBcCčČdDeEfFgGhHchcHChCHiIjJkKlLmMnNoOpPqQrRřŘsSšŠtTuUvVwWxXyYzZžŽ' | LC_ALL=cs_CZ.UTF-8 sed 's/[^a-z[.ch.]A-Z]/-/g'
aAᴂbBcC--dDeEfFgGhHch---iIjJkKlLmMnNoOpPqQrR--sS--tTuUvVwWxXyYzZ--

I included <U1D02> to show that *other* special characters are still correctly sorted within a-z, it's just that the 5 symbols listed in comment #2 are no longer within that set. Again, I don't plan to fix this for RHEL 8.0 because we want glibc to support rational ranges, most likely in the form of code point ranges.

One consideration is that br_FR is the only other locale to use a ch digraph, and so we might promote the ch digraphs to iso14651_t1_common, and fix this issue for cs_CZ (leaving it broken for br_FR).

echo 'aAᴂbBcCčČdDeEfFgGhHchcHChCHiIjJkKlLmMnNoOpPqQrRřŘsSšŠtTuUvVwWxXyYzZžŽ' | LC_ALL=br_FR.UTF-8 sed 's/[^a-zA-Z]/-/g'
aAᴂbBcCčČdDeEfFgGhH----iIjJkKlLmMnNoOpPqQrRřŘsSšŠtTuUvVwWxXyYzZ--

Again, this is more work, and we want to go in the other direction, towards rational ranges.