Bug 1653745 - glibc: Czech digraph 'ch' no longer in range [a-z]
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: glibc
Version: 8.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 8.0
Assignee: Carlos O'Donell
QA Contact: qe-baseos-tools-bugs
URL:
Whiteboard:
Depends On: 587360
Blocks:
Reported: 2018-11-27 14:32 UTC by Sergey Kolosov
Modified: 2023-07-20 12:55 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 587360
Environment:
Last Closed: 2018-12-11 16:35:20 UTC
Type: ---
Target Upstream Version:
Embargoed:


Links
System            ID       Private  Priority  Status  Summary  Last Updated
Red Hat Bugzilla  1601681  1        None      None    None     2023-07-18 14:30:35 UTC
Sourceware        23393    0        None      None    None     2019-01-03 15:08:55 UTC

Internal Links: 1601681

Comment 1 Florian Weimer 2018-11-27 14:39:49 UTC
echo abcdefghchijklmnopqrstuvwxyz | LC_ALL=cs_CZ.UTF-8 sed 's/[^a-z]/-/g'

should print:

abcdefghchijklmnopqrstuvwxyz

and not:

abcdefgh-ijklmnopqrstuvwxyz

The latter is the result on glibc-2.28-35.el8.

This is not a regression; the regression test did not run in the correct environment before.

Comment 2 Carlos O'Donell 2018-12-01 03:00:51 UTC
The test bz587360-digraph-matching-differs-across-archs must be fixed to look for the same result across all arches, *not* a particular result.

Ranges are only defined for the C/POSIX locales, and are unspecified for any other locales including cs_CZ.UTF-8.
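For comparison, here is a minimal sketch of the substitution from comment 1 run under the C locale, where range expressions are defined. The variable name c_locale_result is only illustrative. The input is plain ASCII lowercase, so nothing should be replaced:

```shell
# Run the substitution from comment 1 under the C locale, where [a-z]
# is well defined and covers exactly the 26 ASCII lowercase letters.
c_locale_result=$(echo abcdefghchijklmnopqrstuvwxyz | LC_ALL=C sed 's/[^a-z]/-/g')
echo "$c_locale_result"
# prints: abcdefghchijklmnopqrstuvwxyz
```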

The behaviour we see here is a result of upstream commit 159738548130d5ac4fe6178977e940ed5f8cfdc4 to support language collation across all languages after the ISO 14651 update (Unicode 9).

For example:
echo 'abcčČdefghchcHChCHijklmnopqrřŘsšŠtuvwxyzžŽ' | LC_ALL=cs_CZ.UTF-8 sed 's/[^a-z]/-/g'
abc--defgh----ijklmnopqr--s--tuvwxyz--

This is the expected behaviour in RHEL 8. Note that each of the 4 digraphs is matched as a single collating element, so only 4 dashes replace them.

glibc changes the collation element ordering (CEO) for <c-caron>, <ch-digraph>, <r-caron>, <s-caron>, and <z-caron>, so these 5 symbols no longer sort within [a-z]. Making them sort within [a-z] would require rewriting the cs_CZ locale so that its CEO includes the above symbols along with all the other expected symbols in the range. That would duplicate all symbols into the cs_CZ locale, making it significantly bigger and taking up more memory. This is not the direction we want to go.

In Bug 1601681 we fixed the most serious CEO issues by fixing the collation element ordering for the Latin symbol set, but this does not include digraphs. Again, we fixed Latin symbols because there was an expectation that [a-z] did not include any of A-Z, as it *might* in some collations.

Given that we are moving to rational ranges for *all* languages, I don't think we want to fix this. Instead we want to leave it as-is, because this is the behaviour you will get with rational ranges: [a-z], [A-Z], and [0-9] will only include the symbols 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', and '0123456789' respectively.

In summary:
- Range expressions like [a-z] are unspecified for any locale but POSIX.
- We want to move towards rational ranges, and 'ch' will not match [a-z] under a rational range either.
- The test bz587360-digraph-matching-differs-across-archs tests for undefined behaviour. It should be adjusted to look for the same result across all arches, and not a particular result.
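
The test adjustment described in the last point could be sketched roughly as follows. This is only an illustration, not the actual bz587360 test: the function names digraph_result and same_result, and the idea of recording one result file per arch, are assumptions. The check passes when the outputs recorded on two arches are byte-identical, whatever that output is.

```shell
# Hypothetical helper: produce the sed output for the digraph input
# under the Czech locale (assumes cs_CZ.UTF-8 is installed).
digraph_result() {
    echo abcdefghchijklmnopqrstuvwxyz | LC_ALL=cs_CZ.UTF-8 sed 's/[^a-z]/-/g'
}

# Hypothetical helper: succeed if two recorded result files are identical.
# It does not care which result was produced, only that both arches agree.
same_result() {
    cmp -s "$1" "$2"
}

# Usage sketch (file names are placeholders):
#   digraph_result > "result.$(uname -m)"
#   same_result result.x86_64 result.s390x
```

Comparing recorded outputs rather than matching a fixed string keeps the test valid regardless of which collation behaviour the installed glibc implements.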

Comment 3 Carlos O'Donell 2018-12-01 03:51:30 UTC
Notes:

If you need a-z *and* ch, then you need to use individual collation elements:

echo 'aAᴂbBcCčČdDeEfFgGhHchcHChCHiIjJkKlLmMnNoOpPqQrRřŘsSšŠtTuUvVwWxXyYzZžŽ' | LC_ALL=cs_CZ.UTF-8 sed 's/[^a-z[.ch.]A-Z]/-/g'
aAᴂbBcC--dDeEfFgGhHch---iIjJkKlLmMnNoOpPqQrR--sS--tTuUvVwWxXyYzZ--

I included <U1D02> to show that *other* special characters are still correctly sorted within a-z, it's just that the 5 symbols listed in comment #2 are no longer within that set. Again, I don't plan to fix this for RHEL 8.0 because we want glibc to support rational ranges, most likely in the form of code point ranges.

One consideration is that br_FR is the only other locale to use a ch digraph, and so we might promote the ch digraphs to iso14651_t1_common, and fix this issue for cs_CZ (leaving it broken for br_FR).

echo 'aAᴂbBcCčČdDeEfFgGhHchcHChCHiIjJkKlLmMnNoOpPqQrRřŘsSšŠtTuUvVwWxXyYzZžŽ' | LC_ALL=br_FR.UTF-8 sed 's/[^a-zA-Z]/-/g'
aAᴂbBcCčČdDeEfFgGhH----iIjJkKlLmMnNoOpPqQrRřŘsSšŠtTuUvVwWxXyYzZ--

Again this is more work and we want to go in the other direction, towards rational ranges.

