Bug 1361965

Summary:	glibc: Fix C.UTF-8 locale source ellipsis expressions
Product:	Red Hat Enterprise Linux 8	Reporter:	gustavo panizzo <gfa> <gfa>
Component:	glibc	Assignee:	Carlos O'Donell <codonell>
Status:	CLOSED ERRATA	QA Contact:	qe-baseos-tools-bugs
Severity:	low	Docs Contact:	Oss Tikhomirova <otikhomi>
Priority:	medium
Version:	8.2	CC:	ashankar, ayadav, cheimes, codonell, cww, dj, fweimer, igeorgex, jan.steffens, mcepl, mcermak, mfabian, mnewsome, myllynen, otikhomi, pachoramos1, patalber, pfrankli, sean+rh, skolosov, vslavik
Target Milestone:	rc	Keywords:	FutureFeature
Target Release:	8.0
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	glibc-2.28-93	Doc Type:	Bug Fix
Doc Text:	.C.UTF-8 locale source ellipsis expressions in `glibc` are fixed Previously, a defect in the C.UTF-8 source locale resulted in all Unicode code points above U+10000 lacking collation weights. As a consequence, all code points above U+10000 did not collate as expected. The C.UTF-8 source locale has been corrected, and the newly compiled binary locale now has collation weights for all Unicode code points. The compiled C.UTF-8 locale is 5.3MiB larger as a result of this fix.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-04-28 16:50:14 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	902094, 1646785
Bug Blocks:	1471969, 1594286, 1746913, 1755139

Description gustavo panizzo <gfa> 2016-08-01 05:36:52 UTC

Description of problem:
On Debian I usually set my scripts to use C.UTF-8 so that I get C-style sorting while still using UTF-8; it works everywhere.
Can you backport that locale to RHEL 7?

Version-Release number of selected component (if applicable):
glibc-common-2.17-105.el7

How reproducible:

$ LC_ALL=C.UTF-8 ls
-bash: warning: setlocale: LC_ALL: cannot change locale (C.UTF-8): No such file or directory
-bash: warning: setlocale: LC_ALL: cannot change locale (C.UTF-8)
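
The same check can be scripted. The following is a minimal sketch, not part of the original report, that asks the C library (through Python's `locale` module) whether a locale name is accepted; the helper name `locale_available` is made up for illustration.

```python
import locale


def locale_available(name: str) -> bool:
    """Return True if setlocale() accepts `name` for LC_ALL on this system."""
    # Calling setlocale() without a value only queries the current setting.
    saved = locale.setlocale(locale.LC_ALL)
    try:
        locale.setlocale(locale.LC_ALL, name)
        return True
    except locale.Error:
        # Raised when the C library cannot load the requested locale.
        return False
    finally:
        # Always restore whatever was active before the probe.
        locale.setlocale(locale.LC_ALL, saved)


if __name__ == "__main__":
    for candidate in ("C", "C.UTF-8"):
        state = "available" if locale_available(candidate) else "missing"
        print(candidate, state)
```

On a system like the one in this report, the `C.UTF-8` probe would be expected to report the locale as missing, while the built-in `C` locale is always accepted.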


Comment 1 gustavo panizzo <gfa> 2016-08-01 05:37:22 UTC

Debian's bug
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=636086

Fedora's bug
https://bugzilla.redhat.com/show_bug.cgi?id=902094

Comment 3 Florian Weimer 2016-09-12 09:22:59 UTC

*** Bug 1365486 has been marked as a duplicate of this bug. ***

Comment 9 Carlos O'Donell 2017-10-18 20:42:31 UTC

We have run into a problem during acceptance testing of the C.UTF-8 locale, particularly with regard to full code-point testing for the entire range of UTF-8 code-points which are valid (with no transliteration involved).

In particular, upstream bug 21302 [1] shows collation ordering to be incorrect and this is an upstream issue affecting all C.UTF-8 locales with no solution. We cannot deploy C.UTF-8 into RHEL7 until it sorts correctly for all code points, otherwise we risk a collation ordering change between RHEL7 and RHEL8 which impacts database table index generation (postgresql).
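
As an illustration of what "sorts correctly for all code points" could mean, here is a small sketch that assumes the target is strict code-point order (that reading is an assumption, not something this comment states). It relies on two facts: Python compares `str` values by code point, and byte-wise comparison of UTF-8 text yields the same order. The helper names are hypothetical.

```python
def codepoint_sorted(strings):
    """Sort by Unicode code point; Python compares str objects this way."""
    return sorted(strings)


def utf8_byte_sorted(strings):
    """Sort by raw UTF-8 bytes, as a byte-wise compare of encoded text would."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))


def collation_matches_codepoints(strings, key):
    """Check whether sorting with `key` reproduces code-point order."""
    return sorted(strings, key=key) == codepoint_sorted(strings)


# Samples span ASCII, the BMP, and the supplementary planes.
samples = ["z", "a", "\u00e9", "\u20ac", "\U00010000", "\U0010FFFF", "\uFFFF"]

if __name__ == "__main__":
    print(utf8_byte_sorted(samples) == codepoint_sorted(samples))
```

On a system where the C.UTF-8 locale is installed, one could call `locale.setlocale(locale.LC_COLLATE, "C.UTF-8")` and pass `locale.strxfrm` as the `key` to compare the compiled locale against this reference order.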

The team is working on a solution to this problem but it puts the delivery of a C.UTF-8 locale out of the picture for the early handoff for errata for glibc. We did not expect this level of failure from the upstream code.

Therefore I'm marking this probational devel_cond_nack=upstream, setting devel_ack?, and rhel-7.5?, while we continue to work on the problem.

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=21302

Comment 10 Carlos O'Donell 2017-11-06 23:53:01 UTC

While we have made significant progress on this issue upstream, and have identified the problem, it turns out there is still quite a bit that needs to change upstream before all the issues with C.UTF-8 are fixed. This doesn't just impact RHEL, it impacts every distribution using C.UTF-8.

I have already begun reworking the upstream code, but fixing the code point collation ordering has turned into a significant upstream cleanup. Therefore we are moving this to rhel-7.6, since the newly scoped work is too large.

commit 02eec681676c5aabf2eb13b92b1124245d19112f
Author: Carlos O'Donell <carlos>
Date:   Tue Oct 17 01:33:42 2017 -0700

    localedef: Add --no-warnings/--warnings option
    
    From localedef --help:
    
    Output control:
    ...
          --no-warnings=<warnings>   Comma-separated list of warnings to disable;
                                 supported warnings are: ascii, intcurrsym
    ...
          --warnings=<warnings>  Comma-separated list of warnings to enable;
                                 supported warnings are: ascii, intcurrsym
    
    Locales using SHIFT_JIS and SHIFT_JISX0213 character maps are not ASCII
    compatible. In order to build locales using these character maps, and
    have localedef exit with a status of 0, we add new options to localedef
    to disable or enable specific warnings. The options are --no-warnings
    and --warnings, to disable and enable specific warnings respectively.
    The options take a comma-separated list of warning names. The warning
    names are taken directly from the generated warning.  When a warning
    that can be disabled is issued it will print something like this: foo is
    not defined [--no-warnings=foo]
    
    For the initial implementation we add two controllable warnings; first
    'ascii' which is used by the localedata installation makefile target to
    install SHIFT_JIS and SHIFT_JISX0213-using locales without error; second
    'intcurrsym' which allows a program to use a non-standard international
    currency symbol without triggering a warning.  The 'intcurrsym' is
    useful in the future if country codes are added that are not in our
    current ISO 4217 list, and the user wants to avoid the warning. Having
    at least two warnings to control gives an example for how the changes
    can be extended to more warnings if required in the future.
    
    These changes allow ja_JP.SHIFT_JIS and ja_JP.SHIFT_JISX0213 to be
    compiled without warnings using --no-warnings=ascii. The
    localedata/Makefile $(INSTALL-SUPPORTED-LOCALES) target is adjusted to
    automatically add `--no-warnings=ascii` for such charmaps, and likewise
    localedata/gen-locale.sh is adjusted with similar logic.
    
    v2: Bring verbose, be_quiet, and all warning control booleans into
    record-status.c, and compile this object file to be used by locale,
    iconv, and localedef. Any users include record-status.h.
    v3: Fix an instance of boolean coercion in set_warning().

    Signed-off-by: Carlos O'Donell <carlos>
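
The option handling described in this commit message can be modelled roughly as below. This is a sketch in Python, not the localedef implementation: the function name `apply_warning_list`, the dictionary representation, the assumption that both warnings start enabled, and the choice to reject unknown names with an exception are all illustrative assumptions. Only the two warning names come from the commit message.

```python
# The two controllable warnings named in the commit message.
SUPPORTED_WARNINGS = ("ascii", "intcurrsym")


def apply_warning_list(state, names, enabled):
    """Enable or disable every warning named in a comma-separated list."""
    for name in names.split(","):
        if name not in SUPPORTED_WARNINGS:
            # Assumption: unknown names are rejected; the real tool may differ.
            raise ValueError(f"unknown warning: {name!r}")
        state[name] = enabled
    return state


# Assumption: every controllable warning starts out enabled.
state = {name: True for name in SUPPORTED_WARNINGS}

# Models `--no-warnings=ascii`, as used for the SHIFT_JIS charmaps.
apply_warning_list(state, "ascii", enabled=False)

if __name__ == "__main__":
    print(state)
```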

commit 56fa555a834c1536bf8d58c1ac6097f18f0d92b6
Author: Carlos O'Donell <carlos>
Date:   Fri Oct 13 22:44:44 2017 -0700

    localedata: Locale and test name are the same.
    
    The localedata collation test data is encoded in a particular
    character set. We rename the test data to match the full locale
    name with encoding, and adjust the Makefile and sort-test.sh
    script. This allows us to have a future C.UTF-8 test that is
    disambiguated from the built-in C locale.
    
    Signed-off-by: Carlos O'Donell <carlos>

commit 337ff3c501f0e1fadd1036b6fa2754cfbb0c29ea
Author: Carlos O'Donell <carlos>
Date:   Wed Oct 25 09:06:45 2017 -0700

    localedata: Fix unicode-gen check target.
    
    After the transition to generating a distinct file for Unicode ctype
    information e.g. i18n_ctype, the check target was left with the wrong
    target name. This patch fixes the check target and regenerates the
    files with more information than previously used, filling in the
    LC_IDENTIFICATION data.
    
    Tested on x86_64 by regenerating from Unicode source files, and
    running checks. Tested by subsequently rebuilding all locales.
    No regressions in testsuite.
    
    Signed-off-by: Carlos O'Donell <carlos>
    Reported-by: Rafal Luzynski <digitalfreak>

commit ea91c315bca91fe8d5c36f1aa1dc98d2f0ab4ef4
Author: Carlos O'Donell <carlos>
Date:   Sat Oct 14 15:38:05 2017 -0700

    locale: Don't use \n with record_verbose messages.
    
    Recorded verbose messages no longer need to pass \n in their
    message string since the record_verbose function adds \n to
    the messages (like error and warnings do also). This avoids
    seeing a double \n for verbose messages.
    
    Signed-off-by: Carlos O'Donell <carlos>

commit bc3821bb3b19646311d36c82a13b4ce5afea3508
Author: Carlos O'Donell <carlos>
Date:   Fri Oct 13 14:36:23 2017 -0700

    locale: No warning for non-symbolic character (bug 22295)
    
    In "Is it OK to write ASCII strings directly into locale source files?"
    https://sourceware.org/ml/libc-alpha/2017-07/msg00807.html there is
    universal consensus that we do not have to keep writing <Uxxxx> symbolic
    characters in locale files.
    
    Ulrich Drepper's historical comment was that symbolic characters were
    used for the eventuality of converting the source files to any encoding
    system. Fast forward to today and UTF-8 is the standard. So the
    requirement of <Uxxxx> is hard to justify.
    
    Zack Weinberg's excellent scripts are coming along, and we can use these to
    find instances of human errors in the scripts:
    https://sourceware.org/ml/libc-alpha/2017-07/msg00860.html
    https://sourceware.org/ml/libc-alpha/2017-08/msg00136.html
    
    It still won't be easy to distinguish i from í, but that's still the
    case for <Uxxxx> characters which humans can't read either.
    
    Since we all agreed that we should be able to use non-symbolic (non-<Uxxxx>)
    characters in locale files, the following change removes the verbose
    warning that is raised if you use non-symbolic characters in the locale
    file.
    
    Signed-off-by: Carlos O'Donell <carlos>

commit a3e23a2c1d9e871545c6f438a41fcb8ad429cf70
Author: Carlos O'Donell <carlos>
Date:   Fri Oct 13 14:33:09 2017 -0700

    locale: Allow "" int_curr_Symbol (bug 22294)
    
    The builtin POSIX locale has "" as the international currency symbol,
    but a non-builtin locale may not have such a blank int_curr_symbol.
    
    Therefore to support non-builtin locales with similar "" int_curr_symbol
    we adjust the LC_MONETARY parser to allow the normal 4-character
    int_curr_symbol *and* the empty "" no symbol. Anything else remains
    invalid.
    
    Tested by building all the locales.  Tested also with a custom C.UTF-8
    locale with "" for int_curr_symbol.
    
    Signed-off-by: Carlos O'Donell <carlos>
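
The acceptance rule stated in this commit message (either the usual 4-character value or the empty string) can be written as a one-line predicate. The sketch below is illustrative only: the function name is hypothetical, and it checks length alone, without validating the currency code against any ISO 4217 list.

```python
def int_curr_symbol_ok(value: str) -> bool:
    """Accept the empty string or a 4-character int_curr_symbol.

    In POSIX locales the 4-character form is a three-letter currency
    code followed by one separator character; this sketch checks only
    the length.
    """
    return value == "" or len(value) == 4


if __name__ == "__main__":
    for candidate in ("", "USD ", "US", "EURO "):
        print(repr(candidate), int_curr_symbol_ok(candidate))
```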

commit f16491eb8ebbef402f3da6f4035ce70fe36dec97
Author: Carlos O'Donell <carlos>
Date:   Fri Oct 13 09:54:03 2017 -0700

    locale: Fix localedef exit code (Bug 22292)
    
    The error and warning handling in localedef, locale, and iconv
    is a bit of a mess.
    
    We use ugly constructs like this:
          WITH_CUR_LOCALE (error (1, errno, gettext ("\
    cannot read character map directory `%s'"), directory));
    
    to issue errors, and read error_message_count directly from the
    error API to detect errors. The problem with that is that the
    code also uses error to print warnings, and informative messages.
    All of this leads to problems where just having warnings will
    produce an exit status as-if errors had been seen.
    
    To fix this situation I have adopted the following high-level
    changes:
    * All errors are counted distinctly.
    * All warnings are counted distinctly.
    * All informative messages are not counted.
    * Increasing verbosity cannot generate *more* errors, and
      it previously did for errors conditional on verbose,
      this is now fixed.
    * Increasing verbosity *can* generate *more* warnings.
    * Making the output quiet cannot generate *fewer* errors,
      and it previously did for errors conditional on be_quiet,
      this is now fixed.
    * Each of error, warning, and informative message has its
      own function to call defined in record-status.h, and they
      are: record_error, record_warning, and record_verbose.
    * The record_error function always records an error, but
      conditional on be_quiet may not print it.
    * The record_warning function always records a warning,
      but conditional on be_quiet may not print it.
    * The record_verbose function only prints the verbose
      message if verbose is true and be_quiet is false.
    
    This has allowed the following fix:

    * Previously any warnings were being treated as errors
      because they incremented error_message_count, but now
      we properly return an exit status of 1 if there are
      warnings but output was generated.

    All of this allows localedef to correctly decide if errors,
    or warnings were present, and produce the correct exit code.
    
    The locale and iconv programs now also use record-status.h
    and we have removed the WITH_CUR_LOCALE hack, and instead
    have internal push_locale/pop_locale functions centralized
    in the record routines.
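
The counting rules listed in this commit message can be modelled as follows. This is a Python sketch of the described behaviour, not the C code in record-status.h. The class name, the `exit_status` method, and the `stream` parameter are illustrative. Returning 1 for warnings follows the commit message; returning 4 for errors is an assumption taken from the exit statuses documented in the localedef(1) manual page, and it ignores the case where output is still written despite errors.

```python
import sys


class StatusRecorder:
    """Count errors and warnings separately from what gets printed."""

    def __init__(self, verbose=False, be_quiet=False, stream=None):
        self.verbose = verbose
        self.be_quiet = be_quiet
        self.stream = stream if stream is not None else sys.stderr
        self.errors = 0
        self.warnings = 0

    def _emit(self, message):
        # The recorder appends the newline, so callers pass bare messages.
        print(message, file=self.stream)

    def record_error(self, message):
        self.errors += 1              # always counted
        if not self.be_quiet:         # printing is what be_quiet controls
            self._emit(message)

    def record_warning(self, message):
        self.warnings += 1            # always counted
        if not self.be_quiet:
            self._emit(message)

    def record_verbose(self, message):
        # Informative messages are never counted.
        if self.verbose and not self.be_quiet:
            self._emit(message)

    def exit_status(self):
        if self.errors:
            return 4                  # assumption, see the text above
        if self.warnings:
            return 1                  # warnings, but output was generated
        return 0
```

Because the counters are updated before the `be_quiet` check, silencing the output cannot change the exit status, which is the property the commit message describes.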

commit 8dc8be75d2afb7ebaf55f7609b301e5c6b8692e5
Author: Carlos O'Donell <carlos>
Date:   Thu Oct 12 23:52:14 2017 -0700

    localedata: Reorganize Unicode LC_CTYPE inclusion.
    
    The commit does the following things:
    
    * Move non-transliteration Unicode generated data to i18n_ctype.
    * Copy the i18n_ctype data into i18n and add transliteration.
    
    In the future, any locale which needs Unicode LC_CTYPE data can
    also just use `copy i18n_ctype` and get the base character classes
    and maps without transliteration.
    
    Tested by compiling all the locales and my prototype C.UTF-8 which
    uses it.
    
    Signed-off-by: Carlos O'Donell <carlos>


Comment 12 Carlos O'Donell 2018-02-27 17:22:22 UTC

This issue is included in our Red Hat Enterprise Linux 7.6 review, and the robust mutex fixes will be considered for backporting here after we evaluate the risk and depth of the backport.

Comment 14 Carlos O'Donell 2018-05-08 16:17:06 UTC

Work on this issue continues upstream since a full C.UTF-8 locale is going upstream before being backported. The current status upstream is that C.UTF-8 causes some problems with certain locales for full code-point sorting. This is still under review.

Comment 15 Florian Weimer 2018-06-13 08:23:49 UTC

*** Bug 1590680 has been marked as a duplicate of this bug. ***

Comment 19 Carlos O'Donell 2019-06-07 03:55:25 UTC

Given that RHEL 7 is entering maintenance phase 1 at the end of 2019, and this issue doesn't have an upstream resolution yet, I'm moving this bug to RHEL 8.0. In RHEL 8.0 we inherited the C.UTF-8 locale from Fedora and this needs fixing. We should fix the code-point issue at a minimum.

Comment 20 Carlos O'Donell 2019-08-29 13:40:44 UTC

The C.UTF-8 locale is already in RHEL 8.0.

We are using this bug to track the fix to the ellipsis expressions in the locale sources, which need to be correctly specified to provide the full code-point ranges.
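
To give a sense of scale for the ranges involved, the sketch below expands a `<Uxxxx>..<Uyyyy>` style range into code points. It is a simplified, hypothetical model: the two expressions at the bottom are made-up examples rather than lines from the C locale source, the parser accepts only 4-digit and 8-digit hexadecimal symbols, and it makes no attempt to skip surrogates or unassigned code points.

```python
import re

# Matches <Uxxxx> or <Uxxxxxxxx> with hexadecimal digits.
_SYMBOL = re.compile(r"<U([0-9A-Fa-f]{4}|[0-9A-Fa-f]{8})>")


def parse_symbol(symbol: str) -> int:
    """Convert a symbolic name such as <U00010000> to its code point."""
    match = _SYMBOL.fullmatch(symbol)
    if match is None:
        raise ValueError(f"not a <Uxxxx> symbol: {symbol!r}")
    return int(match.group(1), 16)


def expand_ellipsis(expr: str) -> range:
    """Expand 'first..last' into the inclusive range of code points."""
    first, sep, last = expr.partition("..")
    if not sep:
        raise ValueError(f"no ellipsis in {expr!r}")
    lo, hi = parse_symbol(first), parse_symbol(last)
    if lo > hi:
        raise ValueError(f"empty range: {expr!r}")
    return range(lo, hi + 1)


# Hypothetical expressions, chosen only to show the sizes involved.
bmp = expand_ellipsis("<U0000>..<UFFFF>")
beyond_bmp = expand_ellipsis("<U00010000>..<U0010FFFF>")

if __name__ == "__main__":
    print(len(bmp), len(beyond_bmp))
```

The second range alone covers more than a million code points, which is consistent with the compiled locale growing by several MiB once those code points receive collation weights.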

Comment 27 Sergey Kolosov 2020-01-21 15:15:43 UTC

Verified, the bug has been fixed in glibc-2.28-93.el8

Comment 28 Oss Tikhomirova 2020-03-24 00:07:42 UTC

Hi Carlos,

I'm collecting the RHEL 8.2 release notes. Thank you for providing this perfect release note text. I’ve only changed a couple of words to meet style guides requirements. Please let me know if you feel like rephrasing or adding anything.



.C.UTF-8 locale source ellipsis expressions in `glibc` are fixed

A defect in the C.UTF-8 source locale resulted in all Unicode code points above U+10000 lacking collation weights. As a consequence, all code points above U+10000 did not collate as expected. The C.UTF-8 source locale has been corrected, and the newly compiled binary locale now has collation weights for all Unicode code points. The compiled C.UTF-8 locale is 5.3MiB larger as a result of this fix.

Comment 29 Carlos O'Donell 2020-03-24 17:02:29 UTC

(In reply to Oss Tikhomirova from comment #28)
> Hi Carlos,
> 
> I'm collecting the RHEL 8.2 release notes. Thank you for providing this
> perfect release note text. I’ve only changed a couple of words to meet style
> guides requirements. Please let me know if you feel like rephrasing or
> adding anything.
> 
> 
> 
> .C.UTF-8 locale source ellipsis expressions in `glibc` are fixed
> 
> A defect in the C.UTF-8 source locale resulted in all Unicode code points
> above U+10000 lacking collation weights. As a consequence, all code points
> above U+10000 did not collate as expected. The C.UTF-8 source locale has
> been corrected, and the newly compiled binary locale now has collation
> weights for all Unicode code points. The compiled C.UTF-8 locale is 5.3MiB
> larger as a result of this fix.

Looks perfect.

Comment 31 errata-xmlrpc 2020-04-28 16:50:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1828