|Summary:||glibc: Fix C.UTF-8 locale source ellipsis expressions|
|Product:||Red Hat Enterprise Linux 8||Reporter:||gustavo panizzo <gfa>|
|Component:||glibc||Assignee:||Carlos O'Donell <codonell>|
|Status:||VERIFIED ---||QA Contact:||qe-baseos-tools|
|Version:||8.2||CC:||ashankar, ayadav, cheimes, codonell, cww, dj, fweimer, igeorgex, jan.steffens, mcepl, mcermak, mfabian, mnewsome, myllynen, pachoramos1, patalber, pfrankli, sean+rh, skolosov, vslavik|
|Fixed In Version:||glibc-2.28-93||Doc Type:||Bug Fix|
A defect in the C.UTF-8 source locale resulted in all Unicode code-points above U+10000 lacking collation weights. As a result all code points above U+10000 do not collate as expected. The C.UTF-8 source locale has been corrected and the newly compiled binary locale now has collation weights for all Unicode code-points. The compiled C.UTF-8 locale is 5.3MiB larger as a result of this fix.
|Bug Depends On:||902094, 1646785|
|Bug Blocks:||1594286, 1746913, 1755139, 1471969|
Description gustavo panizzo <gfa> 2016-08-01 05:36:52 UTC
Description of problem:
I usually set my scripts to use C.UTF-8 on Debian so that I get C-style sorting while using UTF-8; it works everywhere. Can you backport that locale to RHEL 7?

Version-Release number of selected component (if applicable):
glibc-common-2.17-105.el7

How reproducible:
$ LC_ALL=C.UTF-8 ls
-bash: warning: setlocale: LC_ALL: cannot change locale (C.UTF-8): No such file or directory
-bash: warning: setlocale: LC_ALL: cannot change locale (C.UTF-8)

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
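A script can probe for the locale before relying on it instead of hitting the setlocale warning shown above. The following is a minimal sketch in Python, not part of the original report; the helper name `locale_available` is hypothetical, and it relies only on the standard `locale` module.

```python
import locale


def locale_available(name: str) -> bool:
    """Return True if setlocale() accepts `name` on this system.

    Hypothetical helper for illustration. The process locale is restored
    afterwards so the probe has no lasting side effect.
    """
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        locale.setlocale(locale.LC_ALL, name)
        return True
    except locale.Error:
        # Same condition bash reports as "cannot change locale".
        return False
    finally:
        locale.setlocale(locale.LC_ALL, saved)


if __name__ == "__main__":
    for candidate in ("C.UTF-8", "C"):
        state = "available" if locale_available(candidate) else "missing"
        print(f"{candidate}: {state}")
```

On a system without the locale installed, such as the glibc-common-2.17-105.el7 package named in the report, the probe for C.UTF-8 would be expected to report it as missing, and a script could then fall back to plain C.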
Comment 1 gustavo panizzo <gfa> 2016-08-01 05:37:22 UTC
Debian's bug https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=636086 Fedora's bug https://bugzilla.redhat.com/show_bug.cgi?id=902094
Comment 3 Florian Weimer 2016-09-12 09:22:59 UTC
*** Bug 1365486 has been marked as a duplicate of this bug. ***
Comment 9 Carlos O'Donell 2017-10-18 20:42:31 UTC
We have run into a problem during acceptance testing of the C.UTF-8 locale, particularly with regard to full code-point testing for the entire range of UTF-8 code-points which are valid (with no transliteration involved). In particular, upstream bug 21302 (https://sourceware.org/bugzilla/show_bug.cgi?id=21302) shows collation ordering to be incorrect, and this is an upstream issue affecting all C.UTF-8 locales with no solution.

We cannot deploy C.UTF-8 into RHEL7 until it sorts correctly for all code points, otherwise we risk a collation ordering change between RHEL7 and RHEL8 which impacts database table index generation (postgresql).

The team is working on a solution to this problem, but it puts the delivery of a C.UTF-8 locale out of the picture for the early handoff for errata for glibc. We did not expect this level of failure from the upstream code.

Therefore I'm marking this probational devel_cond_nack=upstream, setting devel_ack?, and rhel-7.5?, while we continue to work on the problem.
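To illustrate what "sorts correctly for all code points" means here: the order wanted from C.UTF-8 is plain code-point order, which for valid UTF-8 is the same as byte order of the encoded strings. The sketch below is in Python and is not taken from the glibc test suite; the helper names and sample strings are made up for illustration. The second function compares a locale's `strxfrm` ordering against code-point order and returns `None` when the locale is not installed.

```python
import locale


def codepoint_key(s: str) -> list:
    """Sort key: the sequence of Unicode code points in the string."""
    return [ord(ch) for ch in s]


def utf8_key(s: str) -> bytes:
    """Sort key: the UTF-8 encoding, compared byte by byte (strcmp style)."""
    return s.encode("utf-8")


# Illustrative samples, including code points at and above U+10000.
samples = ["z", "\u00e9", "\U0001F600", "a", "\uFFFD", "\U00010000", "~"]

by_codepoint = sorted(samples, key=codepoint_key)
by_bytes = sorted(samples, key=utf8_key)


def locale_matches_codepoint_order(name: str, strings: list):
    """Return True/False if the locale's collation agrees with code-point
    order for `strings`, or None if the locale cannot be selected."""
    saved = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None
    try:
        collated = sorted(strings, key=locale.strxfrm)
        return collated == sorted(strings, key=codepoint_key)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)
```

Calling `locale_matches_codepoint_order("C.UTF-8", samples)` on a given system is one rough way to spot the kind of mismatch discussed in this comment; a small sample like this cannot show that a locale is correct, only that it is wrong.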
Comment 10 Carlos O'Donell 2017-11-06 23:53:01 UTC
While we have made significant progress on this issue upstream, and have identified the problem, it turns out there is still quite a bit that needs to change upstream before all the issues with C.UTF-8 are fixed. This doesn't just impact RHEL, it impacts every distribution using C.UTF-8.

I have already begun reworking upstream to fix this issue, but fixing the code point collation sorting issue has turned into a significant upstream cleanup. Therefore we are moving this to rhel-7.6 since the newly scoped work is too large.

commit 02eec681676c5aabf2eb13b92b1124245d19112f
Author: Carlos O'Donell <email@example.com>
Date: Tue Oct 17 01:33:42 2017 -0700

    localedef: Add --no-warnings/--warnings option

    From localedef --help:

    Output control:
    ...
      --no-warnings=<warnings>   Comma-separated list of warnings to disable;
                                 supported warnings are: ascii, intcurrsym
    ...
      --warnings=<warnings>      Comma-separated list of warnings to enable;
                                 supported warnings are: ascii, intcurrsym

    Locales using SHIFT_JIS and SHIFT_JISX0213 character maps are not ASCII compatible. In order to build locales using these character maps, and have localedef exit with a status of 0, we add a new option to localedef to disable or enable specific warnings. The options are --no-warnings and --warnings, to disable and enable specific warnings respectively. The options take a comma-separated list of warning names. The warning names are taken directly from the generated warning. When a warning that can be disabled is issued it will print something like this:

    foo is not defined [--no-warnings=foo]

    For the initial implementation we add two controllable warnings; first 'ascii' which is used by the localedata installation makefile target to install SHIFT_JIS and SHIFT_JISX0213-using locales without error; second 'intcurrsym' which allows a program to use a non-standard international currency symbol without triggering a warning. The 'intcurrsym' is useful in the future if country codes are added that are not in our current ISO 4217 list, and the user wants to avoid the warning. Having at least two warnings to control gives an example for how the changes can be extended to more warnings if required in the future.

    These changes allow ja_JP.SHIFT_JIS and ja_JP.SHIFT_JISX0213 to be compiled without warnings using --no-warnings=ascii. The localedata/Makefile $(INSTALL-SUPPORTED-LOCALES) target is adjusted to automatically add `--no-warnings=ascii` for such charmaps, and likewise localedata/gen-locale.sh is adjusted with similar logic.

    v2: Bring verbose, be_quiet, and all warning control booleans into record-status.c, and compile this object file to be used by locale, iconv, and localedef. Any users include record-status.h.

    v3: Fix an instance of boolean coercion in set_warning().

    Signed-off-by: Carlos O'Donell <firstname.lastname@example.org>

commit 56fa555a834c1536bf8d58c1ac6097f18f0d92b6
Author: Carlos O'Donell <email@example.com>
Date: Fri Oct 13 22:44:44 2017 -0700

    localedata: Locale and test name are the same.

    The localedata collation test data is encoded in a particular character set. We rename the test data to match the full locale name with encoding, and adjust the Makefile and sort-test.sh script. This allows us to have a future C.UTF-8 test that is disambiguated from the built-in C locale.

    Signed-off-by: Carlos O'Donell <firstname.lastname@example.org>

commit 337ff3c501f0e1fadd1036b6fa2754cfbb0c29ea
Author: Carlos O'Donell <email@example.com>
Date: Wed Oct 25 09:06:45 2017 -0700

    localedata: Fix unicode-gen check target.

    After the transition to generating a distinct file for Unicode ctype information e.g. i18n_ctype, the check target was left with the wrong target name. This patch fixes the check target and regenerates the files with more information than previously used, filling in the LC_IDENTIFICATION data.

    Tested on x86_64 by regenerating from Unicode source files, and running checks. Tested by subsequently rebuilding all locales. No regressions in testsuite.

    Signed-off-by: Carlos O'Donell <firstname.lastname@example.org>
    Reported-by: Rafal Luzynski <email@example.com>

commit ea91c315bca91fe8d5c36f1aa1dc98d2f0ab4ef4
Author: Carlos O'Donell <firstname.lastname@example.org>
Date: Sat Oct 14 15:38:05 2017 -0700

    locale: Don't use \n with record_verbose messages.

    Recorded verbose messages no longer need to pass \n in their message string since the record_verbose function adds \n to the messages (like error and warnings do also). This avoids seeing a double \n for verbose messages.

    Signed-off-by: Carlos O'Donell <email@example.com>

commit bc3821bb3b19646311d36c82a13b4ce5afea3508
Author: Carlos O'Donell <firstname.lastname@example.org>
Date: Fri Oct 13 14:36:23 2017 -0700

    locale: No warning for non-symbolic character (bug 22295)

    In "Is it OK to write ASCII strings directly into locale source files?" https://sourceware.org/ml/libc-alpha/2017-07/msg00807.html there is universal consensus that we do not have to keep writing <Uxxxx> symbolic characters in locale files. Ulrich Drepper's historical comment was that symbolic characters were used for the eventuality of converting the source files to any encoding system. Fast forward to today and UTF-8 is the standard. So the requirement of <Uxxxx> is hard to justify.

    Zack Weinberg's excellent scripts are coming along, and we can use these to find instances of human errors in the scripts:
    https://sourceware.org/ml/libc-alpha/2017-07/msg00860.html
    https://sourceware.org/ml/libc-alpha/2017-08/msg00136.html

    It still won't be easy to distinguish i from í, but that's still the case for <Uxxxx> characters which humans can't read either.

    Since we all agreed that we should be able to use non-symbolic (<Uxxxx>) characters in locale files, the following change removes the verbose warning that is raised if you use non-symbolic characters in the locale file.

    Signed-off-by: Carlos O'Donell <email@example.com>

commit a3e23a2c1d9e871545c6f438a41fcb8ad429cf70
Author: Carlos O'Donell <firstname.lastname@example.org>
Date: Fri Oct 13 14:33:09 2017 -0700

    locale: Allow "" int_curr_Symbol (bug 22294)

    The builtin POSIX locale has "" as the international currency symbol, but a non-builtin locale may not have such a blank int_curr_symbol. Therefore to support non-builtin locales with similar "" int_curr_symbol we adjust the LC_MONETARY parser to allow the normal 4-character int_curr_symbol *and* the empty "" no symbol. Anything else remains invalid.

    Tested by building all the locales. Tested also with a custom C.UTF-8 locale with "" for int_curr_symbol.

    Signed-off-by: Carlos O'Donell <email@example.com>

commit f16491eb8ebbef402f3da6f4035ce70fe36dec97
Author: Carlos O'Donell <firstname.lastname@example.org>
Date: Fri Oct 13 09:54:03 2017 -0700

    locale: Fix localedef exit code (Bug 22292)

    The error and warning handling in localedef, locale, and iconv is a bit of a mess. We use ugly constructs like this:

      WITH_CUR_LOCALE (error (1, errno, gettext ("\
      cannot read character map directory `%s'"), directory));

    to issue errors, and read error_message_count directly from the error API to detect errors. The problem with that is that the code also uses error to print warnings, and informative messages. All of this leads to problems where just having warnings will produce an exit status as-if errors had been seen.

    To fix this situation I have adopted the following high-level changes:

    * All errors are counted distinctly.
    * All warnings are counted distinctly.
    * All informative messages are not counted.
    * Increasing verbosity cannot generate *more* errors, and it previously did for errors conditional on verbose, this is now fixed.
    * Increasing verbosity *can* generate *more* warnings.
    * Making the output quiet cannot generate *fewer* errors, and it previously did for errors conditional on be_quiet, this is now fixed.
    * Each of error, warning, and informative message has its own function to call defined in record-status.h, and they are: record_error, record_warning, and record_verbose.
    * The record_error function always records an error, but conditional on be_quiet may not print it.
    * The record_warning function always records a warning, but conditional on be_quiet may not print it.
    * The record_verbose function only prints the verbose message if verbose is true and be_quiet is false.

    This has allowed the following fix:

    * Previously any warnings were being treated as errors because they incremented error_message_count, but now we properly return an exit status of 1 if there are warnings but output was generated.

    All of this allows localedef to correctly decide if errors, or warnings were present, and produce the correct exit code.

    The locale and iconv programs now also use record-status.h and we have removed the WITH_CUR_LOCALE hack, and instead have internal push_locale/pop_locale functions centralized in the record routines.

commit 8dc8be75d2afb7ebaf55f7609b301e5c6b8692e5
Author: Carlos O'Donell <email@example.com>
Date: Thu Oct 12 23:52:14 2017 -0700

    localedata: Reorganize Unicode LC_CTYPE inclusion.

    The commit does the following things:

    * Move non-transliteration Unicode generated data to i18n_ctype.
    * Copy the i18n_ctype data into i18n and add transliteration.

    In the future, any locale which needs Unicode LC_CTYPE data can also just use `copy i18n_ctype` and get the base character classes and maps without transliteration.

    Tested by compiling all the locales and my prototype C.UTF-8 which uses it.

    Signed-off-by: Carlos O'Donell <firstname.lastname@example.org>
    Signed-off-by: Carlos O'Donell <email@example.com>
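The counting rules described in the localedef exit-code commit above can be sketched roughly as follows. This is an illustrative Python model, not the glibc C code in record-status.c. The class name `Status` and the helper `parse_warning_list` are hypothetical, the exit value 4 for errors is an assumption (the commit text only states 0 for a clean run and 1 for warnings with output generated), and the choice to skip a disabled warning entirely is also an assumption.

```python
import sys
from dataclasses import dataclass, field

# Warning names listed in the localedef --help text quoted above.
KNOWN_WARNINGS = {"ascii", "intcurrsym"}


def parse_warning_list(arg: str) -> set:
    """Parse a comma-separated warning list, e.g. 'ascii,intcurrsym'."""
    names = {name.strip() for name in arg.split(",") if name.strip()}
    unknown = names - KNOWN_WARNINGS
    if unknown:
        raise ValueError(f"unknown warning(s): {', '.join(sorted(unknown))}")
    return names


@dataclass
class Status:
    """Hypothetical model of the error/warning bookkeeping."""
    verbose: bool = False
    be_quiet: bool = False
    disabled: set = field(default_factory=set)
    errors: int = 0
    warnings: int = 0

    def record_error(self, msg: str) -> None:
        self.errors += 1  # always counted, even when quiet
        if not self.be_quiet:
            print(f"error: {msg}", file=sys.stderr)

    def record_warning(self, msg: str, name=None) -> None:
        if name is not None and name in self.disabled:
            return  # assumption: a disabled warning is not issued at all
        self.warnings += 1  # always counted, even when quiet
        if not self.be_quiet:
            hint = f" [--no-warnings={name}]" if name else ""
            print(f"warning: {msg}{hint}", file=sys.stderr)

    def record_verbose(self, msg: str) -> None:
        # Informative messages are printed conditionally and never counted.
        if self.verbose and not self.be_quiet:
            print(msg, file=sys.stderr)

    def exit_status(self) -> int:
        if self.errors:
            return 4  # assumed value; the commit text does not state it
        if self.warnings:
            return 1  # warnings present, output still generated
        return 0
```

Keeping the two counters separate is what lets verbosity and quietness change what is printed without changing the exit status, which is the behaviour the commit message describes.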
Comment 12 Carlos O'Donell 2018-02-27 17:22:22 UTC
This issue is included in our Red Hat Enterprise Linux 7.6 review, and the robust mutex fixes will be considered for backporting here after we evaluate the risk and depth of the backport.
Comment 14 Carlos O'Donell 2018-05-08 16:17:06 UTC
Work on this issue continues upstream since a full C.UTF-8 locale is going upstream before being backported. The current status upstream is that C.UTF-8 causes some problems with certain locales for full code-point sorting. This is still under review.
Comment 15 Florian Weimer 2018-06-13 08:23:49 UTC
*** Bug 1590680 has been marked as a duplicate of this bug. ***
Comment 19 Carlos O'Donell 2019-06-07 03:55:25 UTC
Given that RHEL 7 is entering maintenance phase 1 at the end of 2019, and this issue doesn't have upstream resolution yet, I'm moving to RHEL 8.0. In RHEL 8.0 we inherited the C.UTF-8 locale from Fedora and this needs fixing. We should fix the code-point issue at a minimum.
Comment 20 Carlos O'Donell 2019-08-29 13:40:44 UTC
The C.UTF-8 locale is already in RHEL 8.0. We are using this bug to track the fix to the ellipsis expressions in the locale sources, which need to be correctly specified to provide the full code-point ranges.
Comment 27 Sergey Kolosov 2020-01-21 15:15:43 UTC
Verified, the bug has been fixed in glibc-2.28-93.el8