Description of problem:
Infinity (∞) and empty set (∅) are treated by sort and uniq as if they were the same character.

Version-Release number of selected component (if applicable):
coreutils-8.24-6.fc23.x86_64

How reproducible:
$ (echo "∅"; echo "∞"; echo "∅") | sort
∅
∞
∅
$ (echo "∅"; echo "∞"; echo "∅") | sort | uniq
∅

Steps to Reproduce:
1. Open a terminal (I use gnome-terminal)
2. Type the above commands
3. Read output

Actual results:
$ (echo "∅"; echo "∞"; echo "∅") | sort
∅
∞
∅
$ (echo "∅"; echo "∞"; echo "∅") | sort | uniq
∅

Expected results:
$ (echo "∅"; echo "∞"; echo "∅") | sort
∅
∞
$ (echo "∅"; echo "∞"; echo "∅") | sort | uniq
∅
∞

Additional info:
This is caused by strcoll(3) comparing those symbols as equal in the UTF-8 locale. I am switching the component to glibc. Minimal example attached.
Created attachment 1157805: minimal example
$ curl -JO 'https://bugzilla.redhat.com/attachment.cgi?id=1157805'
$ sh bz1336308.c
+ locale
LANG=en_US.utf8
LC_CTYPE=en_US.utf8
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE=en_US.utf8
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=
+ gcc bz1336308.c
+ ./a.out
strcoll("∞", "∅") = "0"
+ exit 0
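For anyone who wants to check a pair of characters without compiling the attachment, a rough shell sketch follows. It is not the attached test case: the helper name collate_equal is made up for illustration, and it relies on sort -u keeping only one line out of each run of lines that compare equal, so it tests collation only indirectly through sort.

```shell
# Hypothetical helper: succeeds when two *different* strings are treated
# as equal by sort in the current locale.
collate_equal() {
    # Identical strings are trivially equal and not interesting here.
    [ "$1" != "$2" ] || return 1
    # sort -u keeps one line per run of lines that compare equal, so a
    # single surviving line means the two strings compared equal.
    [ $(printf '%s\n' "$1" "$2" | sort -u | wc -l) -eq 1 ]
}

if collate_equal "∞" "∅"; then
    echo "∞ and ∅ collate as equal in this locale"
else
    echo "∞ and ∅ are distinct in this locale"
fi
```

The result depends on LANG/LC_ALL/LC_COLLATE in the calling environment, which is the point: the same pair can be distinct in one locale and equal in another.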
The en_US locale uses ISO/IEC 14651:2011 for collation.

ISO/IEC 14651:2011 doesn't contain collation rules for mathematical symbols, and neither do the European Ordering Rules (EOR); for example, strcoll("∞", "∟") = "0".

If you want strict ordering by Unicode code point then you must use the C.utf8 locale, which has forward sorting based on the code point.

I would have expected the localedata/locales/iso14651_t1_common <SPECIAL> section to cover all of the special characters we may wish to sort, including math characters. However, a quick review shows that it doesn't (despite some comments saying that it will, which are probably wrong). I'm surprised that the unspecified characters (from the UTF-8 charmap) aren't simply sorted by code point by default.

Until then, this is a question of doing the upstream work to sort all of the unsupported characters by code point, which may need some automation.

Discussion started upstream: https://www.sourceware.org/ml/libc-alpha/2016-05/msg00325.html
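As a stopgap on the user side, the code point ordering described above can be sketched in the shell. This sketch assumes plain LC_ALL=C rather than C.utf8, since the C locale is always available; it compares raw bytes, and for UTF-8 input byte order coincides with code point order. The function name sort_by_bytes is made up for illustration.

```shell
# Sort the three sample lines by raw byte value instead of locale collation.
# Any extra arguments (for example -u) are passed through to sort.
sort_by_bytes() {
    printf '%s\n' "∅" "∞" "∅" | LC_ALL=C sort "$@"
}

sort_by_bytes       # prints ∅, ∅, ∞ on three lines (U+2205 sorts before U+221E)
sort_by_bytes -u    # prints ∅, ∞ on two lines
```

The override only affects that one sort invocation, so the rest of the session keeps its normal locale.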
(In reply to Carlos O'Donell from comment #5)
> The en_US locale uses ISO/IEC 14651:2011 for collation.
>
> ISO/IEC 14651:2011 doesn't contain collation rules for mathematical symbols,
> neither do the European Ordering rules (EOR) e.g. strcoll("∞", "∟") = "0".
>
> If you want strict ordering by Unicode code point then you must use the
> C.utf8 locale which has forward sorting based on the code point.

Yes, but the C.utf8 locale does this in a quite silly way by enumerating all the code points. The LC_COLLATE part in the source of the C.utf8 locale looks like this:

LC_COLLATE
order_start forward
<U0000>
..
<UFFFF>
<U10000>
..
<U1FFFF>
<U20000>
..
<U2FFFF>
<UE0000>
..
<UEFFFF>
<UF0000>
..
<UFFFFF>
<U100000>
..
<U10FFFF>
UNDEFINED
order_end
END LC_COLLATE

If the “UNDEFINED” symbol in that LC_COLLATE definition worked as specified by POSIX, enumerating all the code points would not be needed and the binary locale would become much smaller (a few hundred kilobytes instead of 1.8 megabytes). And we could easily fix the other locales, like the en_US.utf8 locale mentioned in comment#4, by inserting an UNDEFINED in the locale’s LC_COLLATE.

Some locale sources already use UNDEFINED, but it does not work as specified. The specification says:

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html

opengroup> The symbol UNDEFINED shall be interpreted as including all
opengroup> coded character set values not specified explicitly or via
opengroup> the ellipsis symbol. Such characters shall be inserted in
opengroup> the character collation order at the point indicated by the
opengroup> symbol, and in ascending order according to their coded
opengroup> character set values. If no UNDEFINED symbol is specified,
opengroup> and the current coded character set contains characters not
opengroup> specified in this section, the utility shall issue a
opengroup> warning message and place such characters at the end of the
opengroup> character collation order.
But it does not work like that. I reported a bug about this a while ago: https://sourceware.org/bugzilla/show_bug.cgi?id=18978
This message is a reminder that Fedora 23 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 23. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '23'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 23 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Still reproducible with glibc-2.24-3.fc25.
Still reproducible with glibc-2.26.9000-26.fc28.
(In reply to Kamil Dudka from comment #10) > Still reproducible with glibc-2.26.9000-26.fc28. Reproducible with C.UTF-8? I expect the answer is yes until we fix the bugs in Fedora's C.UTF-8, but I wanted to double check.
(In reply to Carlos O'Donell from comment #11)
> (In reply to Kamil Dudka from comment #10)
> > Still reproducible with glibc-2.26.9000-26.fc28.
>
> Reproducible with C.UTF-8?

Good question. It is *not* reproducible with C.UTF-8. I was trying it with en_US.UTF-8 as in comment #4.
(In reply to Kamil Dudka from comment #12)
> (In reply to Carlos O'Donell from comment #11)
> > (In reply to Kamil Dudka from comment #10)
> > > Still reproducible with glibc-2.26.9000-26.fc28.
> >
> > Reproducible with C.UTF-8?
>
> Good question. It is *not* reproducible with C.UTF-8. I was trying it with
> en_US.UTF-8 as in comment #4.

I think it is not reproducible with C.UTF-8 because C.UTF-8 defines an order for both of these code points, see comment#6.
(In reply to Carlos O'Donell from comment #5)
> I would have expected the localedata/locales/iso14651_t1_common <SPECIAL>
> section to cover all of the special characters we may wish to sort,
> including math characters. However, a quick review shows that it doesn't
> (despite some comments say that it will, which are probably wrong). I'm
> surprised that the unspecified characters (from the UTF-8 charmap) aren't
> simply sorted by code point by default.

I should probably try to update the iso14651_t1_common file to include more stuff, maybe everything from the DUCET?
This is fixed in Fedora 28 because of the glibc collation update: https://sourceware.org/bugzilla/show_bug.cgi?id=14095
Appears fixed with glibc-2.27.9000-14.fc29. Thanks!
Appears fixed to me too:

[marco@localhost ~]$ (echo "∅"; echo "∞"; echo "∅") | sort | uniq
∅
[marco@localhost ~]$ sudo dnf upgrade glibc --releasever 28
[cut]
================================================================================
 pacchetto              Arch      Versione         Repository    Dim.
================================================================================
Aggiornamento in corso:
 glibc                  i686      2.27-8.fc28      fedora        3.4 M
 glibc                  x86_64    2.27-8.fc28      fedora        3.6 M
 glibc-common           x86_64    2.27-8.fc28      fedora        762 k
 glibc-devel            x86_64    2.27-8.fc28      fedora        1.0 M
 glibc-headers          x86_64    2.27-8.fc28      fedora        454 k
 glibc-langpack-en      x86_64    2.27-8.fc28      fedora        803 k
 nss_nis                x86_64    3.0-3.fc28       fedora         39 k
Installazione dipendenze:
 libnsl                 i686      2.27-8.fc28      fedora         77 k
 libnsl                 x86_64    2.27-8.fc28      fedora         73 k
 libxcrypt              i686      4.0.0-5.fc28     fedora         78 k
     sostituisce  libcrypt.i686 2.26-27.fc27
     sostituisce  libcrypt.x86_64 2.26-27.fc27
 libxcrypt              x86_64    4.0.0-5.fc28     fedora         77 k
     sostituisce  libcrypt.i686 2.26-27.fc27
     sostituisce  libcrypt.x86_64 2.26-27.fc27
 libxcrypt-devel        x86_64    4.0.0-5.fc28     fedora         15 k

Riepilogo della transazione
================================================================================
Installati   5 pacchetti
Aggiornati   7 pacchetti
[cut]
Installati:
  libnsl.i686 2.27-8.fc28
  libnsl.x86_64 2.27-8.fc28
  libxcrypt.i686 4.0.0-5.fc28
  libxcrypt.x86_64 4.0.0-5.fc28
  libxcrypt-devel.x86_64 4.0.0-5.fc28
Aggiornati:
  glibc.i686 2.27-8.fc28
  glibc.x86_64 2.27-8.fc28
  glibc-common.x86_64 2.27-8.fc28
  glibc-devel.x86_64 2.27-8.fc28
  glibc-headers.x86_64 2.27-8.fc28
  glibc-langpack-en.x86_64 2.27-8.fc28
  nss_nis.x86_64 3.0-3.fc28
Fatto!
[marco@localhost ~]$ (echo "∅"; echo "∞"; echo "∅") | sort | uniq
∅
∞
[marco@localhost ~]$
# dnf remove glibc-langpack-it
$ (echo "∅"; echo "∞"; echo "∅") | sort | uniq
∅
∞

Ah, here it is! There is still a problem in the Italian language pack (glibc-langpack-it).
If, instead, I reinstall glibc-langpack-it, the bug comes back in Fedora 28:

$ (echo "∅"; echo "∞"; echo "∅") | sort | uniq
∅
[marco@localhost ~]$
(In reply to Marco Motta from comment #19)
> If, instead, I reinstall glibc-langpack-it, the bug come back in Fedora 28:
>
> $ (echo "∅"; echo "∞"; echo "∅") | sort | uniq
> ∅
> [marco@localhost ~]$

I cannot reproduce that. It works for me with and without glibc-langpack-it installed. Running in it_IT.UTF-8 locale does not seem to make a difference.
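To compare results between machines, something like the following sketch may help. It runs the reproducer from the original report under several locales and prints how many distinct lines survive. The function name count_distinct and the list of locales are only illustrative assumptions. Note that a locale which is not installed silently behaves like the C locale, so check `locale -a` before trusting a "2" for it.

```shell
# Hypothetical helper: run the reproducer under the locale named in $1 and
# print the number of lines left after sort | uniq.
count_distinct() {
    printf '%s\n' "∅" "∞" "∅" |
        LC_ALL="$1" sort |
        LC_ALL="$1" uniq |
        wc -l | tr -d ' '    # some wc implementations pad the count with spaces
}

# 2 means ∅ and ∞ were kept apart; 1 means they were collapsed into one line.
for loc in C C.UTF-8 en_US.UTF-8 it_IT.UTF-8; do
    echo "$loc: $(count_distinct "$loc") distinct line(s)"
done
```

Posting this output together with the glibc and glibc-langpack-* package versions should make it easier to see where the two setups differ.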