1336308 – INFINITY (∞) and EMPTY SET (∅) are treated as if they were the same character by sort and uniq

Status:	CLOSED NEXTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	27
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Mike FABIAN
QA Contact:	Fedora Extras Quality Assurance

Reported:	2016-05-16 07:06 UTC by Marco Motta
Modified:	2018-05-08 14:18 UTC
CC List:	18 users
Fixed In Version:	glibc-2.27-6.fc28
Last Closed:	2018-04-25 09:12:46 UTC
Type:	Bug

Attachments
minimal example (307 bytes, text/plain) 2016-05-16 08:25 UTC, Kamil Dudka

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Sourceware	18978	0	'P2'	'NEW'	'The collation symbol “UNDEFINED” does not work as specified in the standard'	2019-12-06 16:25:25 UTC

Description Marco Motta 2016-05-16 07:06:46 UTC

Description of problem:

Infinity (∞) and empty set (∅) are treated as if they were the same character by sort and uniq.

Version-Release number of selected component (if applicable):

coreutils-8.24-6.fc23.x86_64

How reproducible:

$ (echo "∅"; echo "∞"; echo "∅") | sort
∅
∞
∅

$ (echo "∅"; echo "∞"; echo "∅") | sort | uniq
∅


Steps to Reproduce:
1. Open a terminal (I use gnome-terminal)
2. Type the above commands
3. Read output

Actual results:

$ (echo "∅"; echo "∞"; echo "∅") | sort
∅
∞
∅

$ (echo "∅"; echo "∞"; echo "∅") | sort | uniq
∅


Expected results:

$ (echo "∅"; echo "∞"; echo "∅") | sort
∅
∞

$ (echo "∅"; echo "∞"; echo "∅") | sort | uniq
∅
∞


Additional info:

Comment 1 Kamil Dudka 2016-05-16 08:24:47 UTC

This is caused by strcoll(3) comparing those symbols as equal in the UTF-8 locale.  I am switching the component to glibc.  Minimal example attached.

Comment 2 Kamil Dudka 2016-05-16 08:25:25 UTC

Created attachment 1157805
minimal example

Comment 4 Kamil Dudka 2016-05-16 08:30:31 UTC

$ curl -JO 'https://bugzilla.redhat.com/attachment.cgi?id=1157805'
$ sh bz1336308.c
+ locale
LANG=en_US.utf8
LC_CTYPE=en_US.utf8
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE=en_US.utf8
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=
+ gcc bz1336308.c
+ ./a.out
strcoll("∞", "∅") = "0"
+ exit 0

Comment 5 Carlos O'Donell 2016-05-17 05:07:47 UTC

The en_US locale uses ISO/IEC 14651:2011 for collation.

ISO/IEC 14651:2011 doesn't contain collation rules for mathematical symbols, and neither do the European Ordering Rules (EOR); e.g. strcoll("∞", "∟") = "0".

If you want strict ordering by Unicode code point then you must use the C.utf8 locale, which has forward sorting based on the code point.

I would have expected the localedata/locales/iso14651_t1_common <SPECIAL> section to cover all of the special characters we may wish to sort, including math characters. However, a quick review shows that it doesn't (despite some comments saying that it will, which are probably wrong). I'm surprised that the unspecified characters (from the UTF-8 charmap) aren't simply sorted by code point by default.

Until then, this is a question of doing the upstream work to sort all of the unsupported characters by code point, which may need some automation.

Discussion started upstream:
https://www.sourceware.org/ml/libc-alpha/2016-05/msg00325.html

Comment 6 Mike FABIAN 2016-05-17 05:40:04 UTC

(In reply to Carlos O'Donell from comment #5)
> The en_US locale uses ISO/IEC 14651:2011 for collation.
> 
> ISO/IEC 14651:2011 doesn't contain collation rules for mathematical symbols,
> neither do the European Ordering rules (EOR) e.g. strcoll("∞", "∟") = "0".
> 
> If you want strict ordering by Unicode code point then you must use the
> C.utf8 locale which has forward sorting based on the code point.

Yes, but the C.utf8 locale does this in a quite silly way by enumerating
all the code points.

The LC_COLLATE part in the source of the C.utf8 locale looks like this:

    LC_COLLATE
    order_start forward
    <U0000>
    ..
    <UFFFF>
    <U10000>
    ..
    <U1FFFF>
    <U20000>
    ..
    <U2FFFF>
    <UE0000>
    ..
    <UEFFFF>
    <UF0000>
    ..
    <UFFFFF>
    <U100000>
    ..
    <U10FFFF>
    UNDEFINED
    order_end
    END LC_COLLATE


If the “UNDEFINED” symbol in that LC_COLLATE definition worked as
specified by POSIX, enumerating all the code points would not be
needed and the binary locale would become much smaller (a few hundred
kilobytes instead of 1.8 megabytes).

And we could easily fix the other locales like the en_US.utf8 locale
mentioned in comment#4 by inserting an UNDEFINED in the locale’s
LC_COLLATE. Some locale sources already use UNDEFINED, but it does not
work as specified.

The specification says:

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html

opengroup> The symbol UNDEFINED shall be interpreted as including all
opengroup> coded character set values not specified explicitly or via
opengroup> the ellipsis symbol. Such characters shall be inserted in
opengroup> the character collation order at the point indicated by the
opengroup> symbol, and in ascending order according to their coded
opengroup> character set values. If no UNDEFINED symbol is specified,
opengroup> and the current coded character set contains characters not
opengroup> specified in this section, the utility shall issue a
opengroup> warning message and place such characters at the end of the
opengroup> character collation order.

But it does not work like that.

I reported a bug about this a while ago:

https://sourceware.org/bugzilla/show_bug.cgi?id=18978

Comment 7 Fedora End Of Life 2016-11-25 09:02:52 UTC

This message is a reminder that Fedora 23 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 23. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '23'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 23 reaches end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 8 Kamil Dudka 2016-11-25 09:15:29 UTC

Still reproducible with glibc-2.24-3.fc25.

Comment 9 Fedora End Of Life 2017-11-16 19:15:03 UTC

This message is a reminder that Fedora 25 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 25. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '25'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 25 reaches end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 10 Kamil Dudka 2017-11-19 23:12:34 UTC

Still reproducible with glibc-2.26.9000-26.fc28.

Comment 11 Carlos O'Donell 2017-11-21 23:52:32 UTC

(In reply to Kamil Dudka from comment #10)
> Still reproducible with glibc-2.26.9000-26.fc28.

Reproducible with C.UTF-8?

I expect the answer is yes until we fix the bugs in Fedora's C.UTF-8, but I wanted to double check.

Comment 12 Kamil Dudka 2017-11-22 10:54:27 UTC

(In reply to Carlos O'Donell from comment #11)
> (In reply to Kamil Dudka from comment #10)
> > Still reproducible with glibc-2.26.9000-26.fc28.
> 
> Reproducible with C.UTF-8?

Good question.  It is *not* reproducible with C.UTF-8.  I was trying it with en_US.UTF-8 as in comment #4.

Comment 13 Mike FABIAN 2017-11-22 12:05:19 UTC

(In reply to Kamil Dudka from comment #12)
> (In reply to Carlos O'Donell from comment #11)
> > (In reply to Kamil Dudka from comment #10)
> > > Still reproducible with glibc-2.26.9000-26.fc28.
> > 
> > Reproducible with C.UTF-8?
> 
> Good question.  It is *not* reproducible with C.UTF-8.  I was trying it with
> en_US.UTF-8 as in comment #4.

I think it is not reproducible with C.UTF-8 because C.UTF-8
defines an order for both of these code points, see comment#6.

Comment 14 Mike FABIAN 2017-11-22 12:13:36 UTC

(In reply to Carlos O'Donell from comment #5)

> I would have expected the localedata/locales/iso14651_t1_common <SPECIAL>
> section to cover all of the special characters we may wish to sort,
> including math characters. However, a quick review shows that it doesn't
> (despite some comments saying that it will, which are probably wrong). I'm
> surprised that the unspecified characters (from the UTF-8 charmap) aren't
> simply sorted by code point by default.

I should probably try to update the iso14651_t1_common file to
include more stuff, maybe everything from the DUCET?

Comment 15 Mike FABIAN 2018-04-25 09:12:46 UTC

This is fixed in Fedora 28 because of the glibc collation update:

https://sourceware.org/bugzilla/show_bug.cgi?id=14095

Comment 16 Kamil Dudka 2018-04-25 11:20:27 UTC

Appears fixed with glibc-2.27.9000-14.fc29.  Thanks!

Comment 17 Marco Motta 2018-04-25 11:31:48 UTC

Appears fixed to me too:

[marco@localhost ~]$ (echo "∅"; echo "∞"; echo "∅") | sort | uniq
∅
[marco@localhost ~]$ sudo dnf upgrade glibc --releasever 28
[cut]
================================================================================
 Package                  Arch          Version             Repository     Size
================================================================================
Upgrading:
 glibc                    i686          2.27-8.fc28         fedora        3.4 M
 glibc                    x86_64        2.27-8.fc28         fedora        3.6 M
 glibc-common             x86_64        2.27-8.fc28         fedora        762 k
 glibc-devel              x86_64        2.27-8.fc28         fedora        1.0 M
 glibc-headers            x86_64        2.27-8.fc28         fedora        454 k
 glibc-langpack-en        x86_64        2.27-8.fc28         fedora        803 k
 nss_nis                  x86_64        3.0-3.fc28          fedora         39 k
Installing dependencies:
 libnsl                   i686          2.27-8.fc28         fedora         77 k
 libnsl                   x86_64        2.27-8.fc28         fedora         73 k
 libxcrypt                i686          4.0.0-5.fc28        fedora         78 k
     replacing  libcrypt.i686 2.26-27.fc27
     replacing  libcrypt.x86_64 2.26-27.fc27
 libxcrypt                x86_64        4.0.0-5.fc28        fedora         77 k
     replacing  libcrypt.i686 2.26-27.fc27
     replacing  libcrypt.x86_64 2.26-27.fc27
 libxcrypt-devel          x86_64        4.0.0-5.fc28        fedora         15 k

Transaction Summary
================================================================================
Install  5 Packages
Upgrade  7 Packages

[cut]

Installed:
  libnsl.i686 2.27-8.fc28                   libnsl.x86_64 2.27-8.fc28
  libxcrypt.i686 4.0.0-5.fc28               libxcrypt.x86_64 4.0.0-5.fc28
  libxcrypt-devel.x86_64 4.0.0-5.fc28

Upgraded:
  glibc.i686 2.27-8.fc28               glibc.x86_64 2.27-8.fc28
  glibc-common.x86_64 2.27-8.fc28      glibc-devel.x86_64 2.27-8.fc28
  glibc-headers.x86_64 2.27-8.fc28     glibc-langpack-en.x86_64 2.27-8.fc28
  nss_nis.x86_64 3.0-3.fc28

Complete!
[marco@localhost ~]$ (echo "∅"; echo "∞"; echo "∅") | sort | uniq
∅
∞
[marco@localhost ~]$

Comment 18 Marco Motta 2018-05-07 15:32:21 UTC

# dnf remove glibc-langpack-it

$ (echo "∅"; echo "∞"; echo "∅") | sort | uniq
∅
∞
Ah, here it is! There is still a problem in the Italian language pack (glibc-langpack-it).

Comment 19 Marco Motta 2018-05-07 18:34:16 UTC

If, instead, I reinstall glibc-langpack-it, the bug comes back in Fedora 28:

$ (echo "∅"; echo "∞"; echo "∅") | sort | uniq
∅
[marco@localhost ~]$

Comment 20 Mike FABIAN 2018-05-08 14:18:45 UTC

(In reply to Marco Motta from comment #19)
> If, instead, I reinstall glibc-langpack-it, the bug comes back in Fedora 28:
> 
> $ (echo "∅"; echo "∞"; echo "∅") | sort | uniq
> ∅
> [marco@localhost ~]$

I cannot reproduce that. It works for me with and without glibc-langpack-it
installed. Running in it_IT.UTF-8 locale does not seem to make a difference.
