Bug 1696729 - strcoll sort result incorrect locale lv_LV.UTF-8
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: rawhide
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Carlos O'Donell
QA Contact: Fedora Extras Quality Assurance
Reported: 2019-04-05 14:05 UTC by AgrisV
Modified: 2019-11-19 14:23 UTC
Last Closed: 2019-11-19 14:23:43 UTC
Type: Bug

Attachments
glibc C test file strcoll lv_LV.UTF-8 (966 bytes, text/x-csrc)
2019-04-05 14:05 UTC, AgrisV
lv_LV.UTF-8.in_sorted (330 bytes, application/octet-stream)
2019-04-10 10:42 UTC, AgrisV
lv_LV.UTF-8_with_more_chars_and_removed.in (413 bytes, application/octet-stream)
2019-04-10 10:43 UTC, AgrisV

Links
System ID Private Priority Status Summary Last Updated
Sourceware 25206 0 'P2' 'NEW' 'strcoll sort result incorrect locale lv_LV.UTF-8' 2019-11-19 14:14:26 UTC

Description AgrisV 2019-04-05 14:05:30 UTC
Created attachment 1552522 [details]
glibc C test file strcoll lv_LV.UTF-8

Description of problem:
 strcoll sorts incorrectly in many applications (sorting text files, PHP arrays, and more). Tested with locale lv_LV.UTF-8. Other locales may also be affected.

Version-Release number of selected component (if applicable):
 Results are wrong starting with glibc 2.25.

How reproducible:
 Call strcoll(str1, str2); see the attached test file.

Steps to Reproduce:
 Compile and run test_strcoll_lv.c with glibc 2.24 - Result correct
 Compile and run test_strcoll_lv.c with glibc >= 2.25 - Result incorrect

Actual results:
 str1 is greater than str2 --correct
 str2 is less than str3 --incorrect

 sorted like: aa,ve,āb

Expected results:
 str1 is greater than str2 --correct
 str2 is greater than str3 --correct

 sorted like: aa,āb,ve

Additional info:
 2.17 CentOS 7 - Correct
 2.23 Ubuntu 16.04 LTS - Correct
 2.24 Fedora 25 - Correct

 2.25 Fedora 26 - Incorrect
 2.27 Ubuntu 18.04 LTS - Incorrect
 2.28 Fedora 29 - Incorrect
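
A minimal sketch of the same check in Python instead of the attached C file (whose str1/str2/str3 values are not shown in this report). It assumes the lv_LV.UTF-8 locale is installed on the system; the helper name sort_with_locale is hypothetical, and the three strings are the ones from the "sorted like" lines above. No expected output is given here, because the result depends on the installed glibc version.

```python
import functools
import locale


def sort_with_locale(words, locale_name="lv_LV.UTF-8"):
    """Sort words with strcoll under locale_name.

    Returns None if the locale is not installed (hypothetical helper).
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # locale.strcoll wraps the C library's strcoll()
        return sorted(words, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


result = sort_with_locale(["ve", "āb", "aa"])
print(result)
```

Running this on systems with different glibc versions should show whether the order changes between them.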

Comment 1 Carlos O'Donell 2019-04-06 08:11:31 UTC
Your expected sort of aa < āb < ve is correct.

Your test case doesn't match these values though. I suggest you review your test case.

I have used the data from your test for my own test.

Running my own test on F29 shows:

"ab" is greater than "āa" (13)
"āa" is less than "ve" (-117)
"ab" is less than "ve" (-117)
"āa" is less than "ab" (-13)
"ve" is greater than "āa" (117)
"ve" is greater than "ab" (117)

Running my own test on F30 shows:

"ab" is greater than "āa" (13)
"āa" is less than "ve" (-117)
"ab" is less than "ve" (-117)
"āa" is less than "ab" (-13)
"ve" is greater than "āa" (117)
"ve" is greater than "ab" (117)

Which results in the following total sort: āa < ab < ve.

This is all correct.

This also matches ICU/CLDR:
>>> import icu
>>> collator = icu.Collator.createInstance(icu.Locale('en_US.UTF-8'))
>>> sorted(['a','b','ā','v', 'aa', 'ab', 'āa', 'āb'], key=collator.getSortKey)
['a', '\xc4\x81', 'aa', '\xc4\x81a', 'ab', '\xc4\x81b', 'b', 'v']

a < ā < aa < āa < ab < āb < b < v

Does that answer your question?
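
The order above can be imitated with a simplified two-level sort key. This is only a sketch and not how glibc or ICU are implemented: the names strip_marks and two_level_key are hypothetical, the primary level is approximated by removing combining marks, and the secondary level only records whether each letter carried a mark.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks, so 'ā' becomes 'a' (stand-in for a primary weight)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def two_level_key(text):
    """Primary: base letters only. Secondary: which positions carry a diacritic."""
    text = unicodedata.normalize("NFC", text)
    primary = strip_marks(text)
    secondary = tuple(ch != strip_marks(ch) for ch in text)
    return (primary, secondary)


words = ["a", "b", "ā", "v", "aa", "ab", "āa", "āb"]
ordered = sorted(words, key=two_level_key)
print(" < ".join(ordered))  # a < ā < aa < āa < ab < āb < b < v
```

The diacritic only matters when the base letters of two strings are identical, which is why "āa" sorts before "ab".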

Comment 2 AgrisV 2019-04-08 08:35:06 UTC
No. In the lv_LV.UTF-8 collation the alphabetical order is:
a < aa < ab < ā < āa < āb < b < be < v < vē

It works like this: compare the first characters by the alphabet, then the second, then the third, and so on.

Alphabet:
a,ā,b,c,č,d,e,ē,f,g,ģ,h,i,ī,j,k,ķ,l,ļ,m,n,ņ,o,p,r,s,š,t,u,ū,v,z,ž

With glibc <= 2.24, sort and strcoll return the correct order.

There is no problem with an array of single-character strings (like the alphabet array); the problems start with strings of more than one character.

So the main problem: I can't sort strings in alphabetical (lv_LV.UTF-8 collation) order on newer OSes/apps/glibc, and I don't know which package or source is responsible or when something changed.
These do not work: PHP strcoll, PHP ICU, C strcoll, C ICU, Python ICU. They all seem to share one source for the ordering.
On an old OS with glibc 2.24 the order is correct, and in MS Excel the order is correct.

After some days of testing I ended up at glibc, because it sorts using collation data and I think it uses https://unicode.org/. But nothing has changed at the Unicode level for the lv_LV locale.
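
The letter-by-letter comparison described in this comment can be sketched as a plain sort key. This is an illustration of the requested order only, not glibc code; the names LATVIAN_ALPHABET, RANK and alphabet_key are hypothetical, and placing characters outside the alphabet after all Latvian letters is an assumption of the sketch.

```python
# The 33-letter alphabet listed in this comment, in order.
LATVIAN_ALPHABET = "aābcčdeēfgģhiījkķlļmnņoprsštuūvzž"
RANK = {letter: index for index, letter in enumerate(LATVIAN_ALPHABET)}


def alphabet_key(word):
    """Rank each letter by its alphabet position; every letter is distinct."""
    # Assumption: characters outside the alphabet sort after all Latvian letters.
    return tuple(RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower())


words = ["vē", "b", "āb", "a", "be", "ab", "ā", "v", "āa", "aa"]
ordered = sorted(words, key=alphabet_key)
print(" < ".join(ordered))  # a < aa < ab < ā < āa < āb < b < be < v < vē
```

With this key the three strings from the original report sort as aa, āb, ve.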

Comment 3 Florian Weimer 2019-04-08 08:53:37 UTC
Wikipedia says in <https://en.wikipedia.org/wiki/Latvian_language#Standard_orthography> (without citing sources):

“The vowel letters A, E, I and U can take a macron to show length, unmodified letters being short; these letters are not differentiated while sorting (e.g. in dictionaries).”

It is quite possible that Latvian has multiple incompatible sort orders for different use cases (other languages use different sort orders for phone books and dictionaries, for example).  glibc can only represent one sort order.

Comment 4 AgrisV 2019-04-08 09:14:33 UTC
Yes, you are not the only one who has given me this link. I think this ordering is used in dictionaries for speakers of other languages who do not know the Latvian alphabet, so that they can find words more easily.
OK, I will try to write to the Latvian language center and will let you know what they say about this Wikipedia link and the ordering.

Comment 5 Carlos O'Donell 2019-04-08 14:59:05 UTC
(In reply to AgrisV from comment #4)
> Yes, you are not the only one who has given me this link. I think this
> ordering is used in dictionaries for speakers of other languages who do
> not know the Latvian alphabet, so that they can find words more easily.
> OK, I will try to write to the Latvian language center and will let you
> know what they say about this Wikipedia link and the ordering.

Writing the Latvian language center is a good way forward. Thank you for doing that. It is often very hard for maintainers, like myself, to interface with such language centers without being a native speaker.

As Florian suggested, this may be an issue of multiple incompatible sortings, and glibc supports only one default sorting. It may be that Latvian users need an alternative sorting. Such a sorting would need an entirely distinct algorithm for sorting. As soon as you have a difference like 'a' vs 'b' the words are sorted differently by the POSIX collation algorithm (glibc does not use UCA, but it's similar).

I suggest you additionally reach out to the CLDR list and ask what's possible to support.

In glibc we can't support an alternate sorting like you suggest, not with the current algorithm and APIs. It is always possible to extend the locales to support multiple distinct sorting, but such an interface or framework doesn't exist and would have to be proposed, written, tested, etc. You can see it would be a lot of work.

Comment 6 AgrisV 2019-04-09 13:21:39 UTC
So I have decided to go via CLDR, because that is the root of the problem. But I am worried about old tickets that have stayed open for 3-4 years or more.
For reference:
Here is the Unicode ticket; its last comment contains proof from the Latvian standards.
https://unicode.org/cldr/trac/ticket/11982#comment:2

If someone knows how, and how fast, they process these tickets, let me know.

Comment 7 Carlos O'Donell 2019-04-09 21:37:18 UTC
(In reply to AgrisV from comment #6)
> So I have decided to go via CLDR, because that is the root of the problem.
> But I am worried about old tickets that have stayed open for 3-4 years or
> more.
> For reference:
> Here is the Unicode ticket; its last comment contains proof from the
> Latvian standards.
> https://unicode.org/cldr/trac/ticket/11982#comment:2
> 
> If someone knows how, and how fast, they process these tickets, let me
> know.

Unfortunately Red Hat is not a member of the Unicode consortium, and so we cannot raise issues like this on your behalf.

It's also not clear to me that we can implement the Latvian rules given the UCA or POSIX algorithms.

For example:
~~~
Data sorting and searching rules

* The data containing only letters of the Latvian alphabet are sorted according to the Latvian alphabet from left to right. All letters have equal diacritical weight, but the case is ignored. If two strings differ only by using capital and the same small letter in one position, the string with capital letter is preferred.

* When sorting data from a character set containing the Latvian letters as a subset, international sorting rules are used, adapting them to the Latvian needs so that the order of strings containing only Latvian letters is preserved.

* In Latvia the determining international sorting rule for an arbitrary character set is the one described in [2] (English language locale for Denmark).

* The Latvian letters with caron, macron or cedilla (A-macron, C-caron, E-macron, G-cedilla, I-macron, K-cedilla, N-cedilla, O-macron, R-cedilla, S-caron, U-macron and Z-caron) have the same diacritical weight as those without a diacritical mark.

* The capital letters are considered before small ones.
~~~

The first bullet appears to say that all letters are sorted according to the Latvian alphabet, irrespective of diacritical weights, this is just alphabet sort.
The first bullet also says upper case sorts before lower case, so uppercase listed first.
The fourth bullet appears to say that the *-macron letters have the same sorting weight as those letters without macron, so again sorting is just alphabet sort.
The fifth bullet again says capital letters first.

So when sorting "ab" vs "āa":

- First letter 'a' vs 'ā': under these rules 'ā' comes after 'a', so "ab" < "āa". Current sort: "ab" > "āa"

To achieve this we must remove 'ā' from the equivalence class of 'a', it must be considered *entirely* distinct from 'a' and sort truly after with a distinct primary weight.

If we do this it will break '[=a=]' and that regexp will not include 'ā', is that OK?
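
The trade-off can be sketched with two toy weight tables. This assumes, roughly as POSIX describes it, that an equivalence class such as '[=a=]' matches the collating elements that share a primary weight with 'a'. The weight values and the names primary_key and equivalence_class are hypothetical and are not taken from the glibc locale sources.

```python
def primary_key(word, weights):
    """Compare words by primary weights only, position by position."""
    return tuple(weights[ch] for ch in word)


def equivalence_class(letter, weights):
    """Letters sharing a primary weight with `letter`, as [=letter=] would match."""
    return sorted(ch for ch, w in weights.items() if w == weights[letter])


# Hypothetical weights: 'ā' shares the primary weight of 'a' (current behaviour).
shared = {"a": 1, "ā": 1, "b": 2}
# Hypothetical weights: 'ā' gets its own primary weight (requested behaviour).
distinct = {"a": 1, "ā": 2, "b": 3}

for name, weights in (("shared", shared), ("distinct", distinct)):
    first, second = sorted(["ab", "āa"], key=lambda w: primary_key(w, weights))
    print(name, first, "<", second, equivalence_class("a", weights))
```

With the shared table "āa" sorts first and the class of 'a' contains 'ā'; with the distinct table "ab" sorts first and the class of 'a' contains only 'a'.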

Comment 8 Carlos O'Donell 2019-04-09 21:44:10 UTC
(In reply to Carlos O'Donell from comment #7)
> If we do this it will break '[=a=]' and that regexp will not include 'ā', is
> that OK?

Would you please be able to sort the following file as you expect it?
https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/lv_LV.UTF-8.in;h=db7e83c77e83183ee88eb9769f82a66c4cb758ab;hb=HEAD

Then we can use this as a reference for discussion.

Comment 9 AgrisV 2019-04-10 10:42:35 UTC
Created attachment 1554182 [details]
lv_LV.UTF-8.in_sorted

Comment 10 AgrisV 2019-04-10 10:43:13 UTC
Created attachment 1554183 [details]
lv_LV.UTF-8_with_more_chars_and_removed.in

Comment 11 AgrisV 2019-04-10 10:51:05 UTC
I sorted the files as I understand the rules. I left uppercase after lowercase.
lv_LV.UTF-8.in attachment: contains only the characters that were already there.
lv_LV.UTF-8_with_more_chars_and_removed.in: I removed the characters that I do not know and added some missing ones to make it easier to understand.

If regexps up to now have not included some of these characters (ā č ē ģ ī ķ ļ ņ š ū ž) in equivalence classes, then I think no one writing in our language uses regexps to find them.

Comment 12 AgrisV 2019-04-10 11:39:41 UTC
For example, see the European ordering rules: https://en.wikipedia.org/wiki/European_ordering_rules
At the primary level they ignore the accents on letters in all languages.
Accents are ordered only at the secondary level.
Before I created multiple tickets, I saw that much software (which uses ICU) does NOT order ā ē ū ī at the secondary level, for example the ICU online demo.

Comment 16 Ben Cotton 2019-10-31 18:45:55 UTC
This message is a reminder that Fedora 29 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 29 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 17 Carlos O'Donell 2019-11-19 14:09:10 UTC
Agris,

Could you please file this bug upstream with glibc under the localedata component?

https://sourceware.org/bugzilla/

Once we have an upstream bug I can engage the community for a broader discussion about fixing this.

Thank you!

Comment 18 Carlos O'Donell 2019-11-19 14:14:27 UTC
Agris,

I have filed a bug for you here:
https://sourceware.org/bugzilla/show_bug.cgi?id=25206

With the relevant details and your sorted files. I'll see what I can do to help upstream.

Comment 19 Florian Weimer 2019-11-19 14:23:43 UTC
This is now being tracked upstream. Thanks.

