Created attachment 1552522 [details]
glibc C test file strcoll lv_LV.UTF-8

Description of problem:
strcoll returns an incorrect sort order in many applications (sorting a text file, PHP arrays, and more). Tested with locale lv_LV.UTF-8. Other locales may be affected as well.

Version-Release number of selected component (if applicable):
Results are wrong starting with glibc-2.25.

How reproducible:
strcoll(str1, str2) // see the attached test file

Steps to Reproduce:
Compile and run test_strcoll_lv.c with glibc 2.24 - Result correct
Compile and run test_strcoll_lv.c with glibc >= 2.25 - Result incorrect

Actual results:
str1 is greater than str2 --correct
str2 is less than str3 --incorrect
sorted like: aa,ve,āb

Expected results:
str1 is greater than str2 --correct
str2 is greater than str3 --correct
sorted like: aa,āb,ve

Additional info:
2.17 CentOS 7 - Correct
2.23 Ubuntu 16 LTS - Correct
2.24 Fedora 25 - Correct
2.25 Fedora 26 - Incorrect
2.27 Ubuntu 18 LTS - Incorrect
2.28 Fedora 29 - Incorrect
Your expected sort of aa < āb < ve is correct.

Your test case doesn't match these values though. I suggest you review your test case.

I have used the data from your test for my own test.

Running my own test on F29 shows:
"ab" is greater than "āa" (13)
"āa" is less than "ve" (-117)
"ab" is less than "ve" (-117)
"āa" is less than "ab" (-13)
"ve" is greater than "āa" (117)
"ve" is greater than "ab" (117)

Running my own test on F30 shows:
"ab" is greater than "āa" (13)
"āa" is less than "ve" (-117)
"ab" is less than "ve" (-117)
"āa" is less than "ab" (-13)
"ve" is greater than "āa" (117)
"ve" is greater than "ab" (117)

Which results in the following total sort: āa < ab < ve.

This is all correct.

This also matches ICU/CLDR:
>>> import icu
>>> collator = icu.Collator.createInstance(icu.Locale('en_US.UTF-8'))
>>> sorted(['a','b','ā','v', 'aa', 'ab', 'āa', 'āb'], key=collator.getSortKey)
['a', '\xc4\x81', 'aa', '\xc4\x81a', 'ab', '\xc4\x81b', 'b', 'v']

a < ā < aa < āa < ab < āb < b < v

Does that answer your question?
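A pairwise report like the one in the comment above can be approximated with Python's standard `locale` module, which calls the C library's strcoll. This is only a sketch: `describe` and `pairwise_report` are hypothetical helper names, the lv_LV.UTF-8 locale is assumed to be installed, and the numeric values depend on the glibc version in use.

```python
import locale
from itertools import permutations


def describe(a, b, compare=locale.strcoll):
    """Format one comparison the way the report above does."""
    result = compare(a, b)
    if result > 0:
        relation = "is greater than"
    elif result < 0:
        relation = "is less than"
    else:
        relation = "is equal to"
    return f'"{a}" {relation} "{b}" ({result})'


def pairwise_report(words, compare=locale.strcoll):
    """Compare every ordered pair of distinct words."""
    return [describe(a, b, compare) for a, b in permutations(words, 2)]


if __name__ == "__main__":
    try:
        # Assumption: the Latvian locale is available on this system.
        locale.setlocale(locale.LC_COLLATE, "lv_LV.UTF-8")
    except locale.Error:
        print("lv_LV.UTF-8 is not installed; using the current locale instead")
    for line in pairwise_report(["ab", "āa", "ve"]):
        print(line)
```

Passing the comparison function as a parameter makes it possible to run the same report against another collator for a side-by-side check.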
No. In the lv_LV.UTF-8 collation the alphabetical order is:
a < aa < ab < ā < āa < āb < b < be < v < vē

That is, compare the first character by the alphabet, then the second character, then the third, and so on.

Alphabet: a,ā,b,c,č,d,e,ē,f,g,ģ,h,i,ī,j,k,ķ,l,ļ,m,n,ņ,o,p,r,s,š,t,u,ū,v,z,ž

With glibc <= 2.24, sort and strcoll return the correct order.

There are no problems when you create an array with only one character per string (like an alphabet array); the problems start with more than one character per string.

So the main problem: I can't order strings alphabetically (collation lv_LV.UTF-8) in newer OS/apps/glibc, and I don't know which package or source is responsible or when something changed.

Not working: PHP strcoll, PHP ICU, C strcoll, C ICU, Python ICU. All of them seem to share one source of ordering.

With an old OS with glibc 2.24 the order is correct, and in MS Excel the order is correct.

After some days of testing I stopped at glibc, because it sorts with collation and I think it uses https://unicode.org/. But nothing has changed at the Unicode level for the lv_LV locale.
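The letter-by-letter ordering requested in the comment above can be sketched in a few lines of Python. This illustrates only the expected result, not how glibc or ICU compute it: `LATVIAN_ALPHABET`, `RANK` and `alphabet_key` are hypothetical names, and the treatment of characters outside the alphabet is an assumption.

```python
# Latvian alphabet exactly as listed in the comment above.
LATVIAN_ALPHABET = "aābcčdeēfgģhiījkķlļmnņoprsštuūvzž"

# Position of each letter in the alphabet (hypothetical helper table).
RANK = {ch: i for i, ch in enumerate(LATVIAN_ALPHABET)}


def alphabet_key(word):
    """Sort key that compares strings letter by letter in alphabet order."""
    # Assumption: characters outside the alphabet sort after all known
    # letters, ordered by code point.
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower()]


words = ["vē", "āb", "b", "aa", "ā", "v", "ab", "be", "āa", "a"]
print(sorted(words, key=alphabet_key))
# ['a', 'aa', 'ab', 'ā', 'āa', 'āb', 'b', 'be', 'v', 'vē']
```

Here every letter, with or without a diacritic, has its own rank, so "ab" sorts before "āa" because 'a' precedes 'ā' in the first position.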
Wikipedia says in <https://en.wikipedia.org/wiki/Latvian_language#Standard_orthography> (without citing sources): “The vowel letters A, E, I and U can take a macron to show length, unmodified letters being short; these letters are not differentiated while sorting (e.g. in dictionaries).” It is quite possible that Latvian has multiple incompatible sort orders for different use cases (other languages use different sort orders for phone books and dictionaries, for example). glibc can only represent one sort order.
Yes, you are not the only one who has given me this link. I think this order is used in dictionaries for the benefit of speakers of other languages who do not know the Latvian alphabet, to make words easier to find.

OK, I will try to write to the Latvian language center. I will let you know what they say about this Wikipedia link and the ordering.
(In reply to AgrisV from comment #4)
> Yes you are not only one who gives me this link.. i think this order for
> dictionaries are because for other languages who do not know latvian
> alphabet.. for better find words..
> OK I will try to write to Latvian language center. Will inform you about
> this wikipedia link and ordering.

Writing to the Latvian language center is a good way forward. Thank you for doing that. It is often very hard for maintainers, like myself, to interface with such language centers without being a native speaker.

As Florian suggested, this may be an issue of multiple incompatible sortings, and glibc supports only one default sorting.

It may be that Latvian users need an alternative sorting. Such a sorting would need an entirely distinct algorithm for sorting. As soon as you have a difference like 'a' vs 'b' the words are sorted differently by the POSIX collation algorithm (glibc does not use UCA, but it's similar).

I suggest you additionally reach out to the CLDR list and ask what's possible to support.

In glibc we can't support an alternate sorting like you suggest, not with the current algorithm and APIs. It is always possible to extend the locales to support multiple distinct sortings, but such an interface or framework doesn't exist and would have to be proposed, written, tested, etc. You can see it would be a lot of work.
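The effect described above, where a difference such as 'a' vs 'b' decides the order before any accent is considered, can be illustrated with a simplified two-level sort key. This is a rough model, not the actual glibc or UCA implementation: it assumes that a letter's primary weight is its base letter from Unicode canonical decomposition, and `split_letter` and `multilevel_key` are hypothetical names.

```python
import unicodedata


def split_letter(ch):
    """Return (base letter, accent flag) using canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    has_accent = 1 if len(decomposed) > 1 else 0
    return base, has_accent


def multilevel_key(word):
    """Two-level key: all base letters first, accents only as a tie-breaker."""
    pairs = [split_letter(ch) for ch in word.lower()]
    primary = [base for base, _ in pairs]
    secondary = [accent for _, accent in pairs]
    return (primary, secondary)


words = ["a", "b", "ā", "v", "aa", "ab", "āa", "āb"]
print(sorted(words, key=multilevel_key))
# ['a', 'ā', 'aa', 'āa', 'ab', 'āb', 'b', 'v']
```

Under this model "āa" and "aa" share the primary key ['a', 'a'], which is lower than the primary key ['a', 'b'] of "ab", so "āa" sorts before "ab" regardless of the macron.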
So I decided to go via CLDR, because that is the main source of the problem. But I am worried about old tickets which have stayed open for 3-4 years or more.

For reference:
Here is the Unicode ticket, whose last comment gives evidence of the Latvian standards.
https://unicode.org/cldr/trac/ticket/11982#comment:2

If someone knows how, and how fast, these tickets are processed, let me know.
(In reply to AgrisV from comment #6)
> So i decide to go via CLDR, because this is main point for problem. But i
> worried about old tickets witch stay opened for 3-4 and more years..
> For reference:
> Here are unicode ticker with last comment with prove the truth of Latvian
> standards.
> https://unicode.org/cldr/trac/ticket/11982#comment:2
>
> If someone know how and how fast they are processing this tickets, let me
> know.

Unfortunately Red Hat is not a member of the Unicode consortium, and so we cannot raise issues like this on your behalf.

It's also not clear to me that we can implement the Latvian rules given the UCA or POSIX algorithms.

For example:
~~~
Data sorting and searching rules
* The data containing only letters of the Latvian alphabet are sorted according to the Latvian alphabet from left to right. All letters have equal diacritical weight, but the case is ignored. If two strings differ only by using capital and the same small letter in one position, the string with capital letter is preferred.
* When sorting data from a character set containing the Latvian letters as a subset international sorting rules are used adapting them to the Latvian needs so that the order of strings containing only Latvian letters is preserved.
* In Latvia the determining international sorting rule for an arbitrary character set is the one described in [2] (English language locale for Denmark).
* The Latvian letters with caron, macron or cedilla (A-macron, C-caron, E-macron, G-cedilla, I-macron, K-cedilla, N-cedilla, O-macron, R-cedilla, S-caron, U-macron and Z-caron) have the same diacritical weight as those without a diacritical mark.
* The capital letters are considered before small ones.
~~~

The first bullet appears to say that all letters are sorted according to the Latvian alphabet, irrespective of diacritical weights; this is just an alphabet sort. The first bullet also says upper case sorts before lower case, so uppercase is listed first.

The fourth bullet appears to say that the *-macron letters have the same sorting weight as those letters without macron, so again sorting is just an alphabet sort. The fifth bullet again says capital letters first.

So when sorting "ab" vs "āa":
- First letter 'a' vs 'ā', 'ā' comes after.

Current sort: "ab" > "āa"

To achieve the requested order ("ab" < "āa") we must remove 'ā' from the equivalence class of 'a'; it must be considered *entirely* distinct from 'a' and sort truly after it, with a distinct primary weight.

If we do this it will break '[=a=]' and that regexp will not include 'ā', is that OK?
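The trade-off raised above can be made concrete with a toy model in which a bracket expression such as '[=a=]' matches every character sharing the primary weight of 'a'. The two weight tables below are invented for illustration and are not taken from the glibc locale sources; `shared`, `distinct` and `equivalence_class` are hypothetical names.

```python
import unicodedata

LETTERS = "aābcčdeē"

# Current-style table: an accented letter shares the primary weight of its
# base letter (modelled here by the base letter itself).
shared = {c: unicodedata.normalize("NFD", c)[0] for c in LETTERS}

# Proposed-style table: every letter of the alphabet has its own primary
# weight (modelled here by its position in LETTERS).
distinct = {c: i for i, c in enumerate(LETTERS)}


def equivalence_class(ch, primary):
    """Characters that would be equivalent to ch under the given table."""
    return {c for c in primary if primary[c] == primary[ch]}


print(sorted(equivalence_class("a", shared)))    # ['a', 'ā']
print(sorted(equivalence_class("a", distinct)))  # ['a']
```

In this model, giving 'ā' a distinct primary weight is what makes "ab" sort before "āa", and it is also what removes 'ā' from the set matched by '[=a=]'.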
(In reply to Carlos O'Donell from comment #7)
> If we do this it will break '[=a=]' and that regexp will not include 'ā', is
> that OK?

Could you please sort the following file as you expect it?

https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/lv_LV.UTF-8.in;h=db7e83c77e83183ee88eb9769f82a66c4cb758ab;hb=HEAD

Then we can use this as a reference for discussion.
Created attachment 1554182 [details] lv_LV.UTF-8.in_sorted
Created attachment 1554183 [details] lv_LV.UTF-8_with_more_chars_and_removed.in
I sorted the files as I understand the rules, leaving uppercase after lowercase.

lv_LV.UTF-8.in_sorted: attachment with only the characters that were already there.

lv_LV.UTF-8_with_more_chars_and_removed.in: I removed characters which I do not know and added some missing ones for better understanding.

If up to now regexps do not include some of these characters ā č ē ģ ī ķ ļ ņ š ū ž for equivalence, then I think no one in our language uses regexps to find them.
For example, the European ordering rules:
https://en.wikipedia.org/wiki/European_ordering_rules

At the primary level they ignore accents on letters in all languages. Accents are ordered only at the secondary level.

Before I created multiple tickets, many programs (which use ICU) were NOT ordering ā ē ū ī at the secondary level, for example the ICU online demo.
This message is a reminder that Fedora 29 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 29 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Agris, Could you please file this bug upstream with glibc under the localedata component? https://sourceware.org/bugzilla/ Once we have an upstream bug I can engage the community for a broader discussion about fixing this. Thank you!
Agris, I have filed a bug for you here: https://sourceware.org/bugzilla/show_bug.cgi?id=25206 With the relevant details and your sorted files. I'll see what I can do to help upstream.
This is now being tracked upstream. Thanks.