Created attachment 356828
utf8 text data

+++ This bug was initially created as a clone of Bug #144487 +++

Vietnamese collation is broken in locale vi_VN.UTF-8. This seems to be a regression, because it was working after the bug was fixed previously in #144487. I don't see any changes in the vi_VN file that would explain this problem; it may be in another file, or in glibc's collation routines themselves. (This Bugzilla doesn't seem to preserve UTF-8 well, so the sample data is included as an attachment to this bug.)

To see the correct sorting order:
IBM's Unicode project sorts correctly: http://demo.icu-project.org/icu-bin/locexp?_=vi&d_=en&x=col
This table is derived from the Unicode collation algorithm: http://vietunicode.sourceforge.net/charset/v3.htm
Using OpenOffice Calc, the collation order for Vietnamese works correctly. (It doesn't rely on the glibc locales.)

The command:
LC_ALL=vi_VN sort < test_data.txt
will generate output that doesn't match any of the others. The others are all consistent with each other and with online sources such as dictionaries.
Created attachment 356829
c++ test case program
A regression test might be beneficial for locale collation. This problem doesn't appear with single letters as the strings, only when using words with multiple letters. Using strxfrm, I looked at these letters' collation sequences:
u: 26 01 08 01 02
a_grave: 0C 01 09 01 02
a_hook: 0C 01 0A 01 02
a_grave comes before a_hook, correctly. But:
the 2-character string <u,a_grave> strxfrm's into: 26 0C 01 09 08 01 02 02
the 2-character string <u,a_hook> strxfrm's into: 26 0C 01 08 0A 01 02 02
If the third byte from each second character's collation sequence were combined in the same order, this would work correctly. But due to the reversal, the string with a_hook moves before the string containing a_grave. This is analogous to ac sorting before ab, despite b normally coming before c. The grave and dotbelow are both affected.
Thanks for pulling me in. I haven't had time to look into it. I confirm the problem only happens with multiple letters (I wonder if it happened when I submitted the updated locale, because I only tested it with single letters).
Thank you! Looking at this closely, the sample data I'm using aren't actual words, because the accent mark placement rules would put the dau phu on the first vowel and not on the second in this case. (There are exceptions.) I'm going to try to find an actual pair of dictionary words that are affected.
(In reply to comment #4)
> Thank you!
>
> Looking at this closely, the sample data im using arent actual words, because
> the accent marks placement rules would put the dau phu on the first vowel and
> not on the second, in this case. (there are exceptions)
>
> I'm going to try to find an actual pair of dictionary words that are affected.
I have all you need :-) http://repo.or.cz/w/words-vi.git
The word list above was extracted from a Vietnamese dictionary. I may have made mistakes typing it, but overall it's quite accurate. You can try it and ping me if you suspect some words (or the word order) are wrong.
Created attachment 356835
utf8 text data

Words taken from: http://www.informatik.uni-leipzig.de/~duc/Dict/
These are actual words that collate incorrectly.
Should sort as: a,i a_hook,i a_acute,i
glibc locale says:
a,i 0C 18 01 08 08 01 02 02
a_acute,i 0C 18 01 0A 08 01 02 02
a_hook,i 0C 18 01 08 0C 01 02 02
I read the collation sequences as:
0C == letter a
18 == letter i
01 == end of collation class (letter)
08 == bas
0A == hook
0C == acute
01 == end of collation class (accent)
02 == lowercase
So the error is in the hook accent: it's getting inverted as if it were attached to the letter i rather than the letter a. I found this when reindexing a dictionary. The diff was much larger than I expected, with the headings all mixed up.
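A minimal Python sketch of the reading described above: it splits a hex dump of a sort key at each 01 byte into letter, accent, and case levels. It assumes 01 never occurs as a weight, and split_levels is a hypothetical helper for illustration, not part of glibc.

```python
def split_levels(key_hex):
    """Split a space-separated hex dump of a sort key into its levels.

    Assumption: the byte 01 only ever acts as a level separator.
    """
    levels = [[]]
    for byte in key_hex.split():
        if byte == "01":
            levels.append([])        # start the next level
        else:
            levels[-1].append(byte)  # weight belongs to the current level
    return levels


# Keys exactly as quoted in the comment above.
keys = {
    "a,i": "0C 18 01 08 08 01 02 02",
    "a_acute,i": "0C 18 01 0A 08 01 02 02",
    "a_hook,i": "0C 18 01 08 0C 01 02 02",
}

for word, key in keys.items():
    letters, accents, case = split_levels(key)
    print(f"{word:10} letters={letters} accents={accents} case={case}")
```

All three keys share the same letter level, so the order of these words is decided entirely by the accent level.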
Ah, just saw your word list. I'm sure I can find many more examples, if needed
If/When you have something to contribute please coordinate with Pravin (cc:ed now). Pravin knows how the locale extensions have to be written.
Pravin or Ulrich,
Is it possible for a locale to jumble the Latin accents within a string? This seems erroneous to me:
a_hook,i 0C 18 01 08 0C 01 02 02
a i end noaccent hook end lowercase lowercase
I think it should be:
a_hook,i 0C 18 01 0C 08 01 02 02
a i end hook noaccent end lowercase lowercase
An observation, for the string a, agrave, ahook, atilde, aacute, adotbelow:
strxfrm = 0C 0C 0C 0C 0C 0C 01 1A 0C 0B 0A 09 08 01 02 02 02 02 02 02
The collating values for the accents are precisely backwards. Trying the same string again in en_US.utf8:
0C 0C 0C 0C 0C 0C 01 1A 09 11 17 0A 08 01 02 02 02 02 02 02
This has got to be wrong for en_US.utf8 also...
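The vi_VN key above can be reproduced with a small model: letter weights first, then accent weights in reverse order, then case weights, with 01 between levels. This is only a Python sketch of how the bytes appear to be laid out. The weight tables are copied from the values quoted in this report, and sort_key is a hypothetical helper, not glibc's implementation.

```python
# Weights as quoted in this report for vi_VN (assumed, not read from glibc).
LETTER = {"a": 0x0C}
ACCENT = {"none": 0x08, "grave": 0x09, "hook": 0x0A,
          "tilde": 0x0B, "acute": 0x0C, "dotbelow": 0x1A}
LOWERCASE = 0x02
SEPARATOR = 0x01


def sort_key(chars, backward_accents=True):
    """Build a model sort key from (letter, accent) pairs, all lowercase."""
    letters = [LETTER[letter] for letter, _ in chars]
    accents = [ACCENT[accent] for _, accent in chars]
    if backward_accents:
        accents.reverse()  # the behaviour observed in this bug
    case = [LOWERCASE] * len(chars)
    return bytes(letters + [SEPARATOR] + accents + [SEPARATOR] + case)


def dump(key):
    """Format a key the way the hex dumps in this report are written."""
    return " ".join(f"{b:02X}" for b in key)


text = [("a", "none"), ("a", "grave"), ("a", "hook"),
        ("a", "tilde"), ("a", "acute"), ("a", "dotbelow")]

print(dump(sort_key(text)))                          # backward accents
print(dump(sort_key(text, backward_accents=False)))  # forward accents
```

With backward_accents=True the first printed line matches the vi_VN dump above byte for byte, which is consistent with the accent level being emitted last to first.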
(In reply to comment #9)
> Pravin or Ulrich,
> Is it possible for a locale to jumble the latin accents within a string?
Sorry, I did not understand exactly what "jumbling" means here, but you can write custom rules for accents in the locale file for vi_VN. The problem, as I understand it from the comments above, is that data sorts correctly for single letters but not when we give it words.
Sorting: á a ả
gives: a ả á
But sorting: ái ai ải
gives: ai ái ải
which is wrong. Have I understood the point correctly?
You are correct, Pravin. The sorting for single letters is correct, but for words it is wrong. By "jumble", I mean that the characters are compared first to last, but the accents are compared last to first. This is backwards. I suspect that the problem might not be in the locale at all, but perhaps in the Latin strxfrm logic, and it may affect all locales that care about accented characters.
In vi_VN, the accent orders are: a (08), agrave (09), ahook (0A), atilde (0B), aacute (0C), adotbelow (1A):
strxfrm returns = 1A 0C 0B 0A 09 08
But it should be = 08 09 0A 0B 0C 1A
Also, en_US.utf8 sorts these two words incorrectly: résume resumé
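To see what forward versus backward accent comparison does to the two English words, here is a rough Python sketch. It compares only the accent level (the base letters of both words are identical) and uses each combining mark's code point as a stand-in weight, which is an assumption; real weights come from the locale's collation tables.

```python
import unicodedata


def accent_marks(word):
    """Return one tuple of combining marks per base letter of the word.

    Assumption: the word starts with a base letter, not a combining mark.
    """
    marks = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            marks[-1] += (ord(ch),)  # attach the mark to the previous letter
        else:
            marks.append(())         # new base letter, no marks yet
    return marks


def sorts_first(a, b, backward):
    """Return whichever word sorts first when only accents are compared."""
    key_a, key_b = accent_marks(a), accent_marks(b)
    if backward:
        key_a, key_b = key_a[::-1], key_b[::-1]  # compare last to first
    return a if key_a <= key_b else b


print("backward:", sorts_first("résume", "resumé", backward=True))
print("forward: ", sorts_first("résume", "resumé", backward=False))
```

In this model the backward rule puts résume first, because the last accent is examined first, while the forward rule puts resumé first.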
The rules say that latin diacritics are ordered backwards (except in the de_DE locale).
Hello Andreas. I'd like to know more about the rules you refer to. Based on the Unicode standard, the sorting seems incorrect for both en_US and vi_VN. Certainly, for Vietnamese the backwards sorting order for Latin accents doesn't match dictionaries, common conventions, or the Unicode standard. It seems a bit counter-intuitive to me to sort a string's accents RTL... If you have a link or an RFC number I could refer to, I'd love to know more about that.
ISO/IEC 14651:2001
ISO/IEC 14651:2001
Related section: D.2, point 2.
It states a rule that only applies to a subset of French dictionaries and isn't universal (even for French). It should not apply to English, or most other languages. The Vietnamese and English collation results are clearly wrong, IMO. It looks like this is a bug. If someone knows a way to disable this behavior within a locale file, I'd be glad to use that to mitigate the problem temporarily. But the correct action would seem to be to fix the code that orders the diacritics. (Perhaps I'll peek through strcoll to see if I can find it...)
Almost nobody (including myself) understands collation. That's something for librarians and language researchers. Don't even think about changing anything for English and other supported languages. If you think Vietnamese isn't handled correctly then by all means, change it. But don't touch anything else. And the strcoll/strxfrm code of course doesn't directly have anything to do with the rules. These functions just interpret data generated by localedef. All the logic is described by the locale files. And if you had paid attention to what Andreas wrote, you would have immediately been able to spot what has to be done to enable forward handling of diacritics. The de_DE locale contains:
define DIACRIT_FORWARD
Created attachment 356908
locale file

Thanks Ulrich. Yeah, a quick code review shows that strcoll/strxfrm are pretty flexible. I'll focus on vi_VN only then, since that's the part that's giving me trouble. I tried adding "define DIACRIT_FORWARD" and it's mostly working. I found a new problem: strcoll and strcmp(strxfrm) are giving different results. (strcmp(strxfrm) works as expected, but strcoll gives letter case higher precedence than accents; otherwise it is correct.) Attached is the vi_VN locale file I'm working with now. I'll keep debugging.
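A check of this kind can be automated. The following is a hedged Python sketch, not the attached C++ test program: it compares every pair of words with strcoll and with the transformed keys from strxfrm, and reports pairs where the two disagree. find_disagreements is a hypothetical helper; the comparison functions are parameters so the check can also be exercised without any particular locale installed.

```python
import locale
from itertools import combinations


def sign(n):
    """Reduce a comparison result to -1, 0, or 1."""
    return (n > 0) - (n < 0)


def find_disagreements(words, coll=locale.strcoll, xfrm=locale.strxfrm):
    """Return (a, b, coll_sign, xfrm_sign) for pairs ordered inconsistently."""
    bad = []
    for a, b in combinations(words, 2):
        by_coll = sign(coll(a, b))
        key_a, key_b = xfrm(a), xfrm(b)
        by_xfrm = (key_a > key_b) - (key_a < key_b)
        if by_coll != by_xfrm:
            bad.append((a, b, by_coll, by_xfrm))
    return bad


# To test a real locale, select it first. This assumes the locale is
# installed on the system; setlocale raises locale.Error otherwise.
# locale.setlocale(locale.LC_COLLATE, "vi_VN.UTF-8")
# print(find_disagreements(["Hà Tiên", "hà tiện"]))
```

An empty result means strcoll and strxfrm agree on every pair tested; any tuple returned is a candidate for the kind of mismatch described above.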
Pravin, I've almost got it working. I want "Hà Tiên" to sort before "hà tiện". With the attached locale file, strxfrm is working correctly, but strcoll is treating the capitalization as more significant than the diacritic.
"Hà Tiên" strcoll "hà tiện" = 7 (after/greater)
strxfrm Hà Tiên = 17 0C 25 18 14 1D 01 08 09 08 08 08 08 01 09 02 09 02 02 02 01 03 30
strxfrm hà tiện = 17 0C 25 18 14 1D 01 08 09 08 08 0D 08 01 02 02 02 02 02 02 01 03 30
Do you see what I might be doing wrong? Any input/pointers would be appreciated.
Created attachment 357266
better locale file/Tiếng Việt

I figured out "Hà Tiên" strcoll "hà tiện": it was the misclassification of PCT vs BPT. I guess the main Latin file had moved on while the Viet locale was using the older style, and somehow the interaction messed up strcoll without affecting strxfrm... mystifying, but I'm moving on.

After implementing a version of the Unicode standard sort algorithm for vi_VN, also using a third-party version (libicu), and comparing to a dictionary, I have discovered several more deficiencies in the locale versus the standards. The only way to resolve this would be to not include iso14651_t1_common and to define vi collation independently. (Non-letters being moved to round 4 causes missorting.) I now understand why so many projects have moved to (the really slow) libicu and avoided using libc locales for this, but there are enough projects that rely upon native locales for this to be worth fixing, for me at least. (Complaints about the mother tongue can wait.)

If anyone out there is still listening, I'd appreciate any insight/comments on this latest locale file.
Tested the locale file you attached with the sort function. I think it now gives results as per your expectation.
test data: ái ai ải hà tiện á a Hà Tiên ả
o/p: Hà Tiên a ai hà tiện á ái ả ải
Please test it with a bit more data, and attach the test cases and test results here as well, so that anyone interested in testing can run the same tests.
Created attachment 357347
Unit Test Input to "sort"
Created attachment 357349
expected sorted output of unit test
Created attachment 357350
locale file, seems to be working

Pravin, thank you! I have added unit test data as you suggested, and a new locale file. This locale file matches the output of libicu's collation algorithm, tested against every word in a Vietnamese dictionary.
Yeah, I tested the locale file with your test data and it's working fine. But IMO it would be nice if you kept the line copy "iso14651_t1" and did the rest with reorder-after. The iso14651_t1_common file is like a universal collation file and contains collation info for all other scripts. Including it keeps the collation data of other scripts available when the vi_VN locale is selected.
Thanks Pravin. I'd like that as well, but I suspect that there may be no way to pass the unit test and still include iso14651_t1_common. Vietnamese words are mostly compounds, and when whitespace is ignored until round 4, it becomes impossible to sort correctly.
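A toy Python illustration of the whitespace point, under loud assumptions: it models only the first comparison round, uses made-up unaccented syllables rather than real dictionary entries, and treats a significant space as sorting below every letter. first_round_key is a hypothetical helper, not how localedef represents anything.

```python
def first_round_key(phrase, space_is_significant):
    """Model of the first comparison round for a phrase.

    If spaces are ignored in this round, they are simply dropped, so a
    compound compares as if it were written as one run of letters.
    """
    if space_is_significant:
        return phrase                 # ' ' compares below every letter
    return phrase.replace(" ", "")    # space deferred to a later round


# Made-up syllables, for illustration only.
entries = ["hab", "ha ba"]

ignored = sorted(entries, key=lambda p: first_round_key(p, False))
significant = sorted(entries, key=lambda p: first_round_key(p, True))

print("space ignored early:    ", ignored)
print("space significant early:", significant)
```

When the space is dropped in the first round, the compound is ordered as "haba" and lands after "hab"; when the space counts, the compound groups with its first syllable and lands before it. A later round can only break ties, so it cannot undo the first outcome.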
In that case, I think that for a quick fix of this bug we should go with your update to the collation, and when someone finds a fix that also includes the iso* file, we should go with that. Ulrich, what do you say about this?
Srin, have you talked to the original author? I'm not going to change anything without the author agreeing or being unresponsive.
Ulrich,
I haven't spoken to the original authors; several people have touched the locale file over time. Pablo's name is currently in the file itself. If you prefer that he weigh in, I'll forward him a link to this bug. The latest layout was done by someone else, however...
(In reply to comment #29)
> Ulrich,
> I havent spoken to the original authors: several of which have touched the
> locale file over time. Pablo's name is currently in the file itself: If you
> prefer that he weigh in, I'll forward him a link to here. The latest layout was
> done by someone else however...
The last "significant" changes in the collation part were from bug 448 [1]. The collation part was modified by Samuel Thibault in that bug. Were you referring to him as "someone else"? There are only a few people who touched the collation part, according to glibc.git:
- Kentaroh Noji and Tetsuji Orita (original authors)
- Pablo Saratxaga
- Samuel Thibault
[1] http://sources.redhat.com/bugzilla/show_bug.cgi?id=448
It is pretty clear that the currently attached locale file is not usable. You cannot define collation rules only for one language. Using the common file ensures that all the languages are handled. I can imagine the current iso14651_common file is not sufficient. The solution, though, is not to replace the rules but to change them using reorder-after etc. There are plenty of examples in the source tree. Pravin (already cc:ed) might be able to help you.
Can someone attach the complete sort order required for vi_VN.UTF-8? It should include all characters required for vi_VN.UTF-8, sorted in the expected order. I will take a look at it.
This message is a reminder that Fedora 11 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 11. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '11'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 11's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 11 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 11 changed to end-of-life (EOL) status on 2010-06-25. Fedora 11 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.