516467 – Vietnamese Locale does not collate correctly

Summary: Vietnamese Locale does not collate correctly

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	11
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Andreas Schwab
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-08-10 00:04 UTC by Srin Tuar
Modified:	2010-06-28 14:03 UTC
CC List:	5 users
Fixed In Version:
Clone Of:	144487
Environment:
Last Closed:	2010-06-28 14:03:21 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
utf8 text data (25 bytes, text/plain) 2009-08-10 00:04 UTC, Srin Tuar	no flags
c++ test case program (1.96 KB, text/plain) 2009-08-10 00:05 UTC, Srin Tuar	no flags
utf8 text data (12 bytes, text/plain) 2009-08-10 01:57 UTC, Srin Tuar	no flags
locale file (10.18 KB, text/plain) 2009-08-10 15:43 UTC, Srin Tuar	no flags
better locale file/Tiếng Việt (10.09 KB, text/plain) 2009-08-13 02:29 UTC, Srin Tuar	no flags
Unit Test Input to "sort" (108 bytes, text/plain) 2009-08-13 15:43 UTC, Srin Tuar	no flags
expected sorted output of unit test (108 bytes, text/plain) 2009-08-13 15:44 UTC, Srin Tuar	no flags
locale file, seems to be working (16.70 KB, text/plain) 2009-08-13 15:46 UTC, Srin Tuar	no flags

Description Srin Tuar 2009-08-10 00:04:36 UTC

Created attachment 356828 [details]
utf8 text data

+++ This bug was initially created as a clone of Bug #144487 +++

Vietnamese collation is broken in locale vi_VN.UTF-8.
This seems to be a regression, because it was working after the
bug was previously fixed in #144487. I don't see any changes
in the vi_VN file that would explain this problem; it may be
in another file, or in glibc's collation routines themselves.

(This Bugzilla doesn't seem to preserve UTF-8 well, so the sample
data is included as an attachment to this bug.)

To see the correct sorting order:
 IBM's Unicode project (ICU) sorts correctly:
 http://demo.icu-project.org/icu-bin/locexp?_=vi&d_=en&x=col
 This table is derived from the Unicode Collation Algorithm:
 http://vietunicode.sourceforge.net/charset/v3.htm
 Using OpenOffice Calc, the collation order for Vietnamese
  works correctly. (It doesn't rely on the glibc locales.)

The command:
LC_ALL=vi_VN sort < test_data.txt

will generate output that doesn't match any of the others. The others
are all consistent with each other and with online sources such as dictionaries.
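
For anyone who wants to reproduce the check without the sort command, here is a minimal Python sketch of the same idea. It assumes the named locale is installed on the system; sort_with_locale is a made-up helper name, and this is not the attached C++ test program.

```python
import locale


def sort_with_locale(lines, locale_name):
    """Sort lines using the collation rules of locale_name.

    Hypothetical helper for illustration. Raises locale.Error if the
    requested locale is not installed on this system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # locale.strxfrm wraps the C library's transformation function, so the
        # result reflects whatever the installed locale file defines.
        return sorted(lines, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # always restore
```

Calling sort_with_locale(words, "vi_VN.UTF-8") on the lines of test_data.txt should then give the same order as the sort command above, since both rely on the C library's collation data.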

Comment 1 Srin Tuar 2009-08-10 00:05:34 UTC

Created attachment 356829 [details]
c++ test case program

Comment 2 Srin Tuar 2009-08-10 00:18:17 UTC

A regression test might be beneficial for locale collation.
This problem doesn't appear with single letters as the strings,
only when using words with multiple letters.

Using strxfrm, I looked at these letters' collation sequences:

u:       26 01 08 01 02
a_grave: 0C 01 09 01 02
a_hook:  0C 01 0A 01 02

a_grave comes before a_hook, correctly.

But:
the 2-character string <u,a_grave> strxfrm's into:
    26 0C 01 09 08 01 02 02
the 2-character string <u,a_hook> strxfrm's into:
    26 0C 01 08 0A 01 02 02

If the third byte of each second character's collation sequence were combined in
the same order, this would work correctly. But due to the reversal,
the string with a_hook moves before the string containing a_grave.

This is analogous to ac sorting before ab, despite b normally coming before c.

The grave and dotbelow are both affected.

Comment 3 Nguyen Thai Ngoc Duy 2009-08-10 01:28:53 UTC

Thanks for pulling me in. I haven't had time to look into it. I confirm the problem only happens with multiple letters (I wonder whether it already happened when I submitted the updated locale, because I only tested that with single letters).

Comment 4 Srin Tuar 2009-08-10 01:38:16 UTC

Thank you!

Looking at this closely, the sample data I'm using aren't actual words, because
the accent mark placement rules would put the dau phu on the first vowel and
not on the second, in this case. (There are exceptions.)

I'm going to try to find an actual pair of dictionary words that are affected.

Comment 5 Nguyen Thai Ngoc Duy 2009-08-10 01:43:28 UTC

(In reply to comment #4)
> Thank you!
> 
> Looking at this closely, the sample data I'm using aren't actual words, because
> the accent mark placement rules would put the dau phu on the first vowel and
> not on the second, in this case. (There are exceptions.)
> 
> I'm going to try to find an actual pair of dictionary words that are affected.  

I have all you need  :-) http://repo.or.cz/w/words-vi.git

The word list above was extracted from a Vietnamese dictionary. I may have made mistakes typing it, but overall it's quite accurate. You can try it and ping me if you suspect some words (or the word order) are wrong.

Comment 6 Srin Tuar 2009-08-10 01:57:46 UTC

Created attachment 356835 [details]
utf8 text data

Words taken from:
http://www.informatik.uni-leipzig.de/~duc/Dict/

These are actual words that collate incorrectly.

should sort as:
a,i       
a_hook,i
a_acute,i

glibc locale says:
a,i              0C 18 01 08 08 01 02 02
a_acute,i        0C 18 01 0A 08 01 02 02
a_hook,i         0C 18 01 08 0C 01 02 02

I read the collation sequences as:
0C == letter a
18 == letter i
01 == end of collation class (letter)
08 == base (no accent)
0A == hook
0C == acute
01 == end of collation class (accent)
02 == lowercase

So the error is in the hook accent: it's getting inverted as if it were attached to the letter i rather than the letter a.

I found this when reindexing a dictionary. The diff was much larger than I expected, with the headings all mixed up.
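
To make dumps like the ones above easier to read, a small Python sketch can split a key into its levels. It assumes the reading given in this comment, namely that the byte 01 closes each level; that is an interpretation of the observed output, not something confirmed from the glibc sources, and split_levels is a made-up helper name.

```python
def split_levels(hex_dump, separator=0x01):
    """Split a space-separated hex dump of a strxfrm key into levels.

    Assumes (per the reading in this report) that `separator` ends a level.
    """
    levels = [[]]
    for token in hex_dump.split():
        value = int(token, 16)
        if value == separator:
            levels.append([])        # separator closes the current level
        else:
            levels[-1].append(value)
    return levels


# Key reported above for the string <a,i>
for level in split_levels("0C 18 01 08 08 01 02 02"):
    print(" ".join(f"{value:02X}" for value in level))
# prints:
# 0C 18
# 08 08
# 02 02
```

Read this way, the first level holds the letters, the second the accents, and the third the case, which makes it easier to see in which level two keys first differ.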

Comment 7 Srin Tuar 2009-08-10 01:59:37 UTC

Ah, I just saw your word list. I'm sure I can find many more examples, if needed.

Comment 8 Ulrich Drepper 2009-08-10 02:06:33 UTC

If/When you have something to contribute please coordinate with Pravin (cc:ed now).  Pravin knows how the locale extensions have to be written.

Comment 9 Srin Tuar 2009-08-10 02:48:02 UTC

Pravin or Ulrich,
Is it possible for a locale to jumble the Latin accents within a string?

This seems erroneous to me:
 a_hook,i 
   0C 18 01     08      0C   01   02 02
    a  i end  noaccent  hook end  lowercase lowercase

 I think it should be:
 a_hook,i 
   0C 18 01     0C    08        01   02 02
    a  i end  hook   noaccent   end  lowercase lowercase

Comment 10 Srin Tuar 2009-08-10 04:39:21 UTC

an observation:

a, agrave, ahook, atilde, aacute, adotbelow:
strxfrm=0C 0C 0C 0C 0C 0C 01 1A 0C 0B 0A 09 08 01 02 02 02 02 02 02 

the collating values for the accents are precisely backwards.

trying the same string again in en_US.utf8:
0C 0C 0C 0C 0C 0C 01 1A 09 11 17 0A 08 01 02 02 02 02 02 02

This has got to be wrong for en_US.utf8 also...

Comment 11 Pravin Satpute 2009-08-10 06:52:20 UTC

(In reply to comment #9)
> Pravin or Ulrich,
> Is it possible for a locale to jumble the Latin accents within a string?
Sorry, I did not understand exactly what jumbling means here,
but you can write your own custom rules for accents in the locale file for vi_VN.

The problem as I understand it from the above comments is:
data is sorted correctly for single letters, but not sorted properly when we give words.

á
a
ả
Sorting this gives the result:
a
ả
á


but when we give

ái
ai
ải

Sorting this gives the result:

ai
ái
ải

which is wrong.

Have I understood the problem correctly?

Comment 12 Srin Tuar 2009-08-10 12:21:01 UTC

You are correct, Pravin. The sorting for single letters is correct, but for words it is wrong. By "jumble", I mean that the characters are compared first to last, but the accents are compared last to first. This is backwards.

I suspect that the problem might not be in the locale at all, but perhaps in the Latin strxfrm logic, and it may affect all locales that care about accented characters.

in vi_VN, the accent orders are:
a (08), agrave(09), ahook(0A), atilde(0B), aacute(0C), adotbelow(1A):
strxfrm returns   = 1A 0C 0B 0A 09 08
But it should be  = 08 09 0A 0B 0C 1A

Also, en_US.utf8 sorts these two words incorrectly:
résume     
resumé
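
The "first to last for letters, last to first for accents" behaviour can be sketched with a toy model in Python. This is not glibc's implementation. It only assumes a three-level key (letters, accents, case) separated by 01, using the accent weights quoted in this comment, and toy_key is a made-up name. With backward accents it reproduces the six-letter key from comment 10. It does not claim to reproduce every word-level dump in this report.

```python
SEPARATOR = 0x01

# (primary, secondary, tertiary) weights as read from the dumps in this report:
# primary 0C = letter a, tertiary 02 = lowercase, secondary = accent.
A = (0x0C, 0x08, 0x02)
A_GRAVE = (0x0C, 0x09, 0x02)
A_HOOK = (0x0C, 0x0A, 0x02)
A_TILDE = (0x0C, 0x0B, 0x02)
A_ACUTE = (0x0C, 0x0C, 0x02)
A_DOTBELOW = (0x0C, 0x1A, 0x02)


def toy_key(weights, backward):
    """Build a toy sort key; the accent level is reversed when backward=True."""
    primary = [w[0] for w in weights]
    secondary = [w[1] for w in weights]
    tertiary = [w[2] for w in weights]
    if backward:
        secondary.reverse()
    return primary + [SEPARATOR] + secondary + [SEPARATOR] + tertiary


def as_hex(key):
    return " ".join(f"{value:02X}" for value in key)


six = [A, A_GRAVE, A_HOOK, A_TILDE, A_ACUTE, A_DOTBELOW]
print(as_hex(toy_key(six, backward=True)))
# prints: 0C 0C 0C 0C 0C 0C 01 1A 0C 0B 0A 09 08 01 02 02 02 02 02 02

# Two artificial strings whose relative order flips with the direction:
x = [A_GRAVE, A]   # <a_grave, a>
y = [A, A_HOOK]    # <a, a_hook>
print(toy_key(y, backward=False) < toy_key(x, backward=False))  # prints: True
print(toy_key(x, backward=True) < toy_key(y, backward=True))    # prints: True
```

In the forward case the accent on the first letter decides, so y sorts first. In the backward case the accent on the last letter decides, so x sorts first.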

Comment 13 Andreas Schwab 2009-08-10 12:46:00 UTC

The rules say that Latin diacritics are ordered backwards (except in the de_DE locale).

Comment 14 Srin Tuar 2009-08-10 13:28:33 UTC

Hello Andreas.

I'd like to know more about the rules you refer to. Based on the Unicode standard, the sorting seems incorrect for both en_US and vi_VN. Certainly, for Vietnamese the backwards sorting order for Latin accents doesn't match dictionaries, common conventions, or the Unicode standard.

It seems a bit counter-intuitive to me to sort a string's accents RTL... If you have a link or an RFC number I could refer to, I'd love to know more about that.

Comment 15 Andreas Schwab 2009-08-10 13:51:21 UTC

ISO/IEC 14651:2001

Comment 16 Srin Tuar 2009-08-10 14:31:34 UTC

ISO/IEC 14651:2001
Related section: D.2, point 2.
 It states a rule that only applies to a subset of French dictionaries and isn't universal (even for French). It should not apply to English, or to most other languages. The Vietnamese and English collation results are clearly wrong, IMO.

It looks like this is a bug.

If someone knows a way to disable this behavior within a locale file, I'd be glad to use that to mitigate the problem temporarily. But the correct action would seem to be to fix the code that orders the diacritics. (Perhaps I'll peek through strcoll to see if I can find it...)

Comment 17 Ulrich Drepper 2009-08-10 14:45:14 UTC

Almost nobody (including myself) understands collation.  That's something for librarians and language researchers.  Don't even think about changing anything for English and other supported languages.

If you think Vietnamese isn't handled correctly then by all means, change it.  But don't touch anything else.

And the strcoll/strxfrm code of course doesn't directly have anything to do with the rules.  These functions just interpret data generated by localedef.  All the logic is described by the locale files.  And if you had paid attention to what Andreas wrote, you would have immediately been able to spot what has to be done to enable forward handling of diacritics.  The de_DE locale contains

define DIACRIT_FORWARD

Comment 18 Srin Tuar 2009-08-10 15:43:52 UTC

Created attachment 356908 [details]
locale file

Thanks, Ulrich. Yes, a quick code review shows that strcoll/strxfrm are pretty flexible.

I'll focus on vi_VN only then, since that's the part that's giving me trouble.

I tried adding "define DIACRIT_FORWARD" and it's mostly working.

I found a new problem: strcoll and strcmp(strxfrm) are giving different results. (strcmp(strxfrm) works as expected, but strcoll gives letter case higher precedence than accents; otherwise it is correct.)

Attached is the vi_VN locale file I'm working with now. I'll keep debugging.
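
A disagreement between strcoll and strcmp(strxfrm) can be hunted for mechanically. The following Python sketch compares the two for every pair in a word list. find_disagreements is a made-up helper name, and the defaults use Python's locale.strcoll and locale.strxfrm, which wrap the C library under whatever LC_COLLATE locale is currently set. The two functions are parameters so the check can also be exercised with stand-ins.

```python
import itertools
import locale


def sign(number):
    """Reduce a comparison result to -1, 0 or 1."""
    return (number > 0) - (number < 0)


def find_disagreements(words, coll=locale.strcoll, xfrm=locale.strxfrm):
    """Return the pairs for which coll() and comparing xfrm() keys disagree."""
    disagreements = []
    for left, right in itertools.combinations(words, 2):
        by_coll = sign(coll(left, right))
        key_left, key_right = xfrm(left), xfrm(right)
        by_xfrm = (key_left > key_right) - (key_left < key_right)
        if by_coll != by_xfrm:
            disagreements.append((left, right))
    return disagreements
```

After a locale.setlocale(locale.LC_COLLATE, ...) call for the locale under test, an empty result means the two code paths agree on the given words, and any returned pair is a candidate for a closer look.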

Comment 19 Srin Tuar 2009-08-10 20:49:38 UTC

Pravin, I've almost got it working.

I want "Hà Tiên" to sort before "hà tiện".
With the attached locale file, strxfrm is working correctly, but strcoll
is treating the capitalization as more significant than the diacritic.

"Hà Tiên" strcoll "hà tiện" = 7 (after/greater)
strxfrm Hà Tiên =17 0C 25 18 14 1D 01 08 09 08 08 08 08 01 09 02 09 02 02 02 01 03 30
strxfrm hà tiện =17 0C 25 18 14 1D 01 08 09 08 08 0D 08 01 02 02 02 02 02 02 01 03 30

Do you see what I might be doing wrong? Any input/pointers would be appreciated.

Comment 20 Srin Tuar 2009-08-13 02:29:06 UTC

Created attachment 357266 [details]
better locale file/Tiếng Việt 

I figured out "Hà Tiên" strcoll "hà tiện": it was the misclassification of PCT vs BPT. I guess the main Latin file had moved on while the Vietnamese locale was still using the older style, and somehow the interaction messed up strcoll without affecting strxfrm... mystifying, but I'm moving on...

After implementing a version of the Unicode standard sort algorithm for vi_VN, also using a third-party implementation (libicu), and comparing against a dictionary, I have discovered several more deficiencies in the locale relative to the standards... The only way to resolve this would be to not include iso14651_t1_common and to define Vietnamese collation independently... (non-letters being moved to round 4 causes missorting)

I now understand why so many projects have moved to (the really slow) libicu and avoided using libc locales for this... but there are enough projects that rely upon native locales for this to be worth fixing, for me at least. Complaints about my native language can wait until later.

If anyone out there is still listening, I'd appreciate any insight or comments on this latest locale file.

Comment 21 Pravin Satpute 2009-08-13 09:51:52 UTC

I tested the locale file you attached with the sort command;
I think it now gives the result you expect.

test data:
ái
ai
ải
hà tiện
á
a
Hà Tiên
ả

output:
Hà Tiên
a
ai
hà tiện
á
ái
ả
ải

Please test it with a bit more data, and attach the test cases and test results here as well, so that anyone interested in testing can run the same tests.

Comment 22 Srin Tuar 2009-08-13 15:43:48 UTC

Created attachment 357347 [details]
Unit Test Input to "sort"

Comment 23 Srin Tuar 2009-08-13 15:44:28 UTC

Created attachment 357349 [details]
expected sorted output of unit test

Comment 24 Srin Tuar 2009-08-13 15:46:46 UTC

Created attachment 357350 [details]
locale file, seems to be working

Pravin, thank you! I have added unit test data as you suggested, along with a new locale file.

This locale file matches the output of libicu's collation algorithm, tested against every word in a Vietnamese dictionary.
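
For anyone rerunning the unit test, a small Python sketch for comparing an actual sort result with the expected file line by line. first_difference is a made-up helper name, and both inputs are assumed to be lists of lines already read from the two files.

```python
def first_difference(actual, expected):
    """Return (line_number, actual_line, expected_line) for the first mismatch.

    Returns None when both lists are identical. A missing line is reported
    as None on the side that ran out first.
    """
    for number, (got, want) in enumerate(zip(actual, expected), start=1):
        if got != want:
            return number, got, want
    if len(actual) != len(expected):
        shorter = min(len(actual), len(expected))
        got = actual[shorter] if len(actual) > shorter else None
        want = expected[shorter] if len(expected) > shorter else None
        return shorter + 1, got, want
    return None
```

Reporting only the first mismatch keeps the output readable, since one misplaced word tends to shift every line after it.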

Comment 25 Pravin Satpute 2009-08-14 08:22:17 UTC

Yes, I tested the locale file with your test data and it works fine.
But IMO it would be nice if you kept the line
copy "iso14651_t1" and did the rest with reorder-after.

The iso14651_t1_common file is like a universal collation file and contains collation info for all other scripts. Including it keeps the collation data for other scripts available when the vi_VN locale is selected.

Comment 26 Srin Tuar 2009-08-14 12:48:44 UTC

Thanks Pravin.
I'd like that as well, but I'm suspicious that there may be no way to pass the unit test and still include that iso14651_t1_common. Vietnamese words are mostly compounds, and by ignoring whitespace until round 4, it becomes impossible to sort correctly.
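
The effect of ignoring whitespace at the first level can be shown with a toy comparison in Python. The phrases are unaccented placeholders chosen only for illustration (they are not taken from the attached test data), the weights are plain code points rather than real locale weights, and both key functions are made-up names.

```python
def key_ignoring_spaces(phrase):
    """First-level key when spaces carry no weight at this level."""
    return [ord(char) for char in phrase if char != " "]


def key_with_spaces(phrase):
    """First-level key when a space sorts before every letter."""
    return [0 if char == " " else ord(char) for char in phrase]


phrases = ["anh", "an ninh", "an"]
print(sorted(phrases, key=key_ignoring_spaces))
# prints: ['an', 'anh', 'an ninh']
print(sorted(phrases, key=key_with_spaces))
# prints: ['an', 'an ninh', 'anh']
```

When the space is ignored, "an ninh" is compared as if it were "anninh" and lands after "anh". When the space is significant, compounds stay grouped directly after their first syllable, which is the behaviour this comment says is needed to pass the unit test.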

Comment 27 Pravin Satpute 2009-08-17 05:44:10 UTC

In that case, I think that to fix this bug quickly we should go with your collation update, and when someone finds a fix that also includes the iso* file, we should switch to that.

Ulrich, what do you say about this?

Comment 28 Ulrich Drepper 2009-10-14 07:01:45 UTC

Srin,

have you talked to the original author?  I'm not going to change anything without the author agreeing or being unresponsive.

Comment 29 Srin Tuar 2009-11-23 06:59:33 UTC

Ulrich, 
I haven't spoken to the original authors; several people have touched the locale file over time. Pablo's name is currently in the file itself. If you prefer that he weigh in, I'll forward him a link to this bug. The latest layout was done by someone else, however...

Comment 30 Nguyen Thai Ngoc Duy 2010-02-24 15:27:15 UTC

(In reply to comment #29)
> Ulrich, 
> I haven't spoken to the original authors; several people have touched the
> locale file over time. Pablo's name is currently in the file itself. If you
> prefer that he weigh in, I'll forward him a link to this bug. The latest layout was
> done by someone else, however...

The last "significant" change to the collation part was from bug 448 [1]. The collation part was modified by Samuel Thibault in that bug. Were you referring to him as "someone else"?

Only a few people have touched the collation part, according to glibc.git:

 - Kentaroh Noji and Tetsuji Orita (original authors)
 - Pablo Saratxaga
 - Samuel Thibault

[1] http://sources.redhat.com/bugzilla/show_bug.cgi?id=448

Comment 31 Ulrich Drepper 2010-02-25 01:38:32 UTC

It is pretty clear that the currently attached locale file is not usable.  You cannot define collation rules only for one language.  Using the common file ensures that all the languages are handled.

I can imagine the current iso14651_common file is not sufficient.  The solution, though, is not to replace the rules but to change them using reorder-after etc.  There are plenty of examples in the source tree.

Pravin (already cc:ed) might be able to help you.

Comment 32 Pravin Satpute 2010-02-25 04:12:08 UTC

Can someone attach the complete sort order required for vi_VN.UTF-8?

It should include all characters required for vi_VN.UTF-8, sorted in the expected order.

I will take a look at it.

Comment 33 Bug Zapper 2010-04-28 09:40:21 UTC

This message is a reminder that Fedora 11 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 11.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '11'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 11's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 11 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 34 Bug Zapper 2010-06-28 14:03:21 UTC

Fedora 11 changed to end-of-life (EOL) status on 2010-06-25. Fedora 11 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
