Bug 514110
Summary: | [RFE] [ta_IN] Tamil collation rules are not working in other locales
---|---
Product: | [Fedora] Fedora
Component: | glibc
Version: | rawhide
Hardware: | All
OS: | Linux
Status: | CLOSED INSUFFICIENT_DATA
Severity: | medium
Priority: | low
Reporter: | Pravin Satpute <psatpute>
Assignee: | Jeff Law <law>
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
CC: | ankit, drepper, fweimer, i18n-bugs, jakub, mnewsome, pnemade, santhosh.thottingal, skhome
Doc Type: | Bug Fix
Last Closed: | 2012-10-17 17:17:26 UTC
Description
Pravin Satpute
2009-07-28 03:49:05 UTC
Some discussion has already happened on mail. Thanks to Santhosh for preparing a patch that moves the ta_IN collation rules to the iso14651_t1_common file. I corrected the syntax errors and tested it with some test data. Please test it with some more Tamil data so we can make corrections before moving this to glibc.

Steps for testing:
1) Download the Tamil_Collation folder from http://pravins.fedorapeople.org/Tamil_Collation
2) cd Tamil_Collation
3) Append some more rules to the test file
4) LOCPATH=$PWD LC_ALL=ta_TT sort ../test > test_result
5) gedit test_result

Please attach the test file as well as the result in Bugzilla. If it is correct, I will proceed with the patch.

It would be nice if a Tamil speaker could test this patch once, so we can proceed. It has been in Bugzilla for two months now.

This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle. Changing version to '12'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Created attachment 380616 [details]
Sorting I am getting with my present patch.

I had a discussion with Felix; we need an order like:
க்
க
கா
கி
Working on it.
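The check in step 4 of the test procedure can also be scripted. The following is a minimal sketch, not part of the patch: `sort_with_locale` is a hypothetical helper name, and it assumes `LOCPATH` was exported before starting Python so that glibc can find the compiled `ta_TT` test locale from the Tamil_Collation folder. It returns `None` when the requested locale is not available, instead of silently sorting with the wrong rules.

```python
import locale


def sort_with_locale(lines, locale_name):
    """Sort lines with the collation rules of locale_name.

    Returns None if the locale cannot be loaded (for example, because
    LOCPATH does not point at the directory holding the compiled locale).
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm turns each string into a key that compares in locale order
        return sorted(lines, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    # "ta_TT" is the test locale name used in step 4 above.
    result = sort_with_locale(["கி", "க", "கா", "க்"], "ta_TT")
    print(result if result is not None else "locale ta_TT not found")
```

Sorting the contents of the test file this way should give the same order as the `sort` command in step 4, provided both see the same locale data.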
Created attachment 385655 [details]
This is the present sorting order I am getting with my patch.
Please check whether this is the proper order.
If yes, I will prepare a patch for glibc; otherwise we can correct it.
This is the correct order. You can go ahead.

Created attachment 385662 [details]
Patch for enhancing Tamil collation

Thanks, Felix, for the confirmation.
The above patch moves the Tamil collation rules from ta_IN to the iso_common (iso14651_t1_common) file.
It also has some enhancements compared to the previous collation rules:
* pure consonants are handled properly
* collation is now in the iso_common file, so one can see sorted Tamil characters and menus in any locale
Applied upstream.

Pravin,

Excuse me for participating rather late. My apologies. In the attachment to your comment above (Comment 5), I see the following:

[..] ஃ க் க கடந்த கா காலையில் கி கீ கு கூ கெ கே கேஎஸ் கேஎஸ்ரவிக்குமார் கை கொ கோ கௌ க்க க்கௗ க்ங ங் ங [..]

To be correct, க்க, க்கள, க்ங in the above should be between க் and க, and not, as shown above, between கௌ and ங்.

For example, if we sort { அ, அக், அகம், அக்கம், அக்கு, அகால } in ascending order, the correct result should be { அ, அக், அக்கம், அக்கு, அகம், அகால }. (அக் is of course a meaningless word, but that has no bearing on the sorting order.) The sorting order from your attachment suggests it will sort to { அ, அக், அகம், அகால, அக்கம், அக்கு }, which is wrong.

Compare with the English alphabet in ascending order: a, b, c, d, e, f, ..., x, y, z. Now ask where aa, ab, ac, ad, ..., az should be. Obviously they are between a and b, like a, aa, ab, ac, ad, ..., ax, ay, az, b, c, d, e, ... Right? Similarly, க் followed by any letter should be somewhere between க் and க. That is, if க் is followed by a character, the sort order is as follows:

க் < க்அ < க்ஆ < க்இ < க்ஈ < .......... க்ஒ < க்ஓ < க்ஔ < க்ஃ < க்க் < க்க < க்கா < க்கி ............ < க்கொ < க்கோ < க்கௌ < க்ங் < க்ங < க்ஙா ........

(Wherever I write "......." it means "and so on".)

In English sorting of words, if two words have the same first letter then they are next to each other and their relative placement is determined by the second letter of each; if those are also the same, then by the third letter, and so on ad nauseam. Essentially, sort first on the first letter, then wherever necessary on the second letter, and so on, using the same ordering at each position. So the increasing order (or weights) of characters is like a one-dimensional vector, applied repeatedly wherever needed when collating words. The principle is the same for Tamil too, but there are some differences.
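The ordering described above can be sketched as a sort key. This is only an illustration of the requested behaviour, not the glibc collation rules and not the attached patch. The name `tamil_sort_key` is hypothetical, the letter order (12 vowels, then aytham, then the 18 traditional consonants) is an assumption, and Grantha letters are left out. Each word is split into letters, where a consonant plus pulli is one letter ranked immediately before the bare consonant, and the resulting tuples are compared position by position as in the English analogy. Normalizing to NFC first is a choice made for this sketch so that composed and decomposed vowel signs get the same rank.

```python
import unicodedata

# Assumed letter order for this sketch (Grantha letters are not covered).
# Vowels: அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ
VOWELS = "\u0b85\u0b86\u0b87\u0b88\u0b89\u0b8a\u0b8e\u0b8f\u0b90\u0b92\u0b93\u0b94"
AYTHAM = "\u0b83"  # ஃ
# Consonants: க ங ச ஞ ட ண த ந ப ம ய ர ல வ ழ ள ற ன
CONSONANTS = ("\u0b95\u0b99\u0b9a\u0b9e\u0b9f\u0ba3\u0ba4\u0ba8\u0baa"
              "\u0bae\u0baf\u0bb0\u0bb2\u0bb5\u0bb4\u0bb3\u0bb1\u0ba9")
# The 11 vowel signs from U+0BBE to U+0BCC, in ascending order.
VOWEL_SIGNS = "\u0bbe\u0bbf\u0bc0\u0bc1\u0bc2\u0bc6\u0bc7\u0bc8\u0bca\u0bcb\u0bcc"
PULLI = "\u0bcd"

SLOTS = 2 + len(VOWEL_SIGNS)      # pulli form, bare form, one per vowel sign
CONSONANT_BASE = len(VOWELS) + 1  # consonant ranks start after vowels and aytham
UNKNOWN_BASE = CONSONANT_BASE + len(CONSONANTS) * SLOTS


def tamil_sort_key(word):
    """Return a tuple of per-letter ranks; tuples compare letter by letter."""
    text = unicodedata.normalize("NFC", word)
    key = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in VOWELS:
            key.append(VOWELS.index(ch))
            i += 1
        elif ch == AYTHAM:
            key.append(len(VOWELS))
            i += 1
        elif ch in CONSONANTS:
            base = CONSONANT_BASE + CONSONANTS.index(ch) * SLOTS
            if nxt == PULLI:
                key.append(base)          # consonant + pulli sorts first
                i += 2
            elif nxt and nxt in VOWEL_SIGNS:
                key.append(base + 2 + VOWEL_SIGNS.index(nxt))
                i += 2
            else:
                key.append(base + 1)      # bare consonant (inherent vowel)
                i += 1
        else:
            # Anything not covered by the tables sorts last, by code point.
            key.append(UNKNOWN_BASE + ord(ch))
            i += 1
    return tuple(key)


if __name__ == "__main__":
    words = ["அ", "அக்", "அகம்", "அக்கம்", "அக்கு", "அகால"]
    print(sorted(words, key=tamil_sort_key))
```

Because a tuple that is a prefix of another compares as smaller, க் sorts before க்க, and both sort before க, which is the placement asked for above.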
In English, each character is represented in Unicode by a single code point. In the case of Tamil, I presume the following differences could make conceptualising the collation algorithm somewhat more complex:

i) Canonical decompositions: for each of ஔ (0B94), ொ (0BCA), ோ (0BCB), ௌ (0BCC), the NFC and NFD forms should compare equal.
ii) Grantha ligatures SHRII, KSHA, and KSHA+Virama: each of these is represented not by a single code point but by a sequence of 3 or 4 code points.
iii) The 11 vowel modifiers from ா (0BBE) to ௌ (0BCC) can be placed as singletons in ascending order (with provision for the canonical decompositions of the last 3, 0BCA, 0BCB and 0BCC, being made equal to their composed forms). The Pulli (i.e., Virama, 0BCD), however, has to be placed along with each base consonant, immediately preceding the corresponding base consonant: i.e., க் (0B95 0BCD) precedes க (0B95), and similarly for each base consonant.

I hope my observations so far are clear and correct; if not, I would like to know where I am wrong. I will write more after seeing responses.

K. Sethu (11:30 PM, Colombo, Sri Lanka)

No problem, things happen. It would be nice if you could attach the standard sorting list you expect, along with some reference for it. Maybe we can fix that in the next cycle.

This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle. Changing version to '13'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Can you attach the sort order you are expecting? As I understand it, you want the same sort order we presently follow for Malayalam, i.e. cons + virama should get sorted before cons:
cons+virama
cons
Please check the present Malayalam sorting as well.

Santhosh, can you provide a patch for this? It looks the same as what you did for Malayalam. Let me know.

Pravin, I think we need a consensus from the Tamil speakers here. I have explained my patch here: http://thottingal.in/blog/2011/02/26/tamil-collation-in-glibc/ If we get a clear definition, I would be happy to prepare the patch.

Thanks Santhosh!! Yeah, getting some more comments from the community is better before preparing the patch.

K. Sethu, are you there?

Pravin,
Yes, I am alive ;>) I just posted some comments on standards to Santhosh's blog (awaiting his moderation), but I will soon write my opinions on the issues arising from the rules designed to align with the phonetic (or is it phonemic) decomposition of consonant+vowel syllables. Till then,
Regards
K. Sethu

This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.

This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.

Hi K. Sethu, any update on this? Sorting is a bit tricky, so getting consensus from everyone might be difficult. At least we should stick to one standard.

Without some kind of agreement on the sorting rules we can't move this issue forward. I'm marking this as NEEDINFO until we have that information.

No response from K. Sethu in over a year. Closing with insufficient data. This can be reopened and addressed in the upstream sources if K. Sethu or someone else can provide us with the right rules for collation in this locale.
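For anyone revisiting the canonical decompositions listed in point i) of K. Sethu's comment, they can be inspected with Python's standard `unicodedata` module. This is a small sketch for checking the Unicode data only; `canonically_equal` is a hypothetical helper name, and nothing here reflects how glibc itself treats these sequences.

```python
import unicodedata

# Composed code points named in point i): ஔ and the three two-part vowel signs.
COMPOSED = ["\u0b94", "\u0bca", "\u0bcb", "\u0bcc"]


def canonically_equal(a, b):
    """True if a and b are the same text after canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    for ch in COMPOSED:
        nfd = unicodedata.normalize("NFD", ch)
        parts = " ".join("U+%04X" % ord(c) for c in nfd)
        print("U+%04X decomposes to %s" % (ord(ch), parts))
```

A collation definition that is expected to treat both spellings alike needs either rules for both forms or normalization of the input before comparison.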