Description of problem:
ta_IN collation rules do not work when another locale, say en_US.UTF-8, is selected.

Version-Release number of selected component (if applicable):
glibc-common-2.10.90-7.1

How reproducible:
Every time

Steps to Reproduce:
1. Select the en_US locale
2. Try to sort Tamil characters

Actual results:
Sorting does not work.

Expected results:
Tamil text should be sorted as per the collation rules.

Additional info:
This happens because the ta_IN collation rules are written in the ta_IN locale file and are not available to other locales. We should move this collation table to iso14651_t1_common (common to most locales), so that it is available to other locales as well and Tamil sorting works whichever locale is selected.
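One way to check the behaviour described in this report from Python, as a minimal sketch: the helper `sort_with_locale` below is hypothetical (it is not part of glibc or of any patch in this bug) and uses only the standard `locale` module. It sorts a word list under the LC_COLLATE rules of a named locale and returns None when that locale is not installed. Which locales exist, and what order they produce for Tamil input, depends on the system under test.

```python
import locale


def sort_with_locale(words, locale_name):
    """Sort words with the LC_COLLATE rules of locale_name.

    Hypothetical helper for illustration; returns None when the
    locale is not installed on this system.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm turns each string into a key that compares
        # according to the active collation rules.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


if __name__ == "__main__":
    sample = ["க", "க்", "கா", "ங"]
    for name in ("en_US.UTF-8", "ta_IN"):
        print(name, sort_with_locale(sample, name))
```

Comparing the two printed lists shows whether the Tamil ordering is the same under both locales on a given glibc build.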
Some discussion has already happened on mail. Thanks to Santhosh for preparing a patch to move the ta_IN collation rules to the iso14651_t1_common file. I corrected the syntax errors as well and tested it with some test data. Please test it with some more Tamil data so we can make corrections before moving it to glibc.

Steps for testing:
1) Download the Tamil_Collation folder from http://pravins.fedorapeople.org/Tamil_Collation
2) cd Tamil_Collation
3) Append some more rules to the test file
4) LOCPATH=$PWD LC_ALL=ta_TT sort ../test > test_result
5) gedit test_result

Please add the test file as well as the result in bugzilla. If it is correct, I will proceed with the patch.
It would be nice if a Tamil speaker could test this patch once, so we can proceed. It has been in bugzilla for two months now.
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle. Changing version to '12'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Created attachment 380616 [details]
sorting i am getting with my present patch

Had a discussion with Felix; we need an order like
க் க கா கி
Working on the same.
Created attachment 385655 [details]
This is present sorting order i am getting with my patch

Please check whether this is the proper order. If yes, I will prepare the patch for glibc; otherwise we can correct it.
This is the correct order. You can go ahead.
Created attachment 385662 [details]
patch for enhancing tamil collation

Thanks Felix for the confirmation. The above patch moves the Tamil collation rules from ta_IN to the iso_common file. It also has some enhancements compared to the previous collation rules:
* Pure consonants are handled properly
* The collation is now in the iso_common file, so one can see sorted Tamil characters and menus in any locale
Applied upstream.
Pravin

Excuse me for participating rather late. My apologies.

In the attachment to your comment above (Comment #5), I see the following:

[..] ஃ க் க கடந்த கா காலையில் கி கீ கு கூ கெ கே கேஎஸ் கேஎஸ்ரவிக்குமார் கை கொ கோ கௌ க்க க்கௗ க்ங ங் ங [..]

To be correct, க்க, க்கள, க்ங in the above should be between க் and க, not between கௌ and ங் as shown.

For example, if we sort { அ, அக், அகம், அக்கம், அக்கு, அகால } in ascending order, the correct result should be { அ, அக், அக்கம், அக்கு, அகம், அகால }. (அக் is of course a meaningless word, but that has no bearing on the sorting order.) The sorting order from your attachment suggests it will sort to { அ, அக், அகம், அகால, அக்கம், அக்கு }, which is wrong.

Compare with the English alphabet in ascending order: a, b, c, d, e, f, ..., x, y, z. Now ask where aa, ab, ac, ad, ..., az should be. Obviously they are between a and b, like a, aa, ab, ac, ad, ..., ax, ay, az, b, c, d, e, ... Right? Similarly, க் followed by any letter should be somewhere between க் and க. That is, if க் is followed by a character, the sort order is as follows:

க் < க்அ < க்ஆ < க்இ < க்ஈ < ... < க்ஒ < க்ஓ < க்ஔ < க்ஃ < க்க் < க்க < க்கா < க்கி < ... < க்கொ < க்கோ < க்கௌ < க்ங் < க்ங < க்ஙா ...

(Wherever I write "..." it means "and so on".)

In English sorting of words, if two words have the same first letter then they are next to each other and their relative placement is determined by the second letter in each; if those are also the same, then by the third letter, and so on ad nauseam. Essentially, sort first on the first letter, then wherever necessary on the second letter, and so on, with the same ordering used at each position. So the increasing order (or weights) of characters is like a one-dimensional vector, applied repeatedly wherever needed when collating words. The principle is the same for Tamil too, but there are some differences.

In the case of English in Unicode, each character is represented by a single code point. In the case of Tamil, I presume the following differences could make conceptualising the collation algorithm somewhat more complex:

i) Canonical decompositions: for each of ஔ (0B94), ொ (0BCA), ோ (0BCB), ௌ (0BCC), the NFC and NFD forms should compare equal.
ii) Grantha ligatures SHRII and KSHA, as well as KSHA+Virama: each of these is represented not by a single code point but by a sequence of 3 or 4 code points.
iii) The 11 vowel modifiers from ா (0BBE) to ௌ (0BCC) can be placed as singletons in ascending order (with provision for the canonical decompositions of the last 3, 0BCA, 0BCB and 0BCC, being made equal to their composed forms). The Pulli (i.e., Virama, 0BCD), however, has to be placed along with each base consonant, immediately preceding the corresponding base consonant: i.e., க் (0B95 0BCD) preceding க (0B95), and similarly for each base consonant.

I hope my observations so far are clear and correct; if not, I would like to know where I am wrong. I will write more after seeing responses.

K. Sethu (11:30 PM, Colombo, Sri Lanka)
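The canonical-decomposition point (i) above can be checked with Python's standard `unicodedata` module. This is only a sketch of the equivalence itself, not of how glibc's collation tables handle it. The decompositions listed are those given in the Unicode Character Database, and `canonically_equal` is a hypothetical helper name.

```python
import unicodedata

# Precomposed code points named in the comment above, with their
# canonical decompositions as given in the Unicode Character Database.
DECOMPOSITIONS = {
    "\u0b94": "\u0b92\u0bd7",  # TAMIL LETTER AU
    "\u0bca": "\u0bc6\u0bbe",  # TAMIL VOWEL SIGN O
    "\u0bcb": "\u0bc7\u0bbe",  # TAMIL VOWEL SIGN OO
    "\u0bcc": "\u0bc6\u0bd7",  # TAMIL VOWEL SIGN AU
}


def canonically_equal(a, b):
    """True when a and b are the same text after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


for composed, decomposed in DECOMPOSITIONS.items():
    # NFD splits the precomposed form; NFC puts it back together.
    assert unicodedata.normalize("NFD", composed) == decomposed
    assert canonically_equal(composed, decomposed)

# KA + VOWEL SIGN O versus KA + VOWEL SIGN E + VOWEL SIGN AA:
# different code point sequences, same text.
print(canonically_equal("\u0b95\u0bca", "\u0b95\u0bc6\u0bbe"))
```

A collation table that gives both spellings the same weights, or a sort that normalizes its input first, would satisfy the requirement that the NFC and NFD forms compare equal.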
No problem, these things happen. It would be nice if you could attach the standard sorting list as per your expectation, and some reference for it as well. Maybe we can fix that in the next cycle.
This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle. Changing version to '13'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Can you attach the sort order you are expecting? I understand that you want the same sort order we presently follow for Malayalam, i.e. cons + virama should get sorted before cons:
cons+virama
cons
Please check the present Malayalam sorting as well.
Santhosh, can you provide a patch for this? It looks the same as what you did for Malayalam. Let me know.
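The ordering under discussion (consonant + virama sorted before the bare consonant) can be sketched in Python as a sort key. This is only an illustration under simplifying assumptions, not the rule set in glibc or in any attached patch: `tamil_sort_key` is a hypothetical name, everything other than the virama pairing falls back to plain code point order, and the placement of ஃ and the Grantha ligatures is not handled.

```python
import unicodedata

VIRAMA = "\u0bcd"  # TAMIL SIGN VIRAMA (pulli)


def tamil_sort_key(word):
    """Illustrative key: a consonant carrying a virama sorts just
    before the same consonant without one; everything else falls
    back to code point order. Not the glibc rule set."""
    text = unicodedata.normalize("NFC", word)
    key = []
    i = 0
    while i < len(text):
        ch = text[i]
        if i + 1 < len(text) and text[i + 1] == VIRAMA:
            key.append((ord(ch), 0))  # pure consonant, e.g. க்
            i += 2
        else:
            key.append((ord(ch), 1))  # bare consonant, vowel, or vowel sign
            i += 1
    return key


# Example word list given earlier in this thread.
words = ["அ", "அக்", "அகம்", "அக்கம்", "அக்கு", "அகால"]
print(sorted(words, key=tamil_sort_key))
```

With this key, அக்கம் and அக்கு sort ahead of அகம் and அகால, which is the result requested in the earlier comment for that word list.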
Pravin, I think we need a consensus from the Tamil speakers here. I have explained my patch here: http://thottingal.in/blog/2011/02/26/tamil-collation-in-glibc/ If we get a clear definition, I would be happy to prepare the patch.
Thanks Santhosh!! Yeah, taking some more comments from the community is better before preparing the patch. K. Sethu, are you there?
Pravin

Yes, I am alive ;>) I just posted some comments on standards to Santhosh's blog (awaiting his moderation), but I will soon write my opinions on the issues arising from the rules designed to align with the phonetic (or is it phonemic) decomposition of consonant+vowel syllables. Till then,

Regards
K. Sethu
This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.
Hi K. Sethu, any update on this? Sorting is a bit of a tricky thing, so getting consensus from everyone might be difficult. At least we should stick to one standard.
Without some kind of agreement on the sorting rules we can't move this issue forward. I'm marking this as NEEDINFO until we have that information.
No response from K. Sethu in over a year. Closing with insufficient data. This can be reopened and addressed in the upstream sources if K. Sethu or someone else can provide us with the right rules for collation in this locale.