Bug 514110 - [RFE] [ta_IN] Tamil collation rules are not working in other locales
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: rawhide
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Jeff Law
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-07-28 03:49 UTC by Pravin Satpute
Modified: 2016-11-24 16:12 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-10-17 17:17:26 UTC
Type: ---
Embargoed:


Attachments:
sorting i am getting with my present patch (1.60 KB, text/plain)
2009-12-28 11:11 UTC, Pravin Satpute
This is present sorting order i am getting with my patch (4.46 KB, text/plain)
2010-01-20 11:16 UTC, Pravin Satpute
patch for enhancing tamil collation (11.33 KB, patch)
2010-01-20 12:21 UTC, Pravin Satpute

Description Pravin Satpute 2009-07-28 03:49:05 UTC
Description of problem:
The ta_IN collation rules do not take effect when another locale, say en_US.UTF-8, is selected.

Version-Release number of selected component (if applicable):
glibc-common-2.10.90-7.1

How reproducible:
every time

Steps to Reproduce:
1. Select the en_US locale.
2. Try to sort Tamil characters.
  
Actual results:
Tamil text is not sorted according to the Tamil collation rules.

Expected results:
It should be sorted as per the collation rules.

Additional info:
This happens because the ta_IN collation rules are written in the ta_IN locale file, so they are not available to other locales.

We should move this collation table to iso14651_t1_common (which is common to most locales), so that it is available to other locales as well and Tamil sorting works whichever locale is selected.
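
As a rough illustration of why the selected locale matters, the following Python sketch sorts a list using the LC_COLLATE rules of a named locale. The helper name `sort_in_locale` is hypothetical, and it assumes the requested locale is installed on the machine (otherwise `locale.Error` is raised); it makes no claim about the order any particular glibc version produces.

```python
import locale

def sort_in_locale(words, locale_name):
    """Sort words with the collation rules of locale_name (hypothetical helper)."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        # Raises locale.Error if the locale is not installed.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm turns each string into a key that compares per the locale rules.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the previous setting
```

Calling `sort_in_locale(words, "ta_IN.UTF-8")` and `sort_in_locale(words, "en_US.UTF-8")` on the same Tamil word list would show whether the two locales agree.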

Comment 1 Pravin Satpute 2009-08-21 10:26:29 UTC
Some discussion has already happened by mail. Thanks to Santhosh for preparing a patch to move the ta_IN collation rules to the iso14651_t1_common file.

I corrected the syntax errors as well and tested it with some test data.

Please test it with some more Tamil data so we can make corrections before submitting it to glibc.

Steps for testing:
1) Download the Tamil_Collation folder from http://pravins.fedorapeople.org/Tamil_Collation

2) cd Tamil_Collation

3) Append some more rules to the test file

4) LOCPATH=$PWD LC_ALL=ta_TT sort ../test > test_result

5) gedit test_result

Please attach the test file as well as the result to this bug.

If it is correct, I will proceed with the patch.
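
For anyone who prefers to script step 4, here is a minimal Python sketch that runs the system `sort` with LC_ALL and, optionally, LOCPATH set, as in the command above. The function name `sort_with_locale` is hypothetical; the sketch assumes a `sort` binary is on the PATH and that the compiled locale named by `lc_all` can be found, either system-wide or under `locpath`.

```python
import os
import subprocess

def sort_with_locale(lines, lc_all, locpath=None):
    """Sort lines with the system `sort` under the given LC_ALL / LOCPATH."""
    env = dict(os.environ, LC_ALL=lc_all)
    if locpath is not None:
        env["LOCPATH"] = locpath  # directory holding the compiled test locale
    text = "".join(line + "\n" for line in lines)
    result = subprocess.run(
        ["sort"],
        input=text.encode("utf-8"),
        env=env,
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if sort fails
    )
    return result.stdout.decode("utf-8").splitlines()
```

With the test locale built in the current directory, `sort_with_locale(words, "ta_TT", locpath=os.getcwd())` would correspond to step 4.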

Comment 2 Pravin Satpute 2009-10-24 06:10:27 UTC
It would be nice if a Tamil speaker could test this patch once, so we can proceed.

It has been in Bugzilla for two months now.

Comment 3 Bug Zapper 2009-11-16 11:08:29 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 4 Pravin Satpute 2009-12-28 11:11:23 UTC
Created attachment 380616 [details]
sorting i am getting with my present patch 

I had a discussion with Felix; we need an order like:
க்
க
கா
கி

I am working on this.
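
To see why explicit collation rules are needed for this order, here is a small Python sketch of what a plain code point comparison does with the same four strings. It only illustrates code point ordering and is not a description of glibc's behaviour.

```python
# The order requested in this comment: consonant + pulli first.
words = ["க்", "க", "கா", "கி"]

# Python compares str values code point by code point, so the pulli
# (U+0BCD) sorts after every vowel sign (U+0BBE onwards), and the bare
# consonant, being a prefix of the others, sorts first.
by_code_point = sorted(words)
print(by_code_point)
```

The pulli form ends up last instead of first, so the requested order cannot come from code point values alone.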

Comment 5 Pravin Satpute 2010-01-20 11:16:48 UTC
Created attachment 385655 [details]
This is present sorting order i am getting with my patch

Please check whether this is the proper order.
If yes, I will prepare a patch for glibc; otherwise we can correct it.

Comment 6 Felix 2010-01-20 11:50:06 UTC
This is the correct order. You can go ahead.

Comment 7 Pravin Satpute 2010-01-20 12:21:32 UTC
Created attachment 385662 [details]
patch for enhancing tamil collation

Thanks, Felix, for the confirmation.
The above patch moves the Tamil collation rules from ta_IN to the iso14651_t1_common file.

It also has some enhancements compared to the previous collation rules:
* pure consonants are handled properly
* the collation rules are now in the iso14651_t1_common file, so Tamil characters and menus are sorted in any locale

Comment 8 Ulrich Drepper 2010-02-03 11:32:30 UTC
Applied upstream.

Comment 9 K. Sethu 2010-02-04 18:07:34 UTC
Pravin

Excuse me for participating rather late. My apologies.

In the attachment to your Comment 5 above, I see the following:

[..]
ஃ
க்
க
கடந்த
கா
காலையில்
கி
கீ
கு
கூ
கெ
கே
கேஎஸ்
கேஎஸ்ரவிக்குமார்
கை
கொ
கோ
கௌ
க்க
க்கௗ
க்ங
ங்
ங
[..]

To be correct, க்க, க்கௗ and க்ங in the above should come between க் and க, not between கௌ and ங் as shown above.

For example, if we sort { அ, அக், அகம், அக்கம், அக்கு, அகால } in ascending order, the correct result should be { அ, அக், அக்கம், அக்கு, அகம், அகால }.

(அக் in the above is of course a meaningless word, but that has no bearing on the sorting order.)

The sorting order in your attachment suggests it will instead sort to { அ, அக், அகம், அகால, அக்கம், அக்கு }, which is wrong.

Compare with the English alphabet in ascending order:

a,b,c,d,e,f,.....x,y,z

Now ask where aa, ab, ac, ad, ...., az should be. Obviously they are between
a and b - like a,aa,ab,ac,ad,...,ax,ay,az,b,c,d,e... Right?


Similarly, க் followed by any letter should fall somewhere between க் and க; that is, if க் is followed by a character, the sort order is as follows:

க் < க்அ < க்ஆ < க்இ < க்ஈ < ..........க்ஒ < க்ஓ < க்ஔ < க்ஃ < க்க் < க்க < க்கா < க்கி ............< க்கொ < க்கோ < க்கௌ < க்ங் < க்ங < க்ஙா ........

(Wherever I indicate ....... they are to mean "and so on")
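
One way to picture the order described above is as a per-syllable sort key, sketched here in Python. This is only an illustration under stated assumptions, not the glibc rules or the patch from this bug: the name `tamil_sort_key` is hypothetical, base letters are ranked by Unicode code point rather than by the traditional alphabetical order, and Grantha conjuncts are not treated specially.

```python
import unicodedata

PULLI = "\u0bcd"   # Tamil virama
AYTHAM = "\u0b83"  # ஃ, ranked after the vowels and before the consonants

def _is_consonant(ch):
    return "\u0b95" <= ch <= "\u0bb9"

def _is_vowel_sign(ch):
    return "\u0bbe" <= ch <= "\u0bcc" or ch == "\u0bd7"

def tamil_sort_key(word):
    """Hypothetical key: one tuple per syllable, pulli ranked lowest."""
    text = unicodedata.normalize("NFC", word)  # compose split vowel signs
    key = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if _is_consonant(ch):
            if nxt == PULLI:
                vowel, step = 0, 2         # pure consonant sorts first
            elif _is_vowel_sign(nxt):
                vowel, step = ord(nxt), 2  # explicit vowel sign
            else:
                vowel, step = 1, 1         # inherent vowel "a"
            key.append((2, ord(ch), vowel))
        else:
            group = 1 if ch == AYTHAM else 0  # vowels and anything else: 0
            key.append((group, ord(ch), 0))
            step = 1
        i += step
    return key
```

Under these assumptions, `sorted(words, key=tamil_sort_key)` places அக்கம் and அக்கு between அக் and அகம், as in the example earlier in this comment.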


In English sorting of words, if two words have the same first letter then they are next to each other and their relative placement is determined by the second letter of each - if those are also the same then by the 3rd letter, and so on ad nauseam.

Essentially, we sort on the first letter, then wherever necessary on the second letter, and so on, using the same ordering at every position. So the increasing order (or weights) of the characters is like a one-dimensional vector that is applied repeatedly, wherever needed, when collating words.

The principle is the same for Tamil too, but there are some differences. In the case of English, each character is represented in Unicode by a single code point.

But in the case of Tamil, I presume, the following differences could make conceptualising the collation algorithm somewhat more complex:

i) Canonical decompositions - for each of ஔ (0B94), ொ (0BCA), ோ (0BCB) and ௌ (0BCC), the NFC and the NFD forms should compare as equal.

ii) Grantha ligatures SHRII and KSHA, as well as KSHA+Virama - each of these is of course not represented by a single code point but by a sequence of 3 or 4 code points.

iii) While the 11 vowel modifiers from ா (0BBE) to ௌ (0BCC) can be placed as singletons in ascending order (with provision for the canonical decompositions of the last 3, 0BCA, 0BCB and 0BCC, being made equal to their composed forms), the Pulli (i.e., Virama, 0BCD) has to be placed along with each base consonant, immediately preceding the corresponding base consonant; i.e., க் (0B95 0BCD) precedes க (0B95), and similarly for each base consonant.
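
Point (i) can be checked with Python's standard `unicodedata` module. The sketch below prints the canonical decomposition of the four characters listed there and compares strings after normalisation; the helper name `canonically_equal` is hypothetical.

```python
import unicodedata

# The four characters from point (i) that have canonical decompositions.
COMPOSED = ["\u0b94", "\u0bca", "\u0bcb", "\u0bcc"]

def canonically_equal(a, b):
    """True if a and b are the same text once both are put in NFD form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

for ch in COMPOSED:
    nfd = unicodedata.normalize("NFD", ch)
    # Show each composed code point next to its decomposed sequence.
    print("U+%04X ->" % ord(ch), " ".join("U+%04X" % ord(c) for c in nfd))
```

A collation implementation that normalises its input first, or that gives both spellings the same weights, would treat கொ written as 0B95 0BCA and as 0B95 0BC6 0BBE identically, even though the raw strings differ.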

I hope my observations so far are clear and correct; if not, I would like to know where I am wrong. I will write more after seeing responses.

K. Sethu
(11:30 PM , Colombo-Sri Lanka)

Comment 10 Pravin Satpute 2010-02-05 09:09:05 UTC
No problem, these things happen.
It would be nice if you could attach a standard sorted list showing the order you expect, along with some reference for it.

Maybe we can fix this in the next cycle.

Comment 11 Bug Zapper 2010-03-15 12:43:50 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle.
Changing version to '13'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 12 Pravin Satpute 2010-07-22 10:57:17 UTC
Can you attach the sort order you are expecting?
As I understand it, you want the same sort order that we presently follow for Malayalam,

i.e. cons + virama should be sorted before cons:

cons+virama
cons

Please check the present Malayalam sorting as well.

Comment 13 Pravin Satpute 2011-02-24 10:31:22 UTC
Santhosh,

Can you provide a patch for this? It looks similar to what you did for Malayalam.
Let me know.

Comment 14 Santhosh Thottingal 2011-02-26 12:46:49 UTC
Pravin,
I think we need a consensus from the Tamil speakers here. I have explained my patch here: http://thottingal.in/blog/2011/02/26/tamil-collation-in-glibc/ If we get a clear definition, I would be happy to prepare the patch.

Comment 15 Pravin Satpute 2011-02-28 06:59:21 UTC
Thanks Santhosh!!
Yeah, it is better to gather some more comments from the community before preparing the patch.

K. Sethu, you there?

Comment 16 K. Sethu 2011-02-28 09:26:32 UTC
Pravin

Yes I am alive ;>)

Just posted some comments on standards to Santhosh's blog (awaiting his moderation), but I will soon write my opinions on the issues arising from the rules designed to align with the phonetic (or is it phonemic?) decomposition of consonant+vowel syllables.

Till then 
Regards

K. Sethu

Comment 17 Fedora Admin XMLRPC Client 2011-11-14 19:14:45 UTC
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 18 Fedora Admin XMLRPC Client 2011-11-15 04:46:52 UTC
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 19 Pravin Satpute 2012-01-31 08:34:20 UTC
Hi K. Sethu,

Any update on this?
Sorting is a bit tricky, so getting consensus from everyone might be difficult. At least we should stick to one standard.

Comment 20 Jeff Law 2012-02-10 06:11:41 UTC
Without some kind of agreement on the sorting rules we can't move this issue forward.    I'm marking this as NEEDINFO until we have that information.

Comment 21 Jeff Law 2012-10-17 17:17:26 UTC
No response from K. Sethu in over a year.  Closing with insufficient data.  This can be reopened and addressed in the upstream sources if K. Sethu or someone else can provide us with the right rules for collation in this locale.

