460806 – [bn_IN] Bengali collation/sorting order is not defined in glibc

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	10
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Pravin Satpute
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-09-01 10:35 UTC by Pravin Satpute
Modified:	2009-05-05 07:29 UTC
CC List:	9 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-05-05 07:29:47 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
It will add bengali collation tables in iso14651_t1_common file (7.86 KB, patch) 2009-04-23 06:17 UTC, Pravin Satpute

Description Pravin Satpute 2008-09-01 10:35:38 UTC

Description of problem:
Bengali sort order is not defined in /usr/share/i18n/locales/iso14651_t1_common.


Version-Release number of selected component (if applicable):
glibc-common-2.8.90-11

How reproducible:
every time

Steps to Reproduce:
1. Check the file /usr/share/i18n/locales/iso14651_t1_common

Actual results:
The file does not define a Bengali sort order.

Expected results:
Bengali sort order should be there


Additional info:

Comment 1 Tony Fu 2008-09-10 03:07:05 UTC

requested by Jens Petersen (#27995)

Comment 2 Ulrich Drepper 2008-09-15 22:52:20 UTC

This bug should not be assigned to Jakub.  Is there an email address for the i18n team?

Comment 3 Pravin Satpute 2008-09-16 06:36:53 UTC

The i18n team follows the fedora-i18n-bugs list.

Comment 4 Pravin Satpute 2008-10-10 05:57:08 UTC

Runa, Sayamindu,
as per the discussion we had on the mailing list, we can go with the collation sequence
used by

a) Sansad
b) Anandabazar

Can you provide me with these sequences,
so I can start working on it?

Comment 5 Bug Zapper 2008-11-26 02:55:07 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 10 development cycle.
Changing version to '10'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 6 Runa Bhattacharjee 2009-03-06 10:15:56 UTC

Hello,

For the bn-IN collation order, please consider the sequence as listed under the column "Glibc" in this page:

http://ankur.org.in/wiki/CollationSequence

Thanks
Runa

Comment 7 Pravin Satpute 2009-03-13 06:04:43 UTC

I have written collation tables for Bengali,

available at http://pravins.fedorapeople.org/bn_IN.tar.gz

Steps for quick testing:

1) Untar the file
2) In a terminal, cd bn_IN
3) Run the command
LOCPATH=$PWD LC_ALL=bn_TT sort test_file > sorted_output

test_file -> unsorted data
sorted_output -> sorted output

4) I have added some entries to test_file for testing;
one can edit it and add more words for testing
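
The quick test above can also be driven from a script. A minimal Python sketch, assuming a POSIX `sort` binary is on the PATH; the helper name `sort_with_locale` is hypothetical, while `bn_TT` and the `LOCPATH` directory come from step 3:

```python
import os
import subprocess

def sort_with_locale(lines, locale_name, locpath=None):
    """Sort lines with the system `sort` under the given LC_ALL (and LOCPATH)."""
    env = dict(os.environ, LC_ALL=locale_name)
    if locpath is not None:
        # Point glibc at a directory holding the compiled test locale.
        env["LOCPATH"] = locpath
    result = subprocess.run(
        ["sort"],
        input="\n".join(lines) + "\n",
        env=env,
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout.splitlines()
```

Calling `sort_with_locale(words, "bn_TT", locpath=os.getcwd())` from inside the unpacked bn_IN directory would mirror step 3. If the named locale cannot be loaded, `sort` is likely to fall back to plain byte order, so an unexpected result may only mean the locale was not found.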

Note: the following characters are not listed at http://ankur.org.in/wiki/CollationSequence

I have added them to the collation table as well; please check those too.

U+09E0
U+098c
U+09E1
U+09D7 
U+09E1
U+09BD
U+09C3
U+09C4

Please let me know your comments.
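
A list of code points like the one above can be sanity-checked mechanically. The following Python sketch, using only the standard `unicodedata` module, reports repeated entries and looks up character names; the helper names `parse`, `duplicates` and `describe` are made up for illustration:

```python
import unicodedata
from collections import Counter

# Code points exactly as listed in the comment above, repeat included.
LISTED = ["U+09E0", "U+098c", "U+09E1", "U+09D7",
          "U+09E1", "U+09BD", "U+09C3", "U+09C4"]

def parse(code):
    """Turn a string such as 'U+09E0' into the integer 0x09E0."""
    return int(code[2:], 16)

def duplicates(codes):
    """Return code points that occur more than once, in upper-case U+ form."""
    counts = Counter(parse(c) for c in codes)
    return [f"U+{cp:04X}" for cp, n in counts.items() if n > 1]

def describe(codes):
    """Map each distinct code point to its Unicode character name."""
    distinct = sorted({parse(c) for c in codes})
    return {f"U+{cp:04X}": unicodedata.name(chr(cp)) for cp in distinct}
```

`duplicates(LISTED)` flags the repeated entry, and `describe(LISTED)` gives a name for each remaining character, which makes the list easier to compare against a collation chart.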

Comment 8 Runa Bhattacharjee 2009-04-09 07:25:48 UTC

Hello,

Thanks Pravin for getting this in place. A few notes follow here:

1. The collation sequence currently sorts as:

[...] ও ঔ ক [...] হ া ি [...] ো ৌ ং ঃ ঁ [...] 

Comment: ং, ঃ and ঁ ought to sort as [...] ঔ  ং  ঃ  ঁ ক [...] and not after  ৌ

Ref: http://ankur.org.in/wiki/CollationSequence

Can you please check if we are missing something here?

2. The additional characters added:

U+09E0
U+098c
U+09E1
U+09D7 
U+09E1 (duplicate reference)
U+09BD
U+09C3 (already present in our sequence)
U+09C4

The one missing from the list is: U+09BC

These additional characters have been added, in the order your patch currently sorts them, under the column titled "Glibc (Proposed)" on the page:
http://ankur.org.in/wiki/CollationSequence

Thanks a lot.
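
The ordering requested in point 1 can be illustrated with a toy sort key. This is a single-level Python sketch, not how glibc computes collation weights; the abbreviated `ORDER` list and the name `collation_key` are assumptions for illustration, and only the relative positions are taken from the comment above:

```python
# Abbreviated, illustrative order: the three signs sit after AU and before KA.
ORDER = [
    "\u0993",  # ও  letter O
    "\u0994",  # ঔ  letter AU
    "\u0982",  # ং  sign anusvara
    "\u0983",  # ঃ  sign visarga
    "\u0981",  # ঁ  sign candrabindu
    "\u0995",  # ক  letter KA
    "\u09B9",  # হ  letter HA
    "\u09BE",  # া  vowel sign AA
    "\u09BF",  # ি  vowel sign I
    "\u09CB",  # ো  vowel sign O
    "\u09CC",  # ৌ  vowel sign AU
]
RANK = {ch: i for i, ch in enumerate(ORDER)}

def collation_key(word):
    """Rank characters by ORDER; unlisted characters sort afterwards by code point."""
    return [RANK.get(ch, len(ORDER) + ord(ch)) for ch in word]

sample = ["\u0995", "\u0982", "\u09CC", "\u0994"]
by_code_point = sorted(sample)                          # anusvara comes first
by_requested_order = sorted(sample, key=collation_key)  # AU, anusvara, KA, vowel sign AU
```

Plain code point order puts ং before ঔ because U+0982 is lower than U+0994, which is why an explicit table is needed at all.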

Comment 9 Jamil Ahmed 2009-04-13 20:54:57 UTC

We are also working on this issue. You can download the *draft* locale files (for bn_BD) from [0]. There are two files; one for normal sequence (the way we were taught in our childhood :-) and the second one for dictionary sequence [1].

As the Bengali (bn) Collation sequence will be common for both Bangladesh (bn_BD) and India (bn_IN), we will work together.

[0] http://www.ankur.org.bd/downloads/glibc_collation/
[1] http://www.ankur.org.bd/wiki/Documentation#About_Bangla_Characters

Comment 10 Pravin Satpute 2009-04-14 11:43:42 UTC

(In reply to comment #9)
> We are also working on this issue. You can download the *draft* locale files
> (for bn_BD) from [0]. There are two files; one for normal sequence (the way we
> were taught in our childhood :-) and the second one for dictionary sequence [1].

but i think we can implement only one in glibc.
 
> As the Bengali (bn) Collation sequence will be common for both Bangladesh
> (bn_BD) and India (bn_IN), we will work together.

That's nice.
Now I think we can do more accurate work on Bengali collation.

It would be good if you had a look at /usr/share/i18n/locales/iso14651_t1_common; this is the common file for collation data and is included in the LC_COLLATE section of almost all locale files.
We should add the Bengali collation data to this file as well.
The advantage of keeping collation data in this file is that it will be available to the OS even if one selects another locale.
I saw your file, nice work, but it would be nice if you followed the iso14651_t1_common file structure.

Check my earlier comments; I have done some work for Bengali and it is in its final stage. It would be nice if you could check it once and suggest any modifications if required.

It would be nice if the Bengali collation orders for India and Bangladesh were 100% the same, but even if there are differences it will not create any problem, as both have separate locale files and we can put the differences in the respective locale file, so they will not affect one another.
(For example, the Devanagari script is shared by the Marathi, Hindi and Sanskrit languages; their basic sorting is in the common file and some miscellaneous sorting rules are in the respective locale files. Check the mr_IN reorder.)
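
How a locale file pulls in the common table can be seen from its LC_COLLATE section, which typically contains a `copy` directive. A rough Python sketch that lists those directives; the `SAMPLE` text is a hypothetical, stripped-down locale source rather than the real bn_IN file, and `collate_copies` is an illustrative helper name:

```python
import re

def collate_copies(locale_source):
    """Return the names copied inside the LC_COLLATE section of a locale source."""
    in_collate = False
    copies = []
    for line in locale_source.splitlines():
        stripped = line.strip()
        if stripped == "LC_COLLATE":
            in_collate = True
        elif stripped == "END LC_COLLATE":
            in_collate = False
        elif in_collate:
            match = re.match(r'copy\s+"([^"]+)"', stripped)
            if match:
                copies.append(match.group(1))
    return copies

# Hypothetical minimal locale source, for illustration only.
SAMPLE = '''\
LC_CTYPE
copy "i18n"
END LC_CTYPE

LC_COLLATE
copy "iso14651_t1_common"
END LC_COLLATE
'''
```

Running `collate_copies` over the text of a file under /usr/share/i18n/locales should show whether that locale relies on the shared table or defines its own collation.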

Comment 11 Ulrich Drepper 2009-04-14 14:09:07 UTC

(In reply to comment #10)
> but i think we can implement only one in glibc.

Yes.  And even if it astonished people at first, we implement the correct dictionary/library sorting in Western locales.

Comment 12 Pravin Satpute 2009-04-15 08:49:19 UTC

Aha,
thanks for confirming.

I have updated the collation table as per comment #8.
Please test it once; see comment #7 for testing.
Let me know if any modification is required; I will prepare the patch then.

Comment 13 Jamil Ahmed 2009-04-15 09:22:34 UTC

Pravin, we will test and let you know soon.

Comment 14 Md. Rezwan Shahid 2009-04-16 10:10:42 UTC

Hello Pravin,
I have tested your collation rules for Bengali and it seems to be working great. I think you can prepare the patch for it now :) .

Comment 15 Runa Bhattacharjee 2009-04-22 11:16:43 UTC

The result of the tests seem to be in order with the bn-IN proposal. We are ok to go ahead with this sequence in both:

/usr/share/i18n/locales/iso14651_t1_common, and
bn_IN locale file

Pravin, your current tarball contains only one bn file. Since we have two separate locale files for bn-IN and bn-BD would we require further testing with the actual locale files?

regards
Runa

Comment 16 Pravin Satpute 2009-04-23 06:15:48 UTC

(In reply to comment #15)
> The result of the tests seem to be in order with the bn-IN proposal. We are ok
> to go ahead with this sequence in both:
> 
> /usr/share/i18n/locales/iso14651_t1_common, and
> bn_IN locale file

Thanks to both of you for testing and your comments

> 
> Pravin, your current tarball contains only one bn file. Since we have two
> separate locale files for bn-IN and bn-BD would we require further testing with
> the actual locale files?

No, that is not required, as we are not changing anything in the locale file itself, so the results will be the same for both locales.

I am just removing the '-% TODO: Bengali sorting should be added' line from the bn_BD locale, as sorting will be there now :)

Attaching the patch.

Ulrich, it would be nice if you could look at it once.

Comment 17 Pravin Satpute 2009-04-23 06:17:33 UTC

Created attachment 340873 [details]
It will add bengali collation tables in iso14651_t1_common file

Comment 18 Pravin Satpute 2009-05-05 07:29:47 UTC

The patch is now available in glibc upstream;
Bengali collation will be available with the next upstream release.

Thanks to all for the information and help.
