466912 – [bn_IN] [bn] [as_IN] iswpunct, iswdigit, iswalpha returns wrong values for Bangla (Bengali) characters

Bug 466912 - [bn_IN] [bn] [as_IN] iswpunct, iswdigit, iswalpha returns wrong values for Bangla (Bengali) characters

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Pravin Satpute
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (8):	473888 473898 474105 474107 474117 474119 474124 474127
Depends On:
Blocks:	F11Alpha, F11AlphaBlocker 480885

Reported:	2008-10-14 14:02 UTC by mandar.mitra
Modified:	2009-02-09 05:25 UTC
CC List:	11 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-02-09 05:25:44 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Short program to demonstrate the bug. (1.14 KB, text/plain) 2008-10-14 14:06 UTC, mandar.mitra
patch created against glibc devel branch (100.90 KB, patch) 2008-12-18 12:47 UTC, Pravin Satpute
Patch created against glibc devel branch (14.69 KB, patch) 2008-12-22 11:55 UTC, Pravin Satpute
Will solve circular dependencies between locale definitions mr_IN, hi_IN, bn_BD, bn_IN and as_IN (3.94 KB, patch) 2009-01-28 07:04 UTC, Pravin Satpute

Description mandar.mitra 2008-10-14 14:02:35 UTC

Description of problem:
iswalpha(), iswdigit(), and iswpunct() return incorrect values for some Bangla (Bengali) characters. Specifically, iswalpha returns false and iswpunct returns true for 0981 (CANDRABINDU), 0982 (ANUSVARA), 0983 (VISARGA), and vowel modifiers (09BE, 09BF, etc.). iswdigit does not return true for the Bangla digits (09E6 through 09EF).


Version-Release number of selected component (if applicable):
glibc-2.7-2


How reproducible:


Steps to Reproduce:
Compile and run the attached program (bn_ctypes.c). Output should be self-explanatory.
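
For readers without access to the attachment, here is a minimal sketch of the kind of check described above. It is not the attached bn_ctypes.c; the function names check_bangla_ctype and report are made up for illustration, and the locale name passed in is assumed to be installed on the system.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <wctype.h>

/* Print how the current LC_CTYPE classifies one code point. */
static void report(wint_t wc, const char *label)
{
    printf("U+%04lX %-12s alpha=%d punct=%d digit=%d\n",
           (unsigned long)wc, label,
           iswalpha(wc) != 0, iswpunct(wc) != 0, iswdigit(wc) != 0);
}

/* Hypothetical helper: select the given LC_CTYPE locale and report the
   classification of the code points named in this bug.
   Returns 0 on success, -1 if the locale cannot be selected. */
int check_bangla_ctype(const char *locale_name)
{
    wint_t wc;

    assert(locale_name != NULL);
    if (setlocale(LC_CTYPE, locale_name) == NULL) {
        fprintf(stderr, "locale \"%s\" is not available\n", locale_name);
        return -1;
    }
    report(0x0981, "CANDRABINDU");
    report(0x0982, "ANUSVARA");
    report(0x0983, "VISARGA");
    report(0x09BE, "vowel sign");
    report(0x09BF, "vowel sign");
    for (wc = 0x09E6; wc <= 0x09EF; wc++)
        report(wc, "digit");
    return 0;
}
```

Calling check_bangla_ctype("bn_IN.UTF-8") from main (assuming that locale name exists on the machine) prints one line per code point, so the alpha/punct/digit columns can be compared before and after a glibc update.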
  

Actual results:


Expected results:


Additional info:

Comment 1 mandar.mitra 2008-10-14 14:06:51 UTC

Created attachment 320304
Short program to demonstrate the bug.

Comment 2 Pravin Satpute 2008-10-24 08:44:51 UTC

I went through the output.
Yes, it does not give the right result for Bengali matras, vowel modifiers, or digits.

Since only the following types are available:
LOWER	UPPER	ALPHA	DIGIT	ALNUM	PUNCT	GRAPH

the vowel modifiers U0981, U0982, U0983 and the matras U09BC to U09D7 should come under the ALPHA category only,
and the digits (09E6 through 09EF) should come under the DIGIT category.

I will check the other Indic scripts while resolving this bug.

Mandar, what do you say?

Comment 3 mandar.mitra 2008-10-24 16:08:08 UTC

Sounds great.

Minor clarification 1: I think everything that is marked as punctuation in the range U09bc to U09e3 should actually be alpha.

Minor clarification 2: they should also come under ALNUM, GRAPH and PRINT categories, but maybe this is automatically ensured by the implementation.
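
Whether the implementation really keeps these classes consistent can be checked directly. The following is only a sketch: count_alpha_violations is a made-up helper name, and it tests whatever LC_CTYPE locale is currently selected.

```c
#include <assert.h>
#include <wctype.h>

/* Hypothetical helper: count the code points in [first, last] that the
   current locale classifies as alpha but that are missing from at least
   one of alnum, graph, or print. */
int count_alpha_violations(wint_t first, wint_t last)
{
    int violations = 0;
    wint_t wc;

    assert(first <= last);
    for (wc = first; ; wc++) {
        if (iswalpha(wc) &&
            !(iswalnum(wc) && iswgraph(wc) && iswprint(wc)))
            violations++;
        if (wc == last)   /* test here so that last == max value cannot wrap */
            break;
    }
    return violations;
}
```

After selecting a Bengali locale with setlocale, count_alpha_violations(0x09BC, 0x09E3) would cover the range discussed in this comment; a non-zero result would mean the ALNUM/GRAPH/PRINT membership is not automatic.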

Irrelevant point: I don't think U0981, U0982, U0983 are really vowel modifiers. In school, we'd learnt that these were consonants of the ayogbaha (অযোগবাহ) type.

Comment 4 Pravin Satpute 2008-10-31 10:22:48 UTC

(In reply to comment #3)
> Sounds great.
> 
> Minor clarification 1: I think everything that is marked as punctuation in the
> range U09bc to U09e3 should actually be alpha.
> 
> Minor clarification 2: they should also come under ALNUM, GRAPH and PRINT
> categories, but maybe this is automatically ensured by the implementation.
> 

Yes, we will do this :)
I will confirm with Ulrich exactly where to apply the fix.

> Irrelevant point: I don't think U0981, U0982, U0983 are really vowel modifiers.
> In school, we'd learnt that these were consonants of the ayogbaha (অযোগবাহ)
> type.

After a consonant, vowel or Matra character, a character can be used which modifies the vowel sound and is called a "Vowel Modifier". This can be a Chandrabindu, Anuswar or Visarg.

The BIS standard for ISCII defines this.

Comment 5 Bug Zapper 2008-11-26 11:14:19 UTC

This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 8.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 8 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 6 Santhosh Thottingal 2008-11-30 11:39:21 UTC

This bug is present in Malayalam (ml_IN) too. Matra signs are not recognized as letters.

Comment 7 Pravin Satpute 2008-12-01 10:18:51 UTC

Yes, I found that the same problem is present in all Indic scripts. I will file a bug against each language so that the respective language experts can give their suggestions.

Comment 8 Jens Petersen 2008-12-17 04:49:59 UTC

moving to rawhide

Comment 9 Pravin Satpute 2008-12-17 06:32:39 UTC

% The "digit" class must only contain the BASIC LATIN digits, says ISO C 99
% (sections 7.25.2.1.5 and 5.2.1)

Because of this, we can't add our Indic digits to the digit class, but we can add them to the outdigit category, as is done in fa_IR.

http://www2.open-std.org/jtc1/SC22/WG20/docs/n608.txt

b) outdigit - why is this required and how is it different from
"digit"

"digit" classifies all digits, while "outdigit" classifies the
values used for outputting. This keyword is needed to determine
the characters used for outputting.
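
Since iswdigit is limited to the basic Latin digits, an application that needs the numeric value of a Bangla digit has to map the code points itself. A minimal sketch, assuming only the range 09E6 through 09EF given in the description; digit_value and parse_digits are illustrative names, not glibc functions, and no overflow checking is done.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

#define BANGLA_DIGIT_ZERO 0x09E6
#define BANGLA_DIGIT_NINE 0x09EF

/* Return 0..9 for a basic Latin or Bangla digit, -1 for anything else. */
int digit_value(wint_t wc)
{
    if (iswdigit(wc))                 /* basic Latin '0'..'9' only */
        return (int)(wc - L'0');
    if (wc >= BANGLA_DIGIT_ZERO && wc <= BANGLA_DIGIT_NINE)
        return (int)(wc - BANGLA_DIGIT_ZERO);
    return -1;
}

/* Parse a non-empty string of Latin and/or Bangla digits.
   Returns the value, or -1 if the string is empty or has a non-digit. */
long parse_digits(const wchar_t *s)
{
    long value = 0;

    assert(s != NULL);
    if (*s == L'\0')
        return -1;
    for (; *s != L'\0'; s++) {
        int d = digit_value((wint_t)*s);
        if (d < 0)
            return -1;
        value = value * 10 + d;
    }
    return value;
}
```

For example, parse_digits(L"\u09E7\u09E8\u09E9") yields 123, the same value as parse_digits(L"123").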

Comment 10 Ulrich Drepper 2008-12-18 00:11:50 UTC

(In reply to comment #9)
> b) outdigit - why is this required and how is it different from
> "digit"
> 
> "digit" classifies all digits, while "outdigit" classifies the
> values used for outputting. This keyword is needed to determine
> the characters used for outputting.

Just do it as fa_IR locale does it.  It's the correct way:

- define the digits using outdigit

- provide an input mapping using the new map "to_inpunct"

- provide additional information like thousands separators and decimal points
  using the "to_outpunct" map

With that you'll automatically get support for input and output of the additional number format.
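
On the output side, glibc exposes the locale's output digits through the I flag of the printf family, which is a glibc extension and not part of ISO C. The sketch below assumes a glibc system; format_locale_digits is a made-up wrapper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical wrapper: format value with the locale's output digits.
   The 'I' flag is a glibc extension; in a locale without alternative
   output digits it produces ordinary digits.
   Returns what snprintf returns. */
int format_locale_digits(char *buf, size_t size, int value)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "%Id", value);
}
```

In the default "C" locale this gives plain "1234" for 1234. After setlocale(LC_ALL, ...) selects a locale whose definition provides outdigit, the buffer is expected to hold the alternative digits as multibyte characters, so it should be sized for several bytes per digit.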

Comment 11 Pravin Satpute 2008-12-18 12:47:24 UTC

Created attachment 327320
patch created against glibc devel branch

This patch adds a modified copy of the i18n CTYPE to bn_IN. Because the i18n CTYPE places the Indic matra characters in the punct group, this solves the problem.

Comment 12 Pravin Satpute 2008-12-18 12:53:06 UTC

Unlike fa_IR, the Bengali script doesn't have separate thousands separators and decimal points (it uses the Latin ones only).

If there are no issues with this, I will add the same bn_IN CTYPE to the other Indic locales as well (with outdigit).

Comment 13 Runa Bhattacharjee 2008-12-18 13:11:20 UTC

Adding Jamil Ahmed, the Fedora bn coordinator, to the bug and modifying the summary. The bn_BD locale file would also need to be modified in this case.

Comment 14 Ulrich Drepper 2008-12-18 15:03:17 UTC

(In reply to comment #11)
> This patch will add modified i18n CTYPE in bn_IN, as i18n CTYPE has Indic matra
> characters in punct group, it will solve that problem

This is wrong.  Don't copy i18n.  You'll have to use reorder_after.

Comment 15 Ulrich Drepper 2008-12-19 16:12:43 UTC

Oops, it's LC_CTYPE, forget what I wrote.

If the changes affect all languages using the values you're changing, then just change the i18n file.  There is no reason to assume this file is correct for non-Latin languages.

Comment 16 Pravin Satpute 2008-12-22 11:55:40 UTC

Created attachment 327634
Patch created against glibc devel branch

This patch modifies the i18n CTYPE, which has the Indic matra characters in the punct group, and so solves that problem. It also adds the outdigit class to all Indic locales.

Comment 17 Ulrich Drepper 2008-12-30 16:50:31 UTC

I've applied the patch to the upstream cvs.  It'll be in the next rawhide build.  This means all these similar BZs opened can then be closed, right?

Comment 18 Pravin Satpute 2008-12-31 05:34:37 UTC

(In reply to comment #17)
> I've applied the patch to the upstream cvs.  It'll be in the next rawhide
> build.  This means all these similar BZs opened can then be closed, right?

Yes :)

Comment 19 Pravin Satpute 2008-12-31 05:40:07 UTC

*** Bug 473888 has been marked as a duplicate of this bug. ***

Comment 20 Pravin Satpute 2008-12-31 05:41:10 UTC

*** Bug 473898 has been marked as a duplicate of this bug. ***

Comment 21 Pravin Satpute 2008-12-31 05:43:02 UTC

*** Bug 474105 has been marked as a duplicate of this bug. ***

Comment 22 Pravin Satpute 2008-12-31 05:43:47 UTC

*** Bug 474107 has been marked as a duplicate of this bug. ***

Comment 23 Pravin Satpute 2008-12-31 05:44:45 UTC

*** Bug 474117 has been marked as a duplicate of this bug. ***

Comment 24 Pravin Satpute 2008-12-31 05:45:54 UTC

*** Bug 474119 has been marked as a duplicate of this bug. ***

Comment 25 Pravin Satpute 2008-12-31 05:47:31 UTC

*** Bug 474124 has been marked as a duplicate of this bug. ***

Comment 26 Pravin Satpute 2008-12-31 05:48:57 UTC

*** Bug 474127 has been marked as a duplicate of this bug. ***

Comment 27 Pravin Satpute 2008-12-31 06:00:24 UTC

Thanks for applying the patch upstream.

Feel free to reopen if there is any problem.

Comment 28 Parag Nemade 2009-01-28 04:39:40 UTC

Pravin, I looked into the patch and found that the proposed changes cause some locales not to work with the latest glibc build.
I found the following locales not working with glibc-2.9.90-2.i386:
as_IN
bn_BD
hi_IN
hne_IN
mai_IN

Specifically, the patch makes localedef report the following error:
"circular dependencies between locale definitions"

We need to fix this as soon as possible, as F11 Alpha is already missing 5 locales. Can you also look into this issue?
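
A quick way to see from a program which locales are missing on an installed system is to try selecting each one. This is a sketch under assumptions: locale_loads and count_missing are illustrative names, and the exact locale names to probe (for example with a ".UTF-8" suffix) depend on the installation.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: return 1 if the named locale can be selected for
   LC_CTYPE, 0 otherwise. The previous LC_CTYPE setting is restored. */
int locale_loads(const char *name)
{
    char saved[256];
    const char *current;
    int ok;

    assert(name != NULL);
    /* Copy the current name: the returned string may be overwritten
       by the next setlocale call. */
    current = setlocale(LC_CTYPE, NULL);
    snprintf(saved, sizeof saved, "%s", current != NULL ? current : "C");

    ok = setlocale(LC_CTYPE, name) != NULL;
    setlocale(LC_CTYPE, saved);
    return ok;
}

/* Count how many of the n given locale names cannot be selected. */
size_t count_missing(const char *const names[], size_t n)
{
    size_t i, missing = 0;

    assert(names != NULL);
    for (i = 0; i < n; i++)
        if (!locale_loads(names[i]))
            missing++;
    return missing;
}
```

Passing the five names listed above (as_IN, bn_BD, hi_IN, hne_IN, mai_IN) to count_missing on an affected build would be expected to return a non-zero count, though a locale can also be absent simply because it was never installed.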

Comment 29 Pravin Satpute 2009-01-28 07:04:23 UTC

Created attachment 330199
Will solve circular dependencies between locale definitions mr_IN, hi_IN, bn_BD, bn_IN and as_IN

Comment 30 Pravin Satpute 2009-01-28 07:07:29 UTC

Parag,
thanks for reporting this bug.
It created a big problem; the above patch will solve it.

Comment 31 Ulrich Drepper 2009-01-28 15:51:11 UTC

I applied the patch upstream.
