240696 – charset ISO8859-8 in affix file unknown to hunspell

Summary: charset ISO8859-8 in affix file unknown to hunspell

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	hunspell
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Caolan McNamara
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-05-20 16:25 UTC by Dan Kenigsberg
Modified:	2007-11-30 22:12 UTC (History)
CC List:	0 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2007-06-19 15:15:13 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
change charset name to something recognized by hunspell (333 bytes, patch) 2007-05-20 16:25 UTC, Dan Kenigsberg
patch to move the hunspell-he dictionaries into hspell package (1.61 KB, patch) 2007-05-21 09:04 UTC, Caolan McNamara
few correct (and two incorrect) words in utf-8 encoded Hebrew (142 bytes, application/octet-stream) 2007-05-21 16:28 UTC, Dan Kenigsberg

Description Dan Kenigsberg 2007-05-20 16:25:09 UTC

Description of problem:
Hunspell does not recognize Hebrew input

Version-Release number of selected component (if applicable):
hunspell-he-0.20050112-1.fc7

How reproducible: Always

Steps to Reproduce:
1. run hunspell -d he_IL in he_IL locale.
2. type Hebrew words.
  
Actual results:
Hebrew characters are ignored, as if they were whitespace.

Expected results:
Correct words accepted, misspelled words rejected.

Additional info:
Hunspell seems unfamiliar with the charset name ISO8859-8. When the affix file
is changed to say ISO8859-8-I, the problem disappears (and that is, strictly
speaking, the correct charset of the affix file).

Comment 1 Dan Kenigsberg 2007-05-20 16:25:10 UTC

Created attachment 155062
change charset name to something recognized by hunspell

Comment 2 Dan Kenigsberg 2007-05-20 16:44:57 UTC

Another problem is that Hunspell spellchecks only in the stated character set.
This means that the current .aff and .dic files are useless for someone using a
UTF-8 locale (which may well be the majority).

To make hunspell useful in a Unicode environment, one should:
1. change that SET line to SET UTF-8
2. convert the data files to UTF-8, with something like:
$ iconv -f hebrew -t utf8 </tmp/he_IL.aff >he_IL.aff
$ iconv -f hebrew -t utf8 </tmp/he_IL.dic >he_IL.dic

However, this conversion renders Hunspell useless for 8-bit Hebrew, and expands
the data size almost twofold.
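
A minimal Python sketch of the two steps above, assuming the affix data is ISO-8859-8 and declares its charset on a `SET` line. The helper name `convert_aff_to_utf8` and the two-line sample are hypothetical and only for illustration; they are not taken from the real he_IL.aff. The last lines show where the near-twofold growth comes from: each Hebrew letter is one byte in ISO-8859-8 and two bytes in UTF-8.

```python
def convert_aff_to_utf8(raw: bytes, source_encoding: str = "iso8859_8") -> bytes:
    """Re-encode affix data as UTF-8 and rewrite its SET line to match."""
    text = raw.decode(source_encoding)
    lines = []
    for line in text.splitlines():
        # Only the charset declaration changes; every other line is kept as-is.
        if line.split()[:1] == ["SET"]:
            line = "SET UTF-8"
        lines.append(line)
    return ("\n".join(lines) + "\n").encode("utf-8")


# Hypothetical sample affix data (alef and bet as WORDCHARS), stored as 8-bit bytes.
sample = "SET ISO8859-8\nWORDCHARS \u05d0\u05d1\n".encode("iso8859_8")
converted = convert_aff_to_utf8(sample)

# Size comparison for a four-letter Hebrew word (shin, lamed, vav, final mem).
word = "\u05e9\u05dc\u05d5\u05dd"
size_8bit = len(word.encode("iso8859_8"))  # one byte per Hebrew letter
size_utf8 = len(word.encode("utf-8"))      # two bytes per Hebrew letter
```

ASCII content such as flag names and the `SET` keyword stays one byte per character, which is why the overall growth is "almost" rather than exactly twofold.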

Comment 3 Caolan McNamara 2007-05-21 09:04:27 UTC

Created attachment 155078
patch to move the hunspell-he dictionaries into hspell package

Looking through the history of where these dictionaries come from, I see that
they come from hspell directly, i.e. from "make myspell" in hspell. So here's a
proposal: do away with the "hunspell-he" src.rpm and move it into hspell, which
I believe you're the Fedora maintainer of?

The above patch makes hspell create the "hunspell-he" dictionary rpms instead,
and then you can patch the hspell Makefile to change the ISO8859-8 to
ISO8859-8-I. This way the dictionaries are unified and up to date with
each other.

The remaining problem then sounds like the hunspell front end not doing
appropriate character conversion from the system locale to the dictionary
encoding and vice versa.

To get that sorted out, can you supply me with some example .txt files in the
offending encoding with some misspelled words which are close enough to
correctly spelled words that some suggestions should be given ?

Comment 4 Caolan McNamara 2007-05-21 13:14:47 UTC

I've checked in a patch to "devel" which will do the code set conversion from
the locale encoding to that of the dict for spell checking and back to the
locale encoding for the suggested words. Seems to work on some makey-up hebrew
text I made a stab at. 

And changing ISO8859-8 to ISO8859-8-I doesn't *seem* necessary, but they were
simple examples.
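
For readers following along, the round trip described here can be sketched in Python as below. This is not the patch that was checked in; the function name `spell_with_conversion`, the one-word dictionary, and the stand-in suggester are all hypothetical, and the only assumption about the real change is what the comment states: input is converted from the locale encoding to the dictionary encoding for the check, and suggestions are converted back.

```python
from typing import Callable, List, Set, Tuple


def spell_with_conversion(
    word: bytes,
    locale_encoding: str,
    dict_encoding: str,
    known_words: Set[bytes],
    suggest: Callable[[bytes], List[bytes]],
) -> Tuple[bool, List[bytes]]:
    """Check a locale-encoded word against a dictionary kept in another encoding.

    Suggestions are returned in the locale encoding.
    """
    # Locale encoding -> dictionary encoding for the lookup.
    in_dict = word.decode(locale_encoding).encode(dict_encoding)
    if in_dict in known_words:
        return True, []
    # Dictionary encoding -> locale encoding for anything shown to the user.
    suggestions = [
        s.decode(dict_encoding).encode(locale_encoding) for s in suggest(in_dict)
    ]
    return False, suggestions


# Hypothetical one-word dictionary stored as ISO-8859-8 bytes.
shalom = "\u05e9\u05dc\u05d5\u05dd"
known = {shalom.encode("iso8859_8")}


def always_shalom(_word: bytes) -> List[bytes]:
    """Stand-in suggester that always proposes the single known word."""
    return [shalom.encode("iso8859_8")]


ok, _ = spell_with_conversion(
    shalom.encode("utf-8"), "utf-8", "iso8859_8", known, always_shalom
)
bad, hints = spell_with_conversion(
    "\u05e9\u05dc\u05d5".encode("utf-8"), "utf-8", "iso8859_8", known, always_shalom
)
```

A real front end would also have to decide what to do with input characters that the dictionary encoding cannot represent; in this sketch `encode` would simply raise `UnicodeEncodeError`.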

Comment 5 Dan Kenigsberg 2007-05-21 16:28:18 UTC

Created attachment 155106
few correct (and two incorrect) words in utf-8 encoded Hebrew

Yep, you figured out why I care that the Hebrew dictionary works well, and
generating it directly from hspell seems like a good idea.

I think that your encoding.patch is highly important, and I've even
submitted a feature request regarding this upstream:
http://sourceforge.net/tracker/index.php?func=detail&aid=1633413&group_id=143754&atid=756395

On my system, having ISO8859-8 makes hunspell-1.1.5.3-2.fc8 ignore all Hebrew
characters when run in an 8-bit locale, and ISO8859-8-I works fine. Oddly
enough, it is the other way around when run in a UTF-8 locale: ISO8859-8 works,
and ISO8859-8-I marks everything as errors.

Comment 6 Caolan McNamara 2007-05-21 17:03:04 UTC

I suspect if you use SET RANDOM-TEXT it has the same effect as SET ISO8859-8-I

Comment 7 Caolan McNamara 2007-05-21 19:06:43 UTC

ah, I see. There's a little bit of code that naively assumes that a character
whose uppercase is the same as its lower case is not a "letter". Better to
basically use isletter with the locale temporarily set to that of the 8bit
encoding of the dict to get that information. 
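
The failure mode is easy to reproduce outside hunspell. The Python sketch below is only an analogy: `naive_is_letter` is a hypothetical name for the heuristic just described, and `str.isalpha` (which consults the Unicode character database) stands in for a locale-aware letter test.

```python
def naive_is_letter(ch: str) -> bool:
    """Heuristic from the comment: a letter is a character with distinct cases."""
    return ch.upper() != ch.lower()


alef = "\u05d0"  # HEBREW LETTER ALEF; Hebrew has no upper/lower case forms

naive_latin = naive_is_letter("a")    # cased, so the heuristic accepts it
naive_hebrew = naive_is_letter(alef)  # caseless, so the heuristic rejects it
proper_hebrew = alef.isalpha()        # a real letter test accepts it
```

Any caseless script hits the same problem, which is why Hebrew input looked like whitespace to the tokenizer.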

Alternatively a workaround is to put all the hebrew "letters" which do not have
a case distinction into WORDCHARS in the dict.

Setting the *-I encoding means that the iso-8859-1 settings are used for this,
because it's not an encoding known to hunspell, which is why that seems to work.
But it's also not an encoding that iconv knows about, which is why it doesn't
work in the other case.

Best is to fix the code, back to assigned for now. I'll cook something up
tomorrow for this.

Comment 8 Caolan McNamara 2007-05-22 08:18:57 UTC

So, I decided it's easier to use WORDCHARS in the .aff itself to note caseless
characters outside the ascii range which we want to consider part of our words
in the 8bit encoding.

So

a) character conversion patch to convert from locale to dict encoding and back
b) leave .aff dict encoding as ISO8859-8
c) add WORDCHARS ... as in hunspell-he .spec

for devel. Feel free to take the hspell .spec patch and the WORDCHARS addition
and generate the hunspell-he packages from hspell to unify the dictionaries.
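
As an illustration of step (c), here is a Python sketch that builds a WORDCHARS line for an affix file declaring `SET ISO8859-8`. The exact character list used in the hunspell-he .spec is not shown in this report, so the choice below (the 27 Hebrew letters alef through tav, including final forms) is an assumption.

```python
# Hebrew letters alef..tav occupy U+05D0..U+05EA (27 code points).
hebrew_letters = "".join(chr(cp) for cp in range(0x05D0, 0x05EA + 1))

# Encode the line in the same 8-bit charset that the affix file declares.
wordchars_line = ("WORDCHARS " + hebrew_letters + "\n").encode("iso8859_8")
```

In ISO-8859-8 those letters land on the single bytes 0xE0 through 0xFA, so the resulting line stays 8-bit clean for a dictionary that keeps the ISO8859-8 declaration.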

Comment 9 Dan Kenigsberg 2007-05-22 09:01:47 UTC

Cool. I'm impressed with your solution (and your immediate response).

I still have two problems:
I forgot to tell you that modern Hebrew words may contain ' (single quote) and "
(double quote) inside them. Just adding these two characters to the WORDCHARS
did the trick for the he_IL locale, but not for a UTF-8 locale. Consider the
correct words כנ"ל ג'ירפות (giraffes as well)

Is there any way to tell hunspell that all Latin characters are NOT WORDCHARS? It
is much more reasonable for a Hebrew spellchecker to accept all English than to
reject it all. (and I'm not mentioning a dream of spelling both languages at once)

Comment 10 Caolan McNamara 2007-05-22 10:05:19 UTC

Indeed, when the input text is utf-8 a different unicode splitter is used. Some
thought needed there. Perhaps take the 8bit "letter characters" of the .aff and
convert to unicode and add to the unicode splitter list of letter characters.

On point 2, something like OOo which uses the hunspell library will do its own
word splitting and script detection and assign e.g. "English", or "German",
whatever the user says it is, to the western text and run it through hunspell
with the matching dictionary, and the CTL text through hunspell with the hebrew
dictionary, and so on for CJK.

This is really the best approach for the more complex situations of mixed
language text instead of building it into the dictionary itself. i.e. a feature
request to have a command line tool which could be told to spellcheck the text
with he_IL for CTL characters, en_US for Western ones, and ja_JP for CJK ones.
But that's outside my scope here as maintainer :-)
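
To make the idea concrete, here is a rough Python sketch of such per-script routing. It is not how OOo or hunspell does it: the script guess relies on Unicode character names, and the names `script_of`, `tag_words` and `DICT_FOR_SCRIPT` are hypothetical.

```python
import unicodedata
from typing import List, Tuple


def script_of(ch: str) -> str:
    """Very rough script guess based on the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("HEBREW"):
        return "hebrew"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def tag_words(text: str) -> List[Tuple[str, str]]:
    """Tag each whitespace-separated word with the script of its letters."""
    tagged = []
    for word in text.split():
        scripts = {script_of(ch) for ch in word if ch.isalpha()}
        # Words that mix scripts, or contain no letters, are tagged "other".
        script = scripts.pop() if len(scripts) == 1 else "other"
        tagged.append((script, word))
    return tagged


# Hypothetical mapping from detected script to dictionary name.
DICT_FOR_SCRIPT = {"hebrew": "he_IL", "latin": "en_US"}

tagged = tag_words("hello \u05e9\u05dc\u05d5\u05dd")
routing = [(DICT_FOR_SCRIPT.get(script), word) for script, word in tagged]
```

Each `(dictionary, word)` pair could then be handed to a checker loaded with that dictionary; words tagged "other" map to `None` and would be skipped.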

Comment 11 Caolan McNamara 2007-05-22 11:51:52 UTC

ok, how about 1.1.5.3-3, that one takes the 8 bit dictionary wordchar list if
there is one and promotes it to unicode for use with the "source text is utf-8"
case. And the matching hunspell-he then adds those above quotes as wordchars to
the dict.
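
The promotion step can be pictured with the Python sketch below. It is an illustration under assumptions, not the 1.1.5.3-3 code: `promote_wordchars` and `split_words` are hypothetical names, the WORDCHARS payload is reduced to the two quote characters, and `str.isalpha` stands in for the splitter's own list of letter characters.

```python
from typing import List, Set


def promote_wordchars(wordchars_8bit: bytes, dict_encoding: str) -> Set[str]:
    """Decode the dictionary's 8-bit WORDCHARS list into Unicode characters."""
    return set(wordchars_8bit.decode(dict_encoding))


def split_words(text: str, extra_wordchars: Set[str]) -> List[str]:
    """Split text into words, treating the extra characters as word characters."""
    words: List[str] = []
    current: List[str] = []
    for ch in text:
        if ch.isalpha() or ch in extra_wordchars:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# Hypothetical WORDCHARS payload: just the two quote characters, as 8-bit bytes.
extra = promote_wordchars(b"'\"", "iso8859_8")

# The two words from comment 9, written with escapes to avoid bidi reordering.
kanal = "\u05db\u05e0\"\u05dc"                      # kaf, nun, double quote, lamed
giraffes = "\u05d2'\u05d9\u05e8\u05e4\u05d5\u05ea"  # gimel, single quote, yod, ...

with_quotes = split_words(kanal + " " + giraffes, extra)
without_quotes = split_words(kanal + " " + giraffes, set())
```

With the promoted characters both words survive intact; without them each word is cut in two at its quote. The sketch also treats a leading or trailing quote as part of a word, which a real splitter would probably want to trim.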

(as an aside, it *might* be the case that adding "IGNORE" to the .aff would
enable a list of letters to ignore which might allow telling hunspell that all
Latin characters are not WORDCHARS, I'm not sure, your mileage may vary there)

Comment 12 Dan Kenigsberg 2007-05-22 12:33:17 UTC

Looking good!

(but sadly, the IGNORE line seems to be ignored...)

P.S. I've committed your patch to hspell-1.0-7.fc8. Am I right to keep it there,
and not move it to F-7? I guess that it is too late for such changes in F-7.

Comment 13 Caolan McNamara 2007-05-22 12:44:39 UTC

yeah, let's not generate hunspell-he from hspell for F-7, that's done and dusted.

I'll probably release a hunspell and hunspell-he for F-7 to address these issues
in an update once things have settled with the patches in rawhide, to see if
there are any unforeseen consequences.

Comment 14 Michael Schwendt 2007-06-11 16:48:56 UTC

Currently in Rawhide:

hunspell-he - 0.20050112-3.fc8.noarch
  File conflict with: hunspell-he - 1.0-7.fc8.i386
     /usr/share/myspell/he_IL.aff
     /usr/share/myspell/he_IL.dic

Comment 15 Dan Kenigsberg 2007-06-11 17:31:08 UTC

(In reply to comment #14)

If I got it right, hunspell-he has been demoted to a subpackage of hspell,
effective fc8. This means that the standalone hunspell-he is to be discontinued.
Do you know how to (and whether one should) remove it from "devel"?

Comment 16 Michael Schwendt 2007-06-11 17:44:56 UTC

One way to do it would be to retire hunspell-he in CVS
( http://fedoraproject.org/wiki/PackageMaintainers/PackageEndOfLife )
and ask rel-eng to remove the src.rpm and its noarch build from
the repo.
