Red Hat Bugzilla – Bug 240696
charset ISO8859-8 in affix file unknown to hunspell
Last modified: 2007-11-30 17:12:04 EST
Description of problem:
Hunspell does not recognize Hebrew input
Version-Release number of selected component (if applicable):
How reproducible: Always
Steps to Reproduce:
1. run hunspell -d he_IL in he_IL locale.
2. type Hebrew words.
Actual results: Hebrew characters are ignored, as if they were whitespace.
Expected results: Correct words accepted, misspelled words rejected.
Hunspell does not seem to recognize the character set name ISO8859-8. When the
affix file is changed to declare ISO8859-8-I, the problem disappears (and that
is, strictly speaking, the correct charset of the affix file).
Created attachment 155062 [details]
change charset name to something recognized by hunspell
Another problem is that Hunspell spellchecks only in the stated character set.
This means that the current .aff and .dic files are useless for someone using a
UTF-8 locale (which may well be the majority).
To make hunspell useful in a Unicode environment, one should:
1. change the SET line to SET UTF-8
2. convert the data files to UTF-8, with something like:
$ iconv -f hebrew -t utf8 </tmp/he_IL.aff >he_IL.aff
$ iconv -f hebrew -t utf8 </tmp/he_IL.dic >he_IL.dic
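The two steps above can also be sketched in Python. This is only an illustration: convert_aff is a hypothetical helper, and the sample affix content is made up for the example. It decodes the 8-bit data, rewrites the SET line, and re-encodes as UTF-8; the last lines show that each Hebrew letter takes one byte in ISO-8859-8 and two bytes in UTF-8.

```python
# Sketch only: convert_aff is a hypothetical helper, not part of Hunspell.

def convert_aff(raw: bytes, src_encoding: str = "iso8859_8") -> bytes:
    """Decode 8-bit affix data, point its SET line at UTF-8, and re-encode."""
    text = raw.decode(src_encoding)
    out_lines = []
    for line in text.splitlines():
        # The SET line declares the encoding of the .aff/.dic pair.
        if line.startswith("SET "):
            line = "SET UTF-8"
        out_lines.append(line)
    return ("\n".join(out_lines) + "\n").encode("utf-8")


# Made-up sample affix content, encoded as the original files are.
sample = "SET ISO8859-8\nWORDCHARS אבג\n".encode("iso8859_8")
converted = convert_aff(sample)
print(converted.decode("utf-8"))

# Per-letter size in each encoding for a four-letter Hebrew word.
word = "שלום"
print(len(word.encode("iso8859_8")), len(word.encode("utf-8")))
```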
This renders Hunspell useless for 8bit Hebrew, and expands the data size almost
Created attachment 155078 [details]
patch to move the hunspell-he dictionaries into hspell package
Looking through the history of where these dictionaries come from, I see that
they come directly from hspell, i.e. from "make myspell" in hspell. So here's a
proposal: do away with the "hunspell-he" src.rpm and move it into hspell, which
I believe you are the Fedora maintainer of?
The above patch makes hspell create the "hunspell-he" dictionary rpms instead,
and then you can patch the hspell Makefile to change the ISO8859-8 to
ISO8859-8-I. This way the dictionaries are unified and up to date with
The remaining problem then sounds like the hunspell front end not doing
appropriate character conversion from the system locale to the dictionary
locale and vice versa.
To get that sorted out, can you supply me with some example .txt files in the
offending encoding with some misspelled words which are close enough to
correctly spelled words that some suggestions should be given ?
I've checked in a patch to "devel" which will do the code set conversion from
the locale encoding to that of the dict for spell checking and back to the
locale encoding for the suggested words. It seems to work on some made-up Hebrew
text I took a stab at.
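For illustration, the round trip described above might look roughly like this in Python. This is not the actual patch (hunspell itself is not written in Python), and both function names are hypothetical: the input word is converted from the locale encoding to the dictionary encoding before checking, and suggestions are converted back afterwards.

```python
# Sketch only: these helpers are hypothetical and not taken from the patch.

def to_dict_encoding(word: bytes, locale_enc: str, dict_enc: str) -> bytes:
    """Convert a word from the locale encoding to the dictionary encoding."""
    return word.decode(locale_enc).encode(dict_enc)


def to_locale_encoding(suggestions, dict_enc: str, locale_enc: str):
    """Convert suggested words from the dictionary encoding back to the locale's."""
    return [s.decode(dict_enc).encode(locale_enc) for s in suggestions]


# A UTF-8 locale talking to an ISO-8859-8 dictionary.
typed = "שלום".encode("utf-8")
for_dict = to_dict_encoding(typed, "utf-8", "iso8859_8")
# Pretend the dictionary returned the same word as its only suggestion.
shown = to_locale_encoding([for_dict], "iso8859_8", "utf-8")
print(len(typed), len(for_dict), shown[0].decode("utf-8"))
```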
And changing ISO8859-8 to ISO8859-8-I doesn't *seem* necessary, but they were
Created attachment 155106 [details]
few correct (and two incorrect) words in utf-8 encoded Hebrew
Yep, you figured out why I care that the Hebrew dictionary works well, and
generating it directly from hspell seems like a good idea.
I think that your encoding.patch is highly important, and I've even
submitted a feature request regarding this upstream
On my system having ISO8859-8 makes hunspell-126.96.36.199-2.fc8 ignore all Hebrew
characters when run in an 8-bit locale, and ISO8859-8-I works fine. Oddly enough,
it is the other way around when run in a UTF-8 locale: ISO8859-8 works, and
ISO8859-8-I marks everything as errors.
I suspect if you use SET RANDOM-TEXT it has the same effect as SET ISO8859-8-I
ah, I see. There's a little bit of code that naively assumes that a character
whose uppercase is the same as its lower case is not a "letter". Better to
basically use isletter with the locale temporarily set to that of the 8bit
encoding of the dict to get that information.
Alternatively a workaround is to put all the hebrew "letters" which do not have
a case distinction into WORDCHARS in the dict.
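A small Python sketch of why the case-based test fails for Hebrew, and how a WORDCHARS-style list works around it. The function names and the HEBREW_LETTERS set are assumptions made for the example, not hunspell code.

```python
# Sketch only: names here are hypothetical, not hunspell internals.

def naive_is_letter(ch: str) -> bool:
    """The naive assumption: a letter is a character whose cases differ."""
    return ch.upper() != ch.lower()


# Stand-in for a WORDCHARS list: the Hebrew letters alef..tav (U+05D0..U+05EA).
HEBREW_LETTERS = frozenset(chr(cp) for cp in range(0x05D0, 0x05EB))


def is_word_char(ch: str, wordchars=HEBREW_LETTERS) -> bool:
    """Workaround: also accept caseless characters listed explicitly."""
    return naive_is_letter(ch) or ch in wordchars


for ch in ("a", "א", " "):
    print(repr(ch), naive_is_letter(ch), ch.isalpha(), is_word_char(ch))
```

The middle column of the output uses a proper "is this alphabetic" query, which is the kind of fix suggested above instead of comparing cases.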
Setting the *-I encoding means that the iso-8859-1 settings are used for this,
because it's not an encoding known to hunspell, which is why that seems to work.
But it's also not an encoding that iconv knows about, which is why it
doesn't work in the other case.
Best is to fix the code, back to assigned for now. I'll cook something up
tomorrow for this.
So, I decided it's easier to use WORDCHARS in the .aff itself to note caseless
characters outside the ascii range which we want to consider part of our words
in the 8bit encoding.
a) character conversion patch to convert from locale to dict encoding and back
b) leave .aff dict encoding as ISO8859-8
c) add WORDCHARS ... as in hunspell-he .spec
For devel, feel free to take the hspell .spec patch and the WORDCHARS addition
and generate the hunspell-he packages from hspell to unify the dictionaries.
Cool. I'm impressed with your solution (and your immediate response).
I still have two problems:
1. I forgot to tell you that modern Hebrew words may contain ' (single quote) and "
(double quote) inside them. Just adding these two characters to the WORDCHARS
did the trick for the he_IL locale, but not for a UTF-8 locale. Consider the correct
words כנ"ל ג'ירפות (giraffes as well)
2. Is there any way to tell hunspell that all Latin characters are NOT WORDCHARS? It
is much more reasonable for a Hebrew spellchecker to accept all English than to
reject it all. (and I'm not mentioning a dream of spelling both languages at once)
Indeed, when the input text is utf-8 a different unicode splitter is used. Some
thought needed there. Perhaps take the 8bit "letter characters" of the .aff and
convert to unicode and add to the unicode splitter list of letter characters.
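A rough Python sketch of that idea, assuming the extra word characters arrive as 8-bit bytes in the dictionary encoding. promote_wordchars and split_words are hypothetical names, and the splitter is far simpler than hunspell's real one.

```python
# Sketch only: a toy splitter, not hunspell's unicode tokenizer.

def promote_wordchars(raw: bytes, dict_enc: str = "iso8859_8") -> frozenset:
    """Decode the 8-bit WORDCHARS list into a set of unicode characters."""
    return frozenset(raw.decode(dict_enc))


def split_words(text: str, extra=frozenset()) -> list:
    """Split text into words made of alphabetic or explicitly listed characters."""
    words, current = [], []
    for ch in text:
        if ch.isalpha() or ch in extra:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# The two example words from the report, containing " and ' respectively.
word1 = 'כנ"ל'
word2 = "ג'ירפות"
text = word1 + " " + word2

extra = promote_wordchars(b"'\"")
print(split_words(text))         # quotes break the words apart
print(split_words(text, extra))  # quotes kept inside the words
```

One caveat of this simple version: a quote character used as actual quotation punctuation around a word would be glued onto that word as well.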
On point 2, something like OOo which uses the hunspell library will do its own
word splitting and script detection and assign e.g. "English", or "German",
whatever the user says it is, to the western text and run it through hunspell
with the matching dictionary, and the CTL text through hunspell with the hebrew
dictionary, and so on for CJK.
This is really the best approach for the more complex situations of mixed
language text instead of building it into the dictionary itself. i.e. a feature
request to have a command line tool which could be told to spellcheck the text
with he_IL for CTL characters, en_US for Western ones, and ja_JP for CJK ones.
But that's outside my scope here as maintainer :-)
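To make the idea concrete, here is a minimal Python sketch of routing each word to a dictionary by script. It relies only on Unicode character names from the standard library; the script-to-dictionary mapping and both function names are assumptions for illustration, and no such command line tool is claimed to exist.

```python
# Sketch only: hypothetical per-script routing, not an existing hunspell feature.
import unicodedata


def script_dictionary(word: str):
    """Pick a dictionary name from the script of the first alphabetic character."""
    for ch in word:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("HEBREW"):
            return "he_IL"
        if name.startswith("LATIN"):
            return "en_US"
        if name.startswith(("CJK", "HIRAGANA", "KATAKANA")):
            return "ja_JP"
        return None  # some other script: no dictionary assigned
    return None  # no letters at all


def route(text: str):
    """Pair every whitespace-separated word with the dictionary that should check it."""
    return [(word, script_dictionary(word)) for word in text.split()]


for word, dictionary in route("שלום hello 日本語 123"):
    print(word, dictionary)
```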
ok, how about 188.8.131.52-3, that one takes the 8 bit dictionary wordchar list if
there is one and promotes it to unicode for use with the "source text is utf-8"
case. And the matching hunspell-he then adds those above quotes as wordchars to
(as an aside, it *might* be the case that adding "IGNORE" to the .aff would
enable a list of letters to ignore which might allow telling hunspell that all
Latin characters are not WORDCHARS, I'm not sure, your mileage may vary there)
(but sadly, the IGNORE line seems to be ignored...)
P.S. I've committed your patch to hspell-1.0-7.fc8. Am I right to keep it there,
and not move it to F-7? I guess that it is too late for such changes in F-7.
yeah, let's not generate hunspell-he from hspell for F-7, that's done and dusted.
I'll probably release a hunspell and hunspell-he for F-7 to address these issues
in an update once things have settled with the patches in rawhide, to see if
there are any unforeseen consequences.
Currently in Rawhide:
hunspell-he - 0.20050112-3.fc8.noarch
File conflict with: hunspell-he - 1.0-7.fc8.i386
(In reply to comment #14)
If I got it right, hunspell-he has been demoted to a subpackage of hspell,
effective fc8. This means that the standalone hunspell-he is to be discontinued.
Do you know how to (and whether one should) remove it from "devel"?
One way to do it would be to retire hunspell-he in CVS
( http://fedoraproject.org/wiki/PackageMaintainers/PackageEndOfLife )
and ask rel-eng to remove the src.rpm and its noarch build from