Bug 222213 - hunspell not handling Hebrew dictionary well
Product: Fedora
Classification: Fedora
Component: hunspell
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Caolan McNamara
Reported: 2007-01-10 16:55 EST by Dan Kenigsberg
Modified: 2007-11-30 17:11 EST (History)
Last Closed: 2007-01-11 03:24:14 EST
Description Dan Kenigsberg 2007-01-10 16:55:09 EST
Description of problem:
After reading the Aspell Hebrew .dic and .aff files (written in the ISO8859-8
charset), Hunspell ignores Hebrew text when run in a UTF-8 locale.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Take the Hebrew Aspell dictionary from
2. Remove the initial space character from the .dic file, and change the SET
option from ISO8859-8 to ISO8859-8-I in the .aff file
3. run hunspell -d he-IL in a UTF-8 locale
Actual results:
Latin words are flagged as errors. Hebrew words (in UTF8) are just ignored.
Hebrew words (in ISO8859-8-I) are checked correctly.

Expected results:
I would expect Hunspell to behave as Aspell6 does: no matter what encoding the
dictionary is written in, the text being checked is assumed to be in the user's
runtime locale. If Aspell is run in a UTF8 locale, it expects UTF8 text, and if
it is run in the he_IL locale, it expects 8 bit text. I bet this happens with
other non-Unicode encodings too, but I have not checked, nor have I looked at
the code.

Other than that, it would be nice if Hunspell recognized ISO8859-8 (without the
-I, which stands for "implicit" directionality). Since ISO8859-8 proper is almost
never used these days, the name has become a synonym for ISO8859-8-I. Also,
Hunspell could behave more like Aspell and ignore leading whitespace in the
first line of the .dic file.
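
The two leniencies requested above could look roughly like the sketch below. It
is only an illustration, not Hunspell code: canonical_set_name and
parse_dic_count are hypothetical helper names, and the sketch assumes the first
line of the .dic file holds the entry count.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: treat a SET value of "ISO8859-8" as an alias
 * for "ISO8859-8-I"; every other name is passed through unchanged. */
const char *canonical_set_name(const char *set_value)
{
    assert(set_value != NULL);
    if (strcmp(set_value, "ISO8859-8") == 0)
        return "ISO8859-8-I";
    return set_value;
}

/* Hypothetical helper: read the entry count from the first line of a
 * .dic file. strtol skips leading whitespace on its own, so " 1234"
 * and "1234" parse the same way. Returns -1 if no number is found. */
long parse_dic_count(const char *first_line)
{
    assert(first_line != NULL);
    char *end = NULL;
    long count = strtol(first_line, &end, 10);
    if (end == first_line || count < 0)
        return -1;
    return count;
}
```

With helpers like these, step 2 of the reproduction (editing the .dic and .aff
files by hand) would no longer be needed.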

Additional info:
This whole bug may disappear if you say that you simply do not support 8 bit
encodings. This easy way out would require converting the dictionary files to
UTF8 (which would make them almost twice as big), and would sadden those of us
who keep ISO8859-8-I files.

See also http://ivrix.org.il/bugzilla/show_bug.cgi?id=83 (Ivrix is where the
Hebrew Speller and Hebrew dictionary are made)
Comment 1 Caolan McNamara 2007-01-11 03:24:14 EST
You should discuss this with the hunspell upstream. You can always give the
proposed Hebrew hunspell dictionary src.rpm at
http://people.redhat.com/caolanm/hunspell/ a whirl as well.
Comment 2 Dan Kenigsberg 2007-01-11 11:51:43 EST
Thanks, I will try to move this request upstream. But please remember that
currently hunspell-he-0.20050112-1 does not work (in either locale). One has to
change ISO8859-8 to ISO8859-8-I for it to work in an 8 bit locale.

In an unrelated subject, I think you have a bug in
hunspell-1.1.4-defaultdictfromlang.patch . The lines

if ((dicname[i] == '_') && (i+2 < len)) {
    dicname[i+2] = 0;

find the underscore in (say) "he_IL", and then overwrite the final "L" with a
NUL. This produces the name of a nonexistent dictionary file ("he_I"), of course.
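
To make the off-by-one concrete, here is a small sketch. The surrounding loop,
the break, and both function names are assumptions, since only the two lines
above are quoted from the patch. The second function shows one possible fix,
assuming the intent was to keep the two-letter country code and cut off
whatever follows it (such as ".UTF-8").

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Behaviour as quoted above: the NUL lands on the second letter of
 * the country code, so "he_IL" becomes "he_I". */
void truncate_as_patched(char *dicname)
{
    assert(dicname != NULL);
    size_t len = strlen(dicname);
    for (size_t i = 0; i < len; i++) {
        if ((dicname[i] == '_') && (i + 2 < len)) {
            dicname[i + 2] = 0;
            break;
        }
    }
}

/* Possible fix (an assumption about the intent): keep both letters
 * after the underscore and terminate the string right after them. */
void truncate_after_country(char *dicname)
{
    assert(dicname != NULL);
    size_t len = strlen(dicname);
    for (size_t i = 0; i < len; i++) {
        if ((dicname[i] == '_') && (i + 3 <= len)) {
            dicname[i + 3] = 0; /* at worst rewrites the existing NUL */
            break;
        }
    }
}
```

Under that assumption, both "he_IL" and "he_IL.UTF-8" end up as "he_IL" with
the second function.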

P.S. I see that your hunspell-he RPM does not carry the text of the GPL
(probably because openoffice dropped it). I don't care much for these
legalities, but do RH "suits" allow that?
Comment 3 Caolan McNamara 2007-01-11 12:26:07 EST
dictfromlang: sure, silly me.

GPL: http://wiki.services.openoffice.org/wiki/Dictionaries#Hebrew_.28Israel.29
is the "upstream" for the dictionaries and doesn't contain the text of the GPL,
and those rpms are just direct rpms of that. If the GPL text was part of the
original (or if there is a more canonical upstream for the hunspell format
dictionaries), then dnaber at openoffice org is the man to contact to get those
.zips updated and/or a link to the canonical home added to the above page. We'll
pick up on them automatically after that.

hunspell front end: yes, clearly hunspell should convert from the locale
encoding to the encoding of the dictionary when checking words, which is what
the other consumers of libhunspell, OOo and Firefox, do in this circumstance.
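
As a rough illustration of that conversion step, the sketch below turns a UTF-8
word into ISO8859-8 bytes before it would be looked up. It is not hunspell
code: utf8_to_iso8859_8 is a hypothetical name, and the hand-written mapping
covers only ASCII and the Hebrew letters U+05D0..U+05EA. A real front end would
more likely use a general converter such as iconv.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a NUL-terminated UTF-8 word to ISO8859-8 bytes.
 * Handles ASCII and the Hebrew letters U+05D0..U+05EA only.
 * Returns the number of bytes written (not counting the NUL), or -1
 * if the input contains anything else or the output buffer is too small. */
int utf8_to_iso8859_8(const char *in, char *out, size_t out_size)
{
    assert(in != NULL && out != NULL);
    if (out_size == 0)
        return -1;

    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;
    while (*p != 0) {
        unsigned char byte;
        if (*p < 0x80) {
            byte = *p;              /* ASCII is identical in both encodings */
            p += 1;
        } else if (p[0] == 0xD7 && p[1] >= 0x90 && p[1] <= 0xAA) {
            /* U+05D0..U+05EA are D7 90..D7 AA in UTF-8 and
             * E0..FA in ISO8859-8, a constant offset of 0x50. */
            byte = (unsigned char)(p[1] + 0x50);
            p += 2;
        } else {
            return -1;              /* not representable in this sketch */
        }
        if (n + 1 >= out_size)
            return -1;              /* keep room for the terminating NUL */
        out[n++] = (char)byte;
    }
    out[n] = '\0';
    return (int)n;
}
```

A front end running in a UTF-8 locale could pass the converted bytes to the
checker, and fall back to reporting the word as unknown when the conversion
returns -1.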
