Description of problem:
After reading the Aspell Hebrew .dic and .aff files (written in the ISO8859-8 charset), Hunspell ignores Hebrew text when run in a UTF-8 locale.

Version-Release number of selected component (if applicable):
hunspell-1.1.4-3

How reproducible:
Always

Steps to Reproduce:
1. Take the Hebrew Aspell dictionary from ftp://ftp.gnu.org/gnu/aspell/dict/he/aspell6-he-1.0-0.tar.bz2
2. Remove the initial space character from the .dic file, and change the SET option from ISO8859-8 to ISO8859-8-I in the .aff file.
3. Run hunspell -d he-IL in a UTF-8 locale.

Actual results:
Latin words are recognized as errors. Hebrew words in UTF-8 are simply ignored. Hebrew words in ISO8859-8-I are handled correctly.

Expected results:
I would expect Hunspell to behave as Aspell6 does: no matter which encoding the dictionary is written in, the text being checked is assumed to be in the user's runtime locale. If Aspell is run in a UTF-8 locale, it expects UTF-8 text. If it is run in the he_IL locale, it expects 8-bit text. I bet this happens with other non-Unicode encodings too, but I have not checked, nor have I looked at the code.

Other than that, it would be nice if Hunspell recognized ISO8859-8 (without the -I suffix, which stands for "implicit" directionality). Since visual-order ISO8859-8 is almost never used these days, the name has become a synonym of ISO8859-8-I. Also, Hunspell could behave more like Aspell and ignore leading whitespace in the first line of the .dic file.

Additional info:
This whole bug may disappear if you say that you simply do not support 8-bit encodings. That easy way out would require converting the dictionary files to UTF-8 (which would make them almost twice as big), and would sadden those of us who maintain ISO8859-8-I files.

See also http://ivrix.org.il/bugzilla/show_bug.cgi?id=83 (Ivrix is where the Hebrew speller and Hebrew dictionary are made).
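For illustration only, the request to ignore leading whitespace in the first line of the .dic file (the line that holds the entry count) could look roughly like the sketch below. parse_dic_count is a hypothetical helper name, not a function taken from Hunspell, and this is not a claim about how Hunspell's own parser is written.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper, not Hunspell's actual parser: reads the entry
 * count from the first line of a .dic file, tolerating leading spaces
 * and tabs. Returns the count, or -1 if no usable number is found. */
long parse_dic_count(const char *line)
{
    long count;

    assert(line != NULL);

    /* Skip leading blanks explicitly so the intent is visible. */
    while (*line == ' ' || *line == '\t')
        line++;

    /* Require a digit here, so signs and stray text are rejected. */
    if (!isdigit((unsigned char)*line))
        return -1;

    errno = 0;
    count = strtol(line, NULL, 10);
    if (errno == ERANGE)
        return -1;
    return count;
}
```

With this, a first line of " 329237" and a first line of "329237" would be treated the same way.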
You should discuss this with hunspell upstream. You can also give the proposed Hebrew hunspell dictionary src.rpm at http://people.redhat.com/caolanm/hunspell/ a whirl.
Thanks, I will try to move this request upstream. But please remember that hunspell-he-0.20050112-1 currently does not work (in either locale). One has to change ISO8859-8 to ISO8859-8-I for it to work in an 8-bit locale.

On an unrelated subject, I think you have a bug in hunspell-1.1.4-defaultdictfromlang.patch. The lines

    if ((dicname[i] == '_') && (i+2 < len)) {
        dicname[i+2] = 0;

find the underscore in (say) "he_IL", and then overwrite the final "L" with a NUL. The result is "he_I", which names a nonexistent dictionary file, of course.

P.S. I see that your hunspell-he RPM does not carry the text of the GPL (probably because openoffice dropped it). I don't care much for these legalities, but do RH "suits" allow that?
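To make the off-by-one concrete, here is a small sketch that reproduces the truncation and shows one possible fix. It assumes that dicname is derived from a locale name such as "he_IL.UTF-8" and that the intent is to keep the "ll_CC" part. The function names and the surrounding loop are illustrative and are not taken from the patch.

```c
#include <assert.h>
#include <string.h>

/* Mirrors the quoted lines: cuts at i+2, which removes the second
 * letter of the country code ("he_IL" becomes "he_I"). */
void truncate_buggy(char *dicname)
{
    size_t i, len;

    assert(dicname != NULL);
    len = strlen(dicname);
    for (i = 0; i < len; i++) {
        if ((dicname[i] == '_') && (i + 2 < len)) {
            dicname[i + 2] = 0;
            break;
        }
    }
}

/* Possible fix (an assumption about the intent): cut at i+3, so both
 * letters of the country code survive and only a suffix such as
 * ".UTF-8" is dropped. */
void truncate_fixed(char *dicname)
{
    size_t i, len;

    assert(dicname != NULL);
    len = strlen(dicname);
    for (i = 0; i < len; i++) {
        if ((dicname[i] == '_') && (i + 3 <= len)) {
            dicname[i + 3] = 0;
            break;
        }
    }
}
```

The i + 3 <= len check lets a bare "he_IL" through unchanged, since in that case the write lands on the existing terminator.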
dictfromlang: sure, silly me.

GPL: http://wiki.services.openoffice.org/wiki/Dictionaries#Hebrew_.28Israel.29 is the "upstream" for the dictionaries and doesn't contain the text of the GPL, and those rpms are just direct rpms of that. If the GPL text was part of the original (or if there is a more canonical upstream for the hunspell format dictionaries), then dnaber at openoffice org is the person to contact to get those .zips updated and/or to get a link to the canonical home added to the above page. We'll pick up on them automatically after that.

hunspell front end: yes, clearly hunspell should convert from the locale's encoding to the encoding of the dictionary when checking words, which is what the other consumers of libhunspell, OOo and firefox, do in this circumstance.
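As a rough sketch of that conversion step, the POSIX iconv interface can re-encode a word from the locale's charset into the dictionary's charset before it is looked up. convert_word is a hypothetical helper, not part of Hunspell. The example uses the charset name "ISO-8859-8" on the assumption that the local iconv may not know the "-I" variant by name; the byte mapping is the same for both.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: converts the NUL-terminated word `in` from
 * from_charset to to_charset, writing a NUL-terminated result to `out`.
 * Returns 0 on success, -1 on any failure (unknown charset, invalid or
 * unconvertible input, or output buffer too small). */
int convert_word(const char *from_charset, const char *to_charset,
                 const char *in, char *out, size_t out_size)
{
    iconv_t cd;
    char *inp = (char *)in;  /* iconv takes char **, input is not modified */
    char *outp = out;
    size_t in_left, out_left, rc;

    assert(in != NULL && out != NULL);
    if (out_size == 0)
        return -1;

    /* Note the argument order: target charset first, then source. */
    cd = iconv_open(to_charset, from_charset);
    if (cd == (iconv_t)-1)
        return -1;

    in_left = strlen(in);
    out_left = out_size - 1;  /* keep room for the terminator */
    rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    if (rc != (size_t)-1)
        /* Flush any pending shift state (a no-op for stateless charsets). */
        rc = iconv(cd, NULL, NULL, &outp, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;
    *outp = '\0';
    return 0;
}
```

In a real front end the source charset would typically come from nl_langinfo(CODESET) after calling setlocale(LC_ALL, ""), and the target charset from the SET line of the .aff file.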