222213 – hunspell not handling Hebrew dictionary well

Summary: hunspell not handling Hebrew dictionary well

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	hunspell
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Caolan McNamara
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-01-10 21:55 UTC by Dan Kenigsberg
Modified:	2007-11-30 22:11 UTC (History)
CC List:	0 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2007-01-11 08:24:14 UTC
Type:	---
Embargoed:
Dependent Products:

Description Dan Kenigsberg 2007-01-10 21:55:09 UTC

Description of problem:
After reading the Aspell Hebrew .dic and .aff files (written in the ISO8859-8
charset), Hunspell ignores Hebrew text when run in a UTF-8 locale.

Version-Release number of selected component (if applicable):
hunspell-1.1.4-3

How reproducible:
always

Steps to Reproduce:
1. Take the Hebrew Aspell dictionary from
ftp://ftp.gnu.org/gnu/aspell/dict/he/aspell6-he-1.0-0.tar.bz2
2. Remove the initial space character from the .dic file, and change the SET
option from ISO8859-8 to ISO8859-8-I in the .aff file
3. Run hunspell -d he-IL in a UTF-8 locale
  
Actual results:
Latin words are recognized as errors. Hebrew words (in UTF8) are just ignored.
Hebrew words (in ISO8859-8-I) are treated well.

Expected results:
I would expect Hunspell to behave as Aspell6 does: no matter what encoding the
dictionary is written in, the text being checked is assumed to be in the user's
runtime locale. If Aspell is run in a UTF-8 locale, it expects UTF-8 text, and if
it is run in the he_IL locale, it expects 8-bit text. I bet this happens with
other non-Unicode encodings too, but I have neither checked nor looked at the code.

Other than that, it would be nice if Hunspell recognized ISO8859-8 (without -I,
for "implicit"). Since ISO8859-8 is almost never used these days, it has become
a synonym of ISO8859-8-I. Also, Hunspell could behave more like Aspell and
ignore leading whitespace in the first line of the .dic file.
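
The two smaller requests above could be handled along these lines. This is only a sketch in C, not Hunspell's code: the names normalize_hebrew_charset and parse_dic_count are made up for illustration, and it assumes the first line of the .dic file holds the entry count.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: treat the bare name as a synonym of the -I variant. */
const char *normalize_hebrew_charset(const char *name)
{
    assert(name != NULL);
    if (strcmp(name, "ISO8859-8") == 0)
        return "ISO8859-8-I";
    return name;
}

/* Hypothetical helper: read the entry count from the first line of a
 * .dic file, tolerating leading whitespace. Returns -1 if the line does
 * not start with a non-negative number. */
long parse_dic_count(const char *line)
{
    assert(line != NULL);

    /* Skip the leading blanks explicitly so the intent is visible;
     * strtol would also skip them on its own. */
    while (isspace((unsigned char)*line))
        line++;

    char *end;
    long count = strtol(line, &end, 10);
    if (end == line || count < 0)
        return -1;
    return count;
}
```

With these helpers, a .dic file whose first line is " 329237" and an .aff file that says SET ISO8859-8 would both be accepted without hand editing.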

Additional info:
This whole bug may disappear if you say that you simply do not support 8-bit
encodings. This easy way out would require converting the dictionary files to
UTF-8 (which would make them almost twice as big), and would disappoint those of
us who keep ISO8859-8-I files.

See also http://ivrix.org.il/bugzilla/show_bug.cgi?id=83 (Ivrix is where the
Hebrew Speller and Hebrew dictionary are made)

Comment 1 Caolan McNamara 2007-01-11 08:24:14 UTC

You should discuss this with hunspell upstream. You can also give the proposed
Hebrew hunspell dictionary src.rpm at http://people.redhat.com/caolanm/hunspell/
a whirl.

Comment 2 Dan Kenigsberg 2007-01-11 16:51:43 UTC

Thanks, I will try to move this request upstream. But please remember that
currently hunspell-he-0.20050112-1 does not work (in either locale). One should
change ISO8859-8 to ISO8859-8-I for it to work in an 8-bit locale.

On an unrelated subject, I think you have a bug in
hunspell-1.1.4-defaultdictfromlang.patch. The lines

if ((dicname[i] == '_') && (i+2 < len)) {
    dicname[i+2] = 0;

find the underscore in (say) "he_IL", and then overwrite the final "L" with a
NUL. This produces the name of a nonexistent dictionary file, of course.
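
A minimal sketch of the off-by-one, assuming the patch means to cut a locale name such as "he_IL.UTF-8" down to "he_IL". That intent is a guess, and truncate_to_lang_country is a hypothetical name, not a function from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: truncate "he_IL.UTF-8" to "he_IL" in place. */
void truncate_to_lang_country(char *dicname)
{
    assert(dicname != NULL);

    size_t len = strlen(dicname);
    for (size_t i = 0; i < len; i++) {
        /* The underscore is at i and the two-letter country code is at
         * i+1 and i+2, so the terminator belongs at i+3. Writing it at
         * i+2 instead turns "he_IL" into "he_I". */
        if (dicname[i] == '_' && i + 3 <= len) {
            dicname[i + 3] = '\0';
            break;
        }
    }
}
```

When the name is already "he_IL", the write lands on the existing terminator and the string is left unchanged.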

P.S. I see that your hunspell-he RPM does not carry the text of the GPL
(probably because openoffice dropped it). I don't care much for these
legalities, but do RH "suits" allow that?

Comment 3 Caolan McNamara 2007-01-11 17:26:07 UTC

dictfromlang: sure, silly me.

GPL: http://wiki.services.openoffice.org/wiki/Dictionaries#Hebrew_.28Israel.29
is the "upstream" for the dictionaries and doesn't contain the text of the GPL,
and those rpms are just direct rpms of that. If the GPL text was part of the
original (or if there is a more canonical upstream for the hunspell format
dictionaries), then dnaber at openoffice org is the man to contact to get those
.zips updated and/or a link to the canonical home added to the above page. We'll
pick up on them automatically after that.

hunspell front end: yes, clearly hunspell should convert from the locale encoding
to the encoding of the dictionary when checking words, which is what the other
consumers of libhunspell, OOo and firefox, do in this circumstance.
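
To make the conversion step concrete, here is a deliberately narrow sketch in C. It is not how the hunspell front end, OOo or firefox do it: utf8_to_iso8859_8 is a hypothetical function that handles only ASCII and the Hebrew letters U+05D0..U+05EA, which ISO-8859-8 places at 0xE0..0xFA. A real front end would more likely use a general converter such as iconv.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: convert UTF-8 text to ISO-8859-8 bytes before a
 * dictionary lookup. Returns the number of bytes written (excluding the
 * terminator), or (size_t)-1 on an unsupported character or if out is
 * too small. */
size_t utf8_to_iso8859_8(const char *in, char *out, size_t outsize)
{
    assert(in != NULL && out != NULL);
    if (outsize == 0)
        return (size_t)-1;

    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    while (*p != '\0') {
        unsigned char byte;

        if (*p < 0x80) {
            /* ASCII is identical in both encodings. */
            byte = *p;
            p += 1;
        } else if (p[0] == 0xD7 && p[1] >= 0x90 && p[1] <= 0xAA) {
            /* U+05D0..U+05EA are encoded as 0xD7 0x90..0xAA in UTF-8. */
            byte = (unsigned char)(0xE0 + (p[1] - 0x90));
            p += 2;
        } else {
            return (size_t)-1;
        }

        /* Keep room for this byte and the terminator. */
        if (n + 1 >= outsize)
            return (size_t)-1;
        out[n++] = (char)byte;
    }

    out[n] = '\0';
    return n;
}
```

The converted buffer is what would be handed to the spell checker when the dictionary's SET line names an 8-bit Hebrew charset and the user's locale is UTF-8.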
