Bug 240696
Summary: | charset ISO8859-8 in affix file unknown to hunspell | | |
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Dan Kenigsberg <danken> |
Component: | hunspell | Assignee: | Caolan McNamara <caolanm> |
Status: | CLOSED RAWHIDE | Severity: | high |
Priority: | medium | Version: | rawhide |
Hardware: | All | OS: | Linux |
Doc Type: | Bug Fix | Last Closed: | 2007-06-19 15:15:13 UTC |
Description
Dan Kenigsberg
2007-05-20 16:25:09 UTC
Created attachment 155062 [details]
change charset name to something recognized by hunspell
Another problem is that Hunspell spellchecks only in the stated character set. This means that the current .aff and .dic files are useless for someone using a UTF-8 locale (which may well be the majority). To make hunspell useful in a Unicode environment, one should:
1. change that SET line to SET UTF-8
2. convert the data files to UTF-8, with something like:
    $ iconv -f hebrew -t utf8 </tmp/he_IL.aff >he_IL.aff
    $ iconv -f hebrew -t utf8 </tmp/he_IL.dic >he_IL.dic
This renders Hunspell useless for 8-bit Hebrew, and expands the data size almost twofold.
Created attachment 155078 [details]
patch to move the hunspell-he dictionaries into hspell package
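For reference, the two conversion steps listed in the description (switching the SET line and re-encoding the data files) can be sketched in Python. This is a minimal illustration, not part of any attached patch: the function names are hypothetical, and it assumes that "hebrew" in the iconv commands refers to ISO 8859-8, which Python exposes as the `iso8859_8` codec.

```python
import re

# Assumption: the 8-bit Hebrew data is ISO 8859-8 (Python codec "iso8859_8").
SOURCE_ENCODING = "iso8859_8"


def convert_affix_to_utf8(raw: bytes) -> bytes:
    """Re-encode an 8-bit .aff file as UTF-8 and rewrite its SET line."""
    text = raw.decode(SOURCE_ENCODING)
    # Touch only the first SET directive; every other line is left as-is.
    text = re.sub(r"^SET\s+\S+", "SET UTF-8", text, count=1, flags=re.MULTILINE)
    return text.encode("utf-8")


def convert_dictionary_to_utf8(raw: bytes) -> bytes:
    """Re-encode an 8-bit .dic file as UTF-8 (no directives to rewrite)."""
    return raw.decode(SOURCE_ENCODING).encode("utf-8")


if __name__ == "__main__":
    # Hypothetical one-word dictionary, built in the 8-bit encoding first.
    sample_dic = "שלום\n".encode(SOURCE_ENCODING)
    converted = convert_dictionary_to_utf8(sample_dic)
    # Each Hebrew letter takes one byte in ISO 8859-8 and two in UTF-8.
    print(len(sample_dic), "bytes ->", len(converted), "bytes")
```

Because every Hebrew letter grows from one byte to two, a dictionary made mostly of Hebrew words comes close to doubling in size, which matches the "almost twofold" remark in the description.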
Looking through the history of where these dictionaries come from, I see that
they come from hspell directly, i.e. from "make myspell" in hspell. So here's a
proposal: do away with the "hunspell-he" src.rpm and move it into hspell, which
you're the Fedora maintainer of, I believe?
The above patch makes hspell create the "hunspell-he" dictionary rpms instead,
and then you can patch the hspell Makefile to change the ISO8859-8 to
ISO8859-8-I. This way the dictionaries are unified and up to date with
each other.
The remaining problem then sounds like the hunspell front end not
doing appropriate character conversion from the system locale to the dictionary
locale and vice versa.
To get that sorted out, can you supply me with some example .txt files in the
offending encoding with some misspelled words which are close enough to
correctly spelled words that some suggestions should be given?
I've checked in a patch to "devel" which does the code set conversion from the locale encoding to that of the dict for spell checking, and back to the locale encoding for the suggested words. It seems to work on some made-up Hebrew text I made a stab at. And changing ISO8859-8 to ISO8859-8-I doesn't *seem* necessary, but they were simple examples.

Created attachment 155106 [details]
few correct (and two incorrect) words in utf-8 encoded Hebrew

Yep, you figured out why I care that the Hebrew dictionary works well, and generating it directly from hspell seems like a good idea. I think that your encoding.patch is highly important, and I've even submitted a feature request regarding this upstream: http://sourceforge.net/tracker/index.php?func=detail&aid=1633413&group_id=143754&atid=756395 .

On my system, having ISO8859-8 makes hunspell-1.1.5.3-2.fc8 ignore all Hebrew characters when run in an 8-bit locale, while ISO8859-8-I works fine. Oddly enough, it is the other way around when run in a UTF-8 locale: ISO8859-8 works, and ISO8859-8-I marks everything as errors.

I suspect that if you use SET RANDOM-TEXT it has the same effect as SET ISO8859-8-I.

Ah, I see. There's a little bit of code that naively assumes that a character whose uppercase is the same as its lowercase is not a "letter". Better to basically use isletter with the locale temporarily set to that of the 8-bit encoding of the dict to get that information. Alternatively, a workaround is to put all the Hebrew "letters" which do not have a case distinction into WORDCHARS in the dict.

Setting the *-I encoding means that the iso-8859-1 settings are used for this, because it's not an encoding known to hunspell, which is why that seems to work. But it is also not an encoding that iconv knows about, which is why it doesn't work in the other case.

Best is to fix the code; back to assigned for now. I'll cook something up tomorrow for this.
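To illustrate the case-based assumption discussed above, here is a minimal Python sketch. It is not hunspell's actual code (which is C++ operating on 8-bit tables); the function names are hypothetical, and `wordchars` merely stands in for the characters a dictionary would list in WORDCHARS.

```python
def naive_is_letter(ch: str) -> bool:
    """The heuristic in question: a letter is a character whose
    uppercase and lowercase forms differ."""
    return ch.upper() != ch.lower()


def is_word_char(ch: str, wordchars: str = "") -> bool:
    """Workaround: also accept characters that are listed explicitly,
    the way a WORDCHARS line would list them."""
    return naive_is_letter(ch) or ch in wordchars


# Hebrew letters alef..tav (U+05D0..U+05EA); none has a case distinction.
HEBREW_LETTERS = "".join(chr(cp) for cp in range(0x05D0, 0x05EB))

if __name__ == "__main__":
    print(naive_is_letter("a"))                    # True: "A" differs from "a"
    print(naive_is_letter("\u05d0"))               # False: alef has no case
    print(is_word_char("\u05d0", HEBREW_LETTERS))  # True: listed explicitly
    print("\u05d0".isalpha())                      # True: a proper letter test
```

The last line shows the alternative mentioned above: asking a real "is this a letter" predicate gives the right answer for caseless scripts without needing an explicit list.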
So, I decided it's easier to use WORDCHARS in the .aff itself to note caseless characters outside the ASCII range which we want to consider part of our words in the 8-bit encoding. So:
a) character conversion patch to convert from locale to dict encoding and back
b) leave .aff dict encoding as ISO8859-8
c) add WORDCHARS ... as in hunspell-he .spec for devel
Feel free to take the hspell .spec patch and the WORDCHARS addition and generate the hunspell-he packages from hspell to unify the dictionaries.

Cool. I'm impressed with your solution (and your immediate response). I still have two problems:
1. I forgot to tell you that modern Hebrew words may contain ' (single quote) and " (double quote) inside them. Just adding these two characters to the WORDCHARS did the trick for the he_IL locale, but not for a UTF-8 locale. Consider the correct words כנ"ל ג'ירפות (Giraffes as well)
2. Is there any way to tell hunspell that all Latin characters are NOT WORDCHARS? It is much more reasonable for a Hebrew spellchecker to accept all English than to reject it all. (And I'm not mentioning a dream of spelling both languages at once.)

Indeed, when the input text is utf-8 a different unicode splitter is used. Some thought is needed there. Perhaps take the 8-bit "letter characters" of the .aff, convert them to unicode, and add them to the unicode splitter's list of letter characters.

On point 2, something like OOo, which uses the hunspell library, will do its own word splitting and script detection and assign e.g. "English" or "German", whatever the user says it is, to the western text and run it through hunspell with the matching dictionary, and the CTL text through hunspell with the Hebrew dictionary, and so on for CJK. This is really the best approach for the more complex situations of mixed-language text, instead of building it into the dictionary itself. That would amount to a feature request for a command line tool which could be told to spellcheck the text with he_IL for CTL characters, en_US for Western ones, and ja_JP for CJK ones. But that's outside my scope here as maintainer :-)

OK, how about 1.1.5.3-3? That one takes the 8-bit dictionary wordchar list, if there is one, and promotes it to unicode for use with the "source text is utf-8" case. And the matching hunspell-he then adds the above quotes as wordchars to the dict.

(As an aside, it *might* be the case that adding "IGNORE" to the .aff would enable a list of letters to ignore, which might allow telling hunspell that all Latin characters are not WORDCHARS. I'm not sure; your mileage may vary there.)

Looking good! (But sadly, the IGNORE line seems to be ignored...)

P.S. I've committed your patch to hspell-1.0-7.fc8. Am I right to keep it there, and not move it to F-7? I guess that it is too late for such changes in F-7.

Yeah, let's not generate hunspell-he from hspell for F-7; that's done and dusted. I'll probably release a hunspell and hunspell-he for F-7 to address these issues in an update once things have settled with the patches in rawhide, to see if there are any unforeseen consequences.

Currently in Rawhide:
hunspell-he - 0.20050112-3.fc8.noarch
File conflict with: hunspell-he - 1.0-7.fc8.i386
/usr/share/myspell/he_IL.aff
/usr/share/myspell/he_IL.dic

(In reply to comment #14)
If I got it right, hunspell-he has been demoted to a subpackage of hspell, effective fc8. This means that the standalone hunspell-he is to be discontinued. Do you know how to (and whether one should) remove it from "devel"?

One way to do it would be to retire hunspell-he in CVS ( http://fedoraproject.org/wiki/PackageMaintainers/PackageEndOfLife ) and ask rel-eng to remove the src.rpm and its noarch build from the repo.