Red Hat Bugzilla – Bug 485960
RFE: spellchecker should not be restricted to single language when checking multi-script document
Last modified: 2009-02-18 10:43:41 EST
Description of problem:
When checking a single-script document, the spellchecker needs to be told which language to use.
For example, if the document contains French and English text, it must be told which dictionary to use: both languages use the Latin script, so the language can't be detected from the script alone.
But if the document contains English and Arabic text, it should be able to work out which parts to check against the English dictionary and which against the Arabic one.
Steps to Reproduce:
1. start pidgin or firefox with hunspell-ar and hunspell-en installed
2. type a mixed text like "Peace be upon those who follow guidance السلام على من اتبع الهدى"
Only one of the two parts (English or Arabic) will be checked.
Both should be checked.
Some scripts represent more than one language, including the Arabic script, which is also used by Persian, for example.
The spell check should work like this:
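for word in words:
    script = get_script(word)  # hypothetical helper: script of the word's first char
    n = get_number_of_installed_dicts_for_script(script)
    if n == 1: use_this_dict()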
Of course, get_number_of_installed_dicts_for_script should use some sort of cache.
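A minimal Python sketch of the loop above, with the cache added. The `INSTALLED` registry, `get_script` and `pick_dictionaries` are hypothetical names made up for illustration; only `get_number_of_installed_dicts_for_script` comes from this report. Script detection here is a rough stand-in based on Unicode character names, not what hunspell or glib actually does.

```python
import unicodedata
from functools import lru_cache

# Hypothetical registry: script name -> installed dictionary names.
INSTALLED = {"LATIN": ["en_US"], "ARABIC": ["ar"]}


def get_script(word):
    """Rough script guess: first token of the first character's Unicode name."""
    return unicodedata.name(word[0], "UNKNOWN").split()[0]


@lru_cache(maxsize=None)
def get_number_of_installed_dicts_for_script(script):
    # A real implementation would scan the dictionary directory here;
    # lru_cache makes that lookup happen only once per script.
    return len(INSTALLED.get(script, []))


def pick_dictionaries(words):
    """Map each word to a dictionary, or None when the choice is not unique."""
    result = {}
    for word in words:
        script = get_script(word)
        n = get_number_of_installed_dicts_for_script(script)
        result[word] = INSTALLED[script][0] if n == 1 else None
    return result
```

Words whose script has zero or several installed dictionaries map to `None`, which corresponds to leaving them unchecked.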
I reported this upstream
Well, something like this works fine in OpenOffice.org, where a language can be specified for each of the three blocks: CJK, CTL or Western. It's basically up to individual applications to tell hunspell which language is being used for each script.
yes, it works in openoffice.org
maybe because OO.o calls it in several passes
it would be better done in one pass within hunspell
I remember in F9 when Fedora made an effort to make all applications use unified dictionaries
this is similar, why should it be done in each application?
The hunspell spellchecker just takes text and a dictionary to use for that text.
To support multiple languages in a text, it really needs to be broken up and passed to hunspell in "text and matching dictionary" chunks. OOo has the concept of a language attribute for ranges of text, so it can do that, and it auto-categorizes text into three script types. So it can spell check LATIN text with e.g. en_US and CTL text with Arabic.
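As an illustration of the "text and matching dictionary" chunks, here is a hedged Python sketch of how an application could split text into runs of the same script before calling the spellchecker. `script_of` and `split_by_script` are hypothetical helpers, and the Unicode-name heuristic is only an approximation of real script detection; this is not how OOo does it.

```python
import unicodedata
from itertools import groupby


def script_of(word):
    """Rough script guess: first token of the first character's Unicode name."""
    return unicodedata.name(word[0], "UNKNOWN").split()[0]


def split_by_script(text):
    """Group consecutive words that share a script into (script, chunk) pairs."""
    return [
        (script, " ".join(run))
        for script, run in groupby(text.split(), key=script_of)
    ]
```

Each resulting chunk could then be passed to the spellchecker together with whatever dictionary the application has configured for that script.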
Some other applications (e.g. the evolution composer) don't have a concept of setting a language for specific ranges of text or types of script, which makes it difficult for them to do that. If hunspell itself takes text and splits it up, it needs to guess the dictionaries to use for each type of script, and it can only ever be a guess as many languages share scripts and there is only one locale entry for the "primary" script to indicate which it is.
A guess might be acceptable at e.g. application level, where there can be some means to override it, but at a library level in e.g. hunspell, it wouldn't be a particularly good idea in general for the existing hunspell APIs to break the text up automatically and guess which language it is. Though there is a libtextcat in Fedora which tries to guess the language of text, so it could be a fun project to give hunspell or enchant an additional API for "guess the languages in this text and spell-check them from this set of dictionaries", with the guessing done by libtextcat.
first let me thank you,
and yes, it can be done in each application, making every application reinvent the OO.o wheel
but it would be very simple to do it once here
please try to understand my approach
>Some other applications (e.g. the evolution composer) don't have a concept of
>setting a language for specific ranges of text or types of script, which makes
>it difficult for them to do that.
yes, this is the point: applications like pidgin, xchat, firefox, gedit and evolution
don't have that concept. Should we rewrite all the applications to do multi-pass spell checking, or should we fix the spellchecker?
>it needs to guess the dictionaries to use for each type of script, and it
>can only ever be a guess as many languages share scripts and there is only one
>locale entry for the "primary" script to indicate which it is.
the guess should be made using 1. the passed parameter 2. the user locale 3. the installed dictionaries
Say I have a text in English and Arabic and my locale is English,
and I have French, English and Arabic dictionaries installed.
As you know, English and French use the same script (alphabet, roughly speaking),
and Arabic and Persian use the same script.
For Latin script: English is used unless the user chooses French (as you said, most applications allow the user to set a language, currently one language).
For Arabic script: Arabic will be used because I don't have a Persian dictionary installed.
So all applications will work fine without needing to check which dictionaries are installed and set script ranges for pieces of text.
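The resolution order described above can be sketched in Python as follows. This is only an illustration under assumptions: `resolve_dictionary` and the layout of `installed` (script name mapped to a list of language codes) are hypothetical, and reading the three numbered sources as a strict priority order is an interpretation of the comment.

```python
def resolve_dictionary(script, requested, locale_lang, installed):
    """Pick a language for `script`, or None if the choice stays ambiguous.

    requested   -- languages passed by the application (zero or more)
    locale_lang -- language of the user's locale
    installed   -- hypothetical mapping: script -> installed language codes
    """
    candidates = installed.get(script, [])
    for lang in requested:           # 1. passed parameter
        if lang in candidates:
            return lang
    if locale_lang in candidates:    # 2. user locale
        return locale_lang
    if len(candidates) == 1:         # 3. the only installed dictionary
        return candidates[0]
    return None                      # ambiguous or nothing installed
```

With French, English and Arabic installed and an English locale, Latin text resolves to English unless French is requested, and Arabic text resolves to Arabic because it is the only candidate for that script.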
> The hunspell spellchecker just takes text and a dictionary to use for that text.
> ... general for the existing hunspell apis to break it up
yes, and I don't want to change that
My idea: hunspell should accept a piece of text and zero or more languages (each from a different script). This stays compatible with the current behavior because one language satisfies "zero or more", i.e. the new functionality can implement the old behavior in one line:
def check(text, lang): return check_new(text, [lang])
Anyway, let's assume it gets the sample text above and en as the language.
hunspell will know for sure that the Arabic words in the text should not be matched against the English dictionary, because each uses a different alphabet. That check is trivial: just ask glib for the script of the first char of the word, which roughly comes down to seeing which Unicode block that char comes from.
So my suggestion is to check whether any dictionary is installed for that script. If there is exactly one dictionary, it will check against it; if not, it will keep the old behavior, i.e. report that the word wasn't checked.
I.e. we only change the behavior of the application in places where it used to fail before, not in the places where it used to work.
And there is no need to do it in many passes.
The first char of each word should be checked.
If it's not from the script of an already loaded dictionary, load a new dictionary for that script. If that fails (more than one dictionary is installed and neither the given languages nor the locale can help...), the usual behavior just continues.
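A hedged, self-contained Python sketch of this single-pass idea follows. Everything in it is a toy: `WORDS` and `SCRIPT_OF_LANG` are hypothetical stand-ins for real dictionaries, `check_new` is the proposed (not existing) entry point, and none of this is hunspell's actual API.

```python
import unicodedata

# Toy stand-ins for installed dictionaries (hypothetical data).
WORDS = {"en": {"peace", "be", "upon"}, "ar": {"السلام", "على"}}
SCRIPT_OF_LANG = {"en": "LATIN", "ar": "ARABIC"}


def script_of(word):
    """Rough script guess: first token of the first character's Unicode name."""
    return unicodedata.name(word[0], "UNKNOWN").split()[0]


def check_new(text, langs):
    """Single pass over the text; returns a list of (word, verdict) pairs.

    verdict is True or False for checked words, and None for words left
    unchecked because no single dictionary matches their script.
    """
    loaded = {SCRIPT_OF_LANG[lang]: lang for lang in langs}  # script -> lang
    results = []
    for word in text.split():
        script = script_of(word)
        if script not in loaded:
            # Load lazily, and only when exactly one dictionary fits.
            fits = [l for l, s in SCRIPT_OF_LANG.items() if s == script]
            loaded[script] = fits[0] if len(fits) == 1 else None
        lang = loaded[script]
        verdict = None if lang is None else word.lower() in WORDS[lang]
        results.append((word, verdict))
    return results


def check(text, lang):
    """Old single-language entry point, expressed through the new one."""
    return check_new(text, [lang])
```

Words in a script with no unambiguous dictionary get `None`, which mirrors the existing "not checked" behavior, so only the previously failing cases change.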
Sure. Like I said it 'could be a fun project to give hunspell or enchant an additional api for "guess the languages in this text and spell-check them from this set of dictionaries"', or your alternative means. I'm just not volunteering to be the one to do it. Hence, closed->upstream :-)