Bug 197737 - german spellchecking does not work with UTF-8 encoding
Status: CLOSED NEXTRELEASE
Product: Fedora
Classification: Fedora
Component: emacs
Version: 5
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Chip Coldwell
Reported: 2006-07-05 17:28 EDT by Daniel Hammer
Modified: 2009-06-22 19:50 EDT (History)
Doc Type: Bug Fix
Last Closed: 2006-08-03 12:46:03 EDT

Attachments
Test case (159 bytes, application/octet-stream)
2006-07-12 15:41 EDT, Chip Coldwell
iso8859-1 (Latin-1) encoding of the test case (155 bytes, text/plain)
2006-07-17 11:47 EDT, Chip Coldwell
UTF-8 encoding of the test case (159 bytes, text/x-tex)
2006-07-17 12:07 EDT, Chip Coldwell
fix German spell checking for UTF-8 encoded buffers (1.07 KB, patch)
2006-11-03 10:02 EST, Chip Coldwell
ispell/emacs problem showbug (45.44 KB, image/png)
2009-06-22 19:50 EDT, Daniel

Description Daniel Hammer 2006-07-05 17:28:26 EDT
Description of problem:


Version-Release number of selected component (if applicable): 
FC5 vanilla + emacs; the same applies to FC6 test 1

How reproducible:
Write a .tex file containing German umlauts and try to spell check it. 
Emacs spellchecking divides each word into the part before the umlaut and the
part after the umlaut.
The same works fine on FC4.

Steps to Reproduce:
1. install FC5 with emacs and German language support
2. create a text with some German umlauts in it (e.g. a .tex file with the words
Gerüchte, Widerstände, Gefühl, Veränderung)
3. try to spell check it 
  
Actual results: words are not recognized; each word is cut into the part before
the umlaut and the part after the umlaut.
Changing the language setting in the emacs options makes no difference, and
replacing the current .emacs file with the one from FC4 (where it works fine)
does not help at all.

Expected results:
As in FC4, during the spell check each of the words above should be
recognized as a whole and should not be divided into two parts, one before the
umlaut and one after it.

Additional info:
$ env | grep -i lang
LANG=de_DE.UTF-8
Comment 1 Chip Coldwell 2006-07-10 10:38:29 EDT
(In reply to comment #0)
> Description of problem:
> 
> 
> Version-Release number of selected component (if applicable): 
> FC5 vanilla + emacs; the same applies to FC6 test 1
> 
> How reproducible:
> Write some .tex file with german Umlauts and try to spell check it. 
> Emacs spellchecking divides each word into the part before the Umlaut and the
> part after the Umlaut
> The same works fine for FC4
> 
> Steps to Reproduce:
> 1. install FC5 with emacs and german language support
> 2. create a text with some german umlauts in it (e.g. a .tex file with the words
> Gerüchte, Widerstände, Gefühl, Veränderung)
> 3. try to spell check it 

I'm not sure if this is supposed to work.  What happens if you restrict
yourself to the 7-bit ASCII charset, i.e. spell these words thus:

Ger\"uchte, Widerst\"ande, Gef\"uhl, Ver\"anderung

in TeX, and see if you can spellcheck that?

Chip
Comment 2 Daniel Hammer 2006-07-10 11:22:35 EDT
No, emacs + ispell only recognises each word up to the \" and again cuts it into
2 parts. Under FC4 I got spellchecking to work by including 

\usepackage[german]{babel}
\usepackage[utf8]{inputenc}

in my tex files. Since my keyboard has a US layout and I cannot type ü, ä, ...
directly, I first change each \"u, \"a, ... into ü, ä, ... so that
spellchecking in emacs works.
Comment 3 Chip Coldwell 2006-07-10 11:34:31 EDT
(In reply to comment #2)
> No, emacs + ispell would only recognise each word until \" and cut it again into
> 2 parts. I was lucky to include 
> 
> \usepackage[german]{babel}
> \usepackage[utf8]{inputenc}
> 
> in my tex files under FC4 to make spellchecking work. 

Could you send me the value of the variable ispell-dictionary-alist (C-h v,
then type the variable name)?

Chip
Comment 4 Daniel Hammer 2006-07-10 13:27:26 EDT
I have defined the following in ~/.emacs:

(setq ispell-local-dictionary-alist
 '(("german"
    "[a-zA-Z\"]" "[^a-zA-Z\"]" "[']" t ("-C") "~tex" iso-8859-1)
   ("german8"
    "[a-zA-ZÄÖÜäöüß]" "[^a-zA-ZÄÖÜäöüß]" "" t ("-d" "german") "~latin1"
    iso-8859-1)))

Is this what you meant above?
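
For illustration only, here is a minimal Python sketch of what the two regexps in the "german8" entry are for: the first (CASECHARS) lists the characters that may appear inside a word, the second (NOT-CASECHARS) the characters that end one. This is not Emacs Lisp and not how ispell.el is implemented; the helper name split_words is hypothetical. It only shows that, when the regexp and the text agree on what a character is, the umlauts stay inside the words.

```python
import re

# Character classes copied from the "german8" entry above.
CASECHARS = "[a-zA-ZÄÖÜäöüß]"
NOT_CASECHARS = "[^a-zA-ZÄÖÜäöüß]"


def split_words(text):
    """Return maximal runs of CASECHARS, i.e. the units treated as one word."""
    return re.findall(CASECHARS + "+", text)


line = "Gerüchte, Widerstände, Gefühl, Veränderung"

# Both views of the text give the same four words.
words = split_words(line)
words_by_separator = re.split(NOT_CASECHARS + "+", line)

print(words)
```

Here the text and the regexp are both Python character strings, so no encoding mismatch can arise.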
Comment 5 Chip Coldwell 2006-07-10 13:41:08 EDT
(In reply to comment #4)
> I have defined the following in ~/.emacs:
> 
> (setq ispell-local-dictionary-alist
>  '(("german"
>     "[a-zA-Z\"]" "[^a-zA-Z\"]" "[']" t ("-C") "~tex" iso-8859-1)
>    ("german8"
>     "[a-zA-ZÄÖÜäöüß]" "[^a-zA-ZÄÖÜäöüß]" "" t ("-d" "german") "~latin1"
> iso-8859-1)))
> 
> Is this what you meant above?

That variable overrides the default in ispell-dictionary-alist.  What happens if
you comment this out of your ~/.emacs?

Chip
Comment 6 Daniel Hammer 2006-07-11 09:21:32 EDT
Nothing, same results.
Comment 7 Chip Coldwell 2006-07-12 15:41:46 EDT
Created attachment 132327
Test case
Comment 8 Chip Coldwell 2006-07-12 15:55:00 EDT
I think I see what the problem is.  A dump of the characters in the test case
above gives this:

$ od -bc test.tex
0000000 134 144 157 143 165 155 145 156 164 143 154 141 163 163 173 155
          \   d   o   c   u   m   e   n   t   c   l   a   s   s   {   m
0000020 151 156 151 155 141 154 175 012 134 165 163 145 160 141 143 153
          i   n   i   m   a   l   }  \n   \   u   s   e   p   a   c   k
0000040 141 147 145 133 147 145 162 155 141 156 135 173 142 141 142 145
          a   g   e   [   g   e   r   m   a   n   ]   {   b   a   b   e
0000060 154 175 012 134 165 163 145 160 141 143 153 141 147 145 133 165
          l   }  \n   \   u   s   e   p   a   c   k   a   g   e   [   u
0000100 164 146 070 135 173 151 156 160 165 164 145 156 143 175 012 134
          t   f   8   ]   {   i   n   p   u   t   e   n   c   }  \n   \
0000120 142 145 147 151 156 173 144 157 143 165 155 145 156 164 175 012
          b   e   g   i   n   {   d   o   c   u   m   e   n   t   }  \n
0000140 107 145 162 303 274 143 150 164 145 054 040 127 151 144 145 162
          G   e   r 303 274   c   h   t   e   ,       W   i   d   e   r
0000160 163 164 303 244 156 144 145 054 040 107 145 146 303 274 150 154
          s   t 303 244   n   d   e   ,       G   e   f 303 274   h   l
0000200 054 040 126 145 162 303 244 156 144 145 162 165 156 147 012 134
          ,       V   e   r 303 244   n   d   e   r   u   n   g  \n   \
0000220 145 156 144 173 144 157 143 165 155 145 156 164 175 012 012
          e   n   d   {   d   o   c   u   m   e   n   t   }  \n  \n
0000237

From which I conclude that the UTF-8 encoding for \"u is \303\274 and for \"a is
\303\244 (octal escape sequences).  If I look in the ispell-dictionary-alist
CASECHARs regexp for the german8 dictionary, I find

[a-zA-Z\304\326\334\344\366\337\374]

In other words, the encoding of this regexp doesn't match the encoding of the
emacs buffer.

That is, at a very low level, why ispell is breaking up these words.
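
The mismatch can be reproduced outside Emacs with a short Python sketch. It is a byte-level analogy only (Emacs does not literally match raw bytes in its buffers); the helper name octal is hypothetical. The character class is the one quoted above, applied once to the UTF-8 bytes of a word and once to its Latin-1 bytes.

```python
import re

word = "Gerüchte"

utf8 = word.encode("utf-8")      # bytes as stored in the UTF-8 test file
latin1 = word.encode("latin-1")  # bytes the regexp below was written for


def octal(data):
    """Format bytes the way `od -b` does: three octal digits per byte."""
    return " ".join(f"{b:03o}" for b in data)


print(octal(utf8))
print(octal(latin1))

# Byte-level version of the CASECHARS class from ispell-dictionary-alist.
casechars = re.compile(rb"[a-zA-Z\304\326\334\344\366\337\374]+")

pieces_utf8 = casechars.findall(utf8)      # word is cut at the umlaut
pieces_latin1 = casechars.findall(latin1)  # word survives whole

print(pieces_utf8)
print(pieces_latin1)
```

Neither \303 nor \274 is in the class, so the UTF-8 form of the word is cut at exactly the place the report describes.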

Chip
Comment 9 Chip Coldwell 2006-07-12 16:08:24 EDT
(In reply to comment #8)

> [a-zA-Z\304\326\334\344\366\337\374]

Indeed, these are the encodings from the Latin-1 (ISO-8859-1) character set, not
UTF-8.

The problem, then, is that emacs is trying to break up UTF-8 words using a
Latin-1 regexp.  Hmm.
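
As a quick check of which character set those octal values belong to, the following Python sketch decodes them as ISO-8859-1 (Latin-1) and then tries to decode the same bytes as UTF-8. The variable names are illustrative.

```python
# Octal values taken from the CASECHARS regexp quoted above.
codes = [0o304, 0o326, 0o334, 0o344, 0o366, 0o337, 0o374]
raw = bytes(codes)

# Interpreted as ISO-8859-1, each byte is one German letter.
as_latin1 = raw.decode("latin-1")
print(as_latin1)

# Interpreted as UTF-8, the same bytes are not even a valid sequence.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)
```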

Perhaps you could try the following: visit your .emacs file in the usual way 

C-x C-f

and restore your custom ispell-local-dictionary-alist from comment #4 above,
then explicitly switch the buffer to the UTF-8 encoding by running

C-x RET f 

Then type 'utf-8-unix' at the prompt.  Now save your .emacs file in the usual way

C-x C-s

and restart emacs.  Let me know if that solves the problem.

Chip
Comment 10 Daniel Hammer 2006-07-12 21:30:02 EDT
Hi Chip, thanks for your efforts. But it doesn't work. Doing everything you
suggested gives the same results as before. I tried another thing:

 o change ä to ae and ö to oe and ü to ue
 o tried to spell check
 o result: Gerüchte, Widerstände, Gefühl, Veränderung

The words on the last line are the suggestions emacs offers when spellchecking,
both with the above addition and with the standard .emacs. Even when you create
a file without German umlauts and try to spellcheck it you'll get the same. ;-\
Comment 11 Chip Coldwell 2006-07-13 13:19:24 EDT
Emacs emits an interesting warning if you place the point on a word containing
an umlaut and run M-$ to spellcheck just that word:

"Ispell and its process have different character maps"

I've tracked down that error in the source, and I'm trying to grok its meaning.
 This seems to be independent of whether the buffer is using iso-8859-1
(Latin-1) or utf-8-unix encoding.

Chip
Comment 12 Daniel Hammer 2006-07-13 15:38:04 EDT
Yep, exactly. This may be related to bug 141968. Interestingly, try the
following under FC4 (where spell checking of umlauts still works):

(1) open your test.tex with emacs and check the spelling ==> everything is fine.

(2) now try aspell -c test.tex and you'll get the same results as under FC5
    within emacs.

So even under FC4, spell checking outside emacs is broken. 
I've replaced /usr/share/emacs/21.4/lisp/loaddefs.el on FC5 with the one from
FC4, with no effect. 
On FC4 and FC5 the emacs versions are emacs-21.4-5 and emacs-21.4-14
respectively. It might be interesting to know what exactly happened between
emacs versions 21.4-8 and 21.4-9. The aspell versions are aspell-0.50.5-6 and
aspell-0.60.3-5 respectively.

I've found also something interesting in 

http://www.kdstevens.com/stevens/ispell-faq.html

under "Umlats break words in ispell.el".
Comment 13 Chip Coldwell 2006-07-17 11:42:43 EDT
(In reply to comment #12)

> (2) now try aspell -c test.tex and you'll get the same results as under FC5
>     within emacs.

Bingo!  Now try this

LANG=de_DE.UTF-8 aspell -c --mode=tex --encoding=iso8859-1 test-iso8859-1.tex

or

LANG=de_DE.UTF-8 aspell -c --mode=tex test-utf-8.tex

and it works (the tex files in the two different encodings are attached).
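
For anyone who wants to recreate the two test files without downloading the attachments, here is a minimal Python sketch. The text is reconstructed from the od dump in comment #8; the variable names are illustrative. The byte counts it prints should agree with the attachment sizes listed at the top of this report.

```python
# Test case text, reconstructed from the od dump in comment #8.
TEST_CASE = (
    "\\documentclass{minimal}\n"
    "\\usepackage[german]{babel}\n"
    "\\usepackage[utf8]{inputenc}\n"
    "\\begin{document}\n"
    "Gerüchte, Widerstände, Gefühl, Veränderung\n"
    "\\end{document}\n"
    "\n"
)

utf8_bytes = TEST_CASE.encode("utf-8")         # contents of test-utf-8.tex
latin1_bytes = TEST_CASE.encode("iso-8859-1")  # contents of test-iso8859-1.tex

# Each of the four umlauts takes two bytes in UTF-8 and one in Latin-1.
print(len(utf8_bytes), len(latin1_bytes))
```

The bytes can be written out with open(name, "wb") under the file names used in the aspell commands above.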

Comment 14 Chip Coldwell 2006-07-17 11:47:19 EDT
Created attachment 132552
iso8859-1 (Latin-1) encoding of the test case
Comment 15 Chip Coldwell 2006-07-17 12:07:19 EDT
Created attachment 132555
UTF-8 encoding of the test case
Comment 16 Chip Coldwell 2006-07-17 12:10:57 EDT
Actually, even better is to get the LANG environment variable to match the
actual encoding of the file:

LANG=de_DE.iso88591 aspell -c --mode=tex test-iso8859-1.tex
LANG=de_DE.utf-8 aspell -c --mode=tex test-utf-8.tex

both work.

So I tried opening the test-iso8859-1.tex file in emacs, did

M-x setenv
LANG
de_DE.iso88591

and then 

M-x ispell-buffer

and it works just fine.
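
The point that LANG has to agree with the file's real encoding can be sketched in Python as follows. This is an illustration only, not what aspell or Emacs does internally; the CODECS table and the helpers codeset_of and lang_matches_file are hypothetical names made up for this example.

```python
# Hypothetical mapping from the codeset suffixes used above to Python codecs.
CODECS = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "iso88591": "latin-1",
    "iso8859-1": "latin-1",
}


def codeset_of(lang):
    """Return the Python codec implied by a LANG value such as 'de_DE.UTF-8'."""
    _, _, codeset = lang.partition(".")
    return CODECS[codeset.lower()]


def lang_matches_file(lang, data, expected_text):
    """True if decoding `data` with LANG's codeset yields the expected text."""
    try:
        return data.decode(codeset_of(lang)) == expected_text
    except UnicodeDecodeError:
        return False


text = "Gerüchte, Widerstände, Gefühl, Veränderung\n"
latin1_file = text.encode("latin-1")
utf8_file = text.encode("utf-8")

print(lang_matches_file("de_DE.iso88591", latin1_file, text))
print(lang_matches_file("de_DE.utf-8", utf8_file, text))
print(lang_matches_file("de_DE.UTF-8", latin1_file, text))
print(lang_matches_file("de_DE.iso88591", utf8_file, text))
```

Only the first two combinations, where the codeset in LANG is the one the file was actually written in, give back the original words.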

Chip
Comment 17 Daniel Hammer 2006-07-18 11:05:43 EDT
Okay; iso88591 works fine! ;-) 

But most files you create under FC5 are utf-8. The procedure described
above does not work when de_DE.iso88591 is replaced with de_DE.utf-8.

I've dropped my additions to ~/.emacs and now use vanilla FC5.

BTW, I've tried to spell check German umlauts with emacs under Ubuntu and Suse
... and it worked as well as it did on FC4. But 

(1) I don't like these distributions as much as I like Fedora
(2) they use approx. the same versions of emacs and aspell ... so it seems to 
    be FC specific.

It makes no sense for me to switch to FC5 if working with German or other
non-English tex files does not work anymore (I do not use office at all). 
FC6 test 1 unfortunately has the same behaviour as FC5.
Comment 18 Chip Coldwell 2006-07-27 10:10:06 EDT
I've made a little more progress.  Apparently, setting the LANG environment
variable isn't enough.  Once in emacs, run

C-h v "ispell-local-dictionary"

then customize the value of this variable to be "deutsch8" and save it for
future sessions (alternatively, put (setq ispell-local-dictionary "deutsch8")
in your .emacs).

Once you've done that, ispell-get-word uses the right regexp to find the
beginning and end of the word.  However, now I get the error message

Ispell and its process have different character maps

so apparently the aspell process is still being started with the latin-1 
character set.


Comment 19 Chip Coldwell 2006-07-28 10:22:39 EDT
I've done some more experiments.  If you choose the "deutsch8" dictionary
in emacs 21.4, it doesn't break words on umlauts, but you get the error "Ispell
and its process have different character maps".  If you choose the "deutsch"
dictionary, then it breaks words on umlauts but ispell/aspell runs happily.

On emacs-22, the "deutsch8" dictionary gives the "Ispell and its process have
different character maps" error, but the "deutsch" dictionary just works.

Comment 20 Chip Coldwell 2006-08-01 17:38:04 EDT
Dealing with multiple character encodings is *hard*.

It seems that what emacs does when you spellcheck a file that is utf-8 encoded
is that it transcodes it to iso-8859-1 (latin-1) before sending it to the ispell
(really aspell masquerading as ispell) process.  This is done in the function
src/process.c:send_process.  If you try to specify utf-8 encoding instead, then
the regular expression used to find the word in the buffer fails; it seems that
the emacs regex engine cannot handle multibyte encodings in character classes.

I think I have a solution, which is to pass the "--encoding=iso8859-1" switch to
aspell.  So if you select the "german8" or "deutsch8" dictionaries, the regex
will find words in the UTF-8 buffer correctly (matching them against an
iso8859-1 regex, mind you; there must be another transcoding going on there),
and with the "--encoding=iso8859-1" switch, aspell 0.60 will do the right thing
with the data sent from emacs.
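
A minimal Python sketch of why the switch matters, under the assumption (not confirmed anywhere in this report) that without it the receiving process decodes its input as UTF-8, for example because of the locale. The helper name receive is hypothetical.

```python
buffer_text = "Veränderung"  # a word from the UTF-8 encoded buffer

# Bytes the spellchecker receives when the text is transcoded to Latin-1 first.
sent = buffer_text.encode("iso-8859-1")


def receive(data, encoding):
    """Decode incoming bytes the way a process using `encoding` would."""
    return data.decode(encoding, errors="replace")


# Assumption: without the switch the receiver treats its input as UTF-8.
as_utf8 = receive(sent, "utf-8")

# With --encoding=iso8859-1 the receiver decodes the bytes as they were sent.
as_latin1 = receive(sent, "iso-8859-1")

print(repr(as_utf8))
print(repr(as_latin1))
```

In the first case the byte for ä is not valid UTF-8 and the word is damaged; in the second the receiver sees the word exactly as it appears in the buffer.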

Note that what changed here between FC-4 and FC-5 was NOT emacs -- it was
aspell.  If you install an old aspell version on FC-5 (say 0.50.5) this bug
doesn't happen.

I put some packages up at

http://people.redhat.com/coldwell/bugs/emacs/197737/

could you please test and verify that this fixes the bug?

Thanks,

Chip

Comment 21 Daniel Hammer 2006-08-03 02:29:06 EDT
Hi Chip, great, it does fix the problem. With the new packages the problem no
longer appears. Thanks for your help!
Comment 22 Chip Coldwell 2006-11-03 10:02:20 EST
Created attachment 140251
fix German spell checking for UTF-8 encoded buffers
Comment 23 Chip Coldwell 2006-11-06 11:17:28 EST
FC-5: fixed in version 21.4-15.1
devel: fixed in version 21.4-16
Comment 24 Daniel 2009-06-22 19:35:19 EDT
This bug appears to have shown up again in F11.
Comment 25 Daniel 2009-06-22 19:50:28 EDT
Created attachment 349016
ispell/emacs problem showbug
