2125163 – Compose support in Gtk3 and Gtk4 does not support special Arabic compose sequences

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	gtk4
Sub Component:
Version:	36
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kalev Lember
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-09-08 07:44 UTC by Mike FABIAN
Modified:	2023-01-12 08:21 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2023-01-12 07:44:07 UTC
Type:	Bug
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
GNOME Gitlab	GNOME gtk issues 5172	0	None	opened	input: Allow single-key compose sequences	2022-09-12 16:07:31 UTC

Description Mike FABIAN 2022-09-08 07:44:34 UTC

The compose support in ibus has the same problem, see: https://bugzilla.redhat.com/show_bug.cgi?id=2125153

The following compose sequences do not work in ibus:

    $ grep -i arabic.*ligature /usr/share/X11/locale/en_US.UTF-8/Compose
    # Arabic Lam-Alef ligatures
    <UFEFB>	:   "لا" # ARABIC LIGATURE LAM WITH ALEF
    <UFEF7>	:   "لأ" # ARABIC LIGATURE LAM WITH ALEF WITH HAMZA ABOVE
    <UFEF9>	:   "لإ" # ARABIC LIGATURE LAM WITH ALEF WITH HAMZA BELOW
    <UFEF5>	:   "لآ" # ARABIC LIGATURE LAM WITH ALEF WITH MADDA ABOVE

They are needed because the Arabic keyboard layout outputs UFEFB on some key:

    $ grep -i fefb /usr/share/X11/xkb/symbols/ara 
        key <AB05> {  [           UFEFB,                UFEF5,                  NoSymbol,            NoSymbol ]};  // ‎ﻻ‎ ‎ﻵ‎
        key <AB05> {  [           UFEFB,                UFEF5,                     U06AB,               U06AD ]};  // ‎ﻻ‎ ‎ﻵ‎     ‎ګ‎ ‎ڭ‎

But the U+FEFB character is not what is desired; what one really wants is U+0644 U+0627. However, xkb keyboard layouts can only output one keysym per key press, not two. So compose was used as a hack to work around this limitation of xkb:

The keyboard produces UFEFB and then the compose support replaces this with U+0644 U+0627.

This works when the compose support in Xorg is used but not when the compose support in ibus is used.
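
As a rough illustration of the hack described above, a single-key compose lookup could be sketched in Python like this. The table and function names are made up for the example. The four entries mirror the Lam-Alef lines quoted from the Compose file, and this is not how Xorg, ibus, or Gtk actually store their tables.

```python
# Hypothetical single-key compose table, keyed by the Unicode code point
# carried by the <UXXXX> keysym. Entries mirror the Compose file lines above.
SINGLE_KEY_COMPOSE = {
    0xFEFB: "\u0644\u0627",  # LAM + ALEF
    0xFEF7: "\u0644\u0623",  # LAM + ALEF WITH HAMZA ABOVE
    0xFEF9: "\u0644\u0625",  # LAM + ALEF WITH HAMZA BELOW
    0xFEF5: "\u0644\u0622",  # LAM + ALEF WITH MADDA ABOVE
}


def compose(code_point: int) -> str:
    """Return the composed text for a key, or the key's own character."""
    return SINGLE_KEY_COMPOSE.get(code_point, chr(code_point))


# The key that the Arabic layout maps to UFEFB yields two code points.
print([f"U+{ord(ch):04X}" for ch in compose(0xFEFB)])
# A key without a compose entry passes through unchanged.
print(compose(0x0062))
```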

How to reproduce:

1) First show that it works when using the Xorg compose support:

Start xterm like this (to disable ibus and use the Xorg compose support):

    env XMODIFIERS=@im=none xterm &

Then in the xterm, type

    $ echo -n b | iconv -f utf8 -t utf16le | od -x
    0000000 0062
    0000002

and we see that the b produces U+0062, which is correct.

Switch to the Arabic keyboard

    setxkbmap  ara

Press “arrow up” to get the echo -n b | iconv -f utf8 -t utf16le | od -x line back, go back to the b with “arrow left”, and type b. Now one gets:

    echo -n لا | iconv -f utf8 -t utf16le | od -x 
    0000000 0644 0627
    0000004

I.e. even though the keyboard surely outputs only U+FEFB, the Compose support of Xorg transforms this into U+0644 U+0627.

2) Now try to do this in gedit (Gtk3) and gnome-text-editor (Gtk4) using the Gtk compose support

   Gtk3: start gedit like this:
   
       env GTK_IM_MODULE=gtk-im-context-simple gedit

   Gtk4: start gnome-text-editor like this:

       env GTK_IM_MODULE=gtk-im-context-simple gnome-text-editor

Switch to the Arabic keyboard layout and type the key which has the label `b` on the US English layout.

In gedit and gnome-text-editor you get the character ﻻ U+FEFB ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

You can see that this is a single character: moving the cursor over it takes only one cursor step, and pressing Backspace deletes the whole character at once.

The correct result would be the sequence ل U+0644 ARABIC LETTER LAM ا U+0627 ARABIC LETTER ALEF which looks like: لا

I.e. this looks very similar to ﻻ U+FEFB ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM, but it is two characters instead of one. You notice that when moving the cursor over it (two arrow key presses are needed to move over that sequence) and when deleting with one Backspace: only ا U+0627 ARABIC LETTER ALEF is deleted and ل U+0644 ARABIC LETTER LAM remains.
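
The difference between the two results can also be checked without relying on cursor movement. The following Python sketch uses only the standard unicodedata module. The variable and function names are chosen for illustration only.

```python
import unicodedata

ligature = "\uFEFB"        # the single character described above for gedit
sequence = "\u0644\u0627"  # the two-character result the Compose file asks for


def describe(text: str) -> list[str]:
    """List every code point in text together with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


print(len(ligature), describe(ligature))
print(len(sequence), describe(sequence))

# U+FEFB carries a compatibility decomposition to LAM + ALEF, so NFKC
# normalization turns the ligature into the desired sequence.
print(unicodedata.normalize("NFKC", ligature) == sequence)
```

Pasting the text from the editor into such a script is a quick way to tell which of the two variants was actually inserted.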

I tried to look at the code for Compose support in Gtk4 and did not understand why it does not work. My guess is that only Compose sequences starting with either Multi_key or a dead key work, and Compose sequences starting with other characters do not. But when browsing the code, I could not understand why it is like that.

But using the above Arabic example it is easy to reproduce that it does not work.

Comment 1 Mike FABIAN 2022-09-08 17:11:25 UTC

Related old bugs:

https://bugzilla.gnome.org/show_bug.cgi?id=537457 Bug 537457 - Support compose sequences that produce two+ codepoints 
https://gitlab.gnome.org/GNOME/gtk/-/issues/186 Support compose sequences that produce two+ codepoints

Comment 2 Mike FABIAN 2022-09-09 15:55:53 UTC

For ibus-typing-booster I fixed it like this (was surprisingly easy):

https://github.com/mike-fabian/ibus-typing-booster/issues/379
https://github.com/mike-fabian/ibus-typing-booster/commit/c788401c794843a6b99c91a51f9cb67b32ffc86e

I just had to allow keys other than Multi_key and dead keys to start a sequence, and I had to add 0x01000000 when calculating the key value of keysyms of the type <UXXXX>. That was all.
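
The 0x01000000 offset is the X11 convention for keysyms that directly encode a Unicode code point. A minimal Python sketch of that calculation is given here. The function name and the regular expression are assumptions made for this example and are not taken from the ibus-typing-booster commit.

```python
import re

UNICODE_KEYSYM_OFFSET = 0x01000000

# Hypothetical pattern for names such as UFEFB (without the angle brackets).
_UNICODE_NAME = re.compile(r"^U([0-9A-Fa-f]{4,6})$")


def keysym_from_unicode_name(name: str) -> int:
    """Map a UXXXX name from a Compose file to its keysym value."""
    match = _UNICODE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a UXXXX keysym name: {name!r}")
    return UNICODE_KEYSYM_OFFSET + int(match.group(1), 16)


print(hex(keysym_from_unicode_name("UFEFB")))
```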

I hope it is equally easy in Gtk and ibus ...

Comment 3 Mike FABIAN 2022-09-11 09:07:00 UTC

<UFEFB> <UFEFB>	:   "لا" # ARABIC LIGATURE LAM WITH ALEF

This sequence actually works with ibus.

Doesn’t make sense of course because it makes you press the 'b' key twice to get the desired result of U+0644 U+0627.

But it shows that the problem in ibus is not that it cannot handle sequences starting with keys other than Multi_key and dead keys, and the problem is also not that it cannot handle <UXXXX> keysyms.

The problem in ibus seems to be that sequences which consist of only one single keysym don’t work.

https://github.com/ibus/ibus/blob/main/src/ibuscomposetable.c#L198

has:


    if (g_strv_length (words) < 2) {
        g_warning ("too few words; key sequence format is <a> <b>...: %s",
                   line);
        goto fail;
    }


Hm, maybe that is why sequences of length 1 don’t work?

Comment 4 Mike FABIAN 2022-09-11 17:33:24 UTC

No, 

   if (g_strv_length (words) < 2) { 

is not the reason why sequences of length 1 don't work. I didn’t read carefully:

    char **words = g_strsplit (seq, "<", -1);
    int i;
    int n = 0;

    if (g_strv_length (words) < 2) {

So g_strv_length (words) is already >= 2 if there is at least one "<" in the sequence.
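
The same effect can be reproduced with Python's str.split, which, like g_strsplit, yields an empty first element when a non-empty string starts with the delimiter. This is only an analogy to the C code quoted above, and the helper name is made up.

```python
def split_on_open_bracket(seq: str) -> list[str]:
    """Split a compose key sequence on "<", keeping empty pieces."""
    return seq.split("<")


one_key = split_on_open_bracket("<UFEFB>")
two_keys = split_on_open_bracket("<UFEFB> <UFEFB>")

# A single-key sequence already produces two words (an empty string plus
# "UFEFB>"), so a "fewer than 2 words" check does not reject it.
print(len(one_key), one_key)
print(len(two_keys), two_keys)
```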

Comment 5 Mike FABIAN 2022-09-12 07:56:40 UTC

I found that the following workaround works for Gtk3 (and ibus) but **not** for Gtk4:

- create a ~/.XCompose file (and make sure ~/.config/gtk-3.0/Compose does not exist so ~/.XCompose is really read!) with the following contents:

    $ cat ~/.XCompose
    include "/%L"
    <UFEFB>	:   "لا" # ARABIC LIGATURE LAM WITH ALEF

- Now try to reproduce as described in comment#0 : https://bugzilla.redhat.com/show_bug.cgi?id=2125163#c0

The bug is now "fixed" for Gtk3 and ibus (but not for Gtk4).

Comment 6 Mike FABIAN 2022-09-12 11:35:28 UTC

Fujiwara San found this suspicious line in Gtk4:

https://gitlab.gnome.org/GNOME/gtk/-/blob/main/gtk/gtkcomposetable.c#L507


      if (sequence[1] == 0)
        {
          remove_sequence = TRUE;
          goto next;
        }


Looks like this is removing compose sequences with a length of only 1 key.
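
To make the effect of that check concrete, here is a small Python sketch. It assumes, purely for illustration, that each sequence is a zero-terminated array of keysym values. The function name and the sample data are hypothetical and do not reproduce the Gtk code beyond the quoted condition.

```python
def drop_single_key_sequences(sequences: list[list[int]]) -> list[list[int]]:
    """Keep only zero-terminated keysym arrays that hold at least two keys."""
    kept = []
    for sequence in sequences:
        if sequence[1] == 0:  # mirrors the quoted check: slot 1 is the terminator
            continue
        kept.append(sequence)
    return kept


lam_alef = [0x0100FEFB, 0]          # <UFEFB>, a single key
e_acute = [0xFF20, 0x27, 0x65, 0]   # <Multi_key> <apostrophe> <e>

print(drop_single_key_sequences([lam_alef, e_acute]) == [e_acute])
```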

Comment 7 Matthias Clasen 2022-09-12 15:18:51 UTC

Yes, sequences of length 1 are not useful. Except for this hack that I knew nothing about. And it only works recently, since I added support for multi-character results.

Comment 8 Matthias Clasen 2022-09-12 15:28:15 UTC

Really ugly to use Compose sequences to patch up deficiencies in Unicode and xkb. If all you have is a hammer...

Comment 9 Matthias Clasen 2022-09-12 16:00:21 UTC

It's not trivial to fix either, because the table compression we use relies on splitting off the first key from the rest of the sequence. For single-key sequences, there's no rest...
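
To see why single-key sequences are awkward for a layout like that, here is a toy Python sketch that indexes sequences by their first key and stores only the remainder. It is an assumption-laden illustration of the idea in this comment, not Gtk's actual compressed table format, and all names are hypothetical.

```python
from collections import defaultdict


def index_by_first_key(sequences):
    """Group (keys, result) pairs by first keysym, storing only the rest."""
    table = defaultdict(list)
    for keys, result in sequences:
        first, rest = keys[0], keys[1:]
        table[first].append((rest, result))
    return dict(table)


table = index_by_first_key([
    ((0xFF20, 0x27, 0x65), "\u00e9"),    # <Multi_key> <apostrophe> <e>
    ((0x0100FEFB,), "\u0644\u0627"),     # <UFEFB>
])

# The single-key entry ends up with an empty remainder, which a format
# built around "first key plus rest" has to treat as a special case.
print(table[0x0100FEFB])
```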

Comment 10 Mike FABIAN 2022-09-12 18:38:56 UTC

(In reply to Matthias Clasen from comment #7)
> Yes, sequences of length 1 are not useful. Except for this hack that I knew
> nothing about. And it only works recently, since I added support for
> multi-character results.

In the X11 compose support it works as well, apparently for a long time already.

Comment 11 Mike FABIAN 2022-09-12 18:46:16 UTC

(In reply to Matthias Clasen from comment #8)
> Really ugly to use Compose sequences to patch up deficiencies in Unicode and
> xkb. If all you have is a hammer...

Yes, I agree. I was extremely surprised when I discovered this hack. 

When I first looked at the original bug that the Arabic keyboard doesn’t work well:

https://bugzilla.redhat.com/show_bug.cgi?id=2122899

I thought that this was unfixable because xkb can only output one char per keystroke. 

https://freedesktop.org/wiki/Software/XKeyboardConfig/XKB2Dreams/ (Random collection of ideas related to XKB2)

mentions:

     3. Support for scenarios "multiple keypresses - one keysym" and "single keypress - multiple keysyms". 

But that is a "dream" which might never be really implemented. Very unlikely any time soon I guess.

So, as I thought there was no way to fix this, I suggested using the ar-kbd input methods with ibus-m17n instead as a workaround.

But that has its own problems:

- Does not help if the Arabic user uses an AZERTY variant of the Arabic layout
- Typing Western digits 0, 1, 2, ... becomes impossible; only Arabic digits can be typed with ar-kbd (the Arabic xkb layouts can type both Western and Arabic digits)

Therefore, when I accidentally discovered this hack, I thought it was better to make this hack work than to change the default input methods for all Arabic locales.

Comment 12 Matthias Clasen 2022-09-13 15:50:11 UTC

https://bugzilla.redhat.com/show_bug.cgi?id=2125163

Comment 13 Matthias Clasen 2022-09-13 15:50:52 UTC

Grr, mispaste. What I wanted to say:

Fixed upstream in gtk4

Comment 14 fujiwara 2023-01-12 07:44:07 UTC

I think this is fixed in Fedora 37.

Comment 15 Mike FABIAN 2023-01-12 07:46:56 UTC

I tested
https://download.copr.fedorainfracloud.org/results/fujiwara/ibus/fedora-rawhide-x86_64/05219560-ibus/ibus-1.5.27-11.1.fc38.x86_64.rpm
in a Fedora rawhide VM and it fixes the problem.

Comment 16 Mike FABIAN 2023-01-12 08:18:59 UTC

Now I also tested

https://download.copr.fedorainfracloud.org/results/fujiwara/ibus/fedora-37-x86_64/05219560-ibus/ibus-1.5.27-11.1.fc37.x86_64.rpm

on Fedora 37, it fixes the problem there as well.

Comment 17 Mike FABIAN 2023-01-12 08:21:51 UTC

comment#15 and comment#16 were intended for https://bugzilla.redhat.com/show_bug.cgi?id=2125153

ibus had the same problem with Arabic compose sequences, now it is fixed in ibus as well.
