1050753 – (ibus-table-AI-problem) table corrupted for certain chinese characters -- possibly integer overflow problem

Keywords:
Status:	CLOSED ERRATA
Alias:	ibus-table-AI-problem
Product:	Fedora
Classification:	Fedora
Component:	ibus-table-chinese
Sub Component:
Version:	20
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Mike FABIAN
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-01-09 02:23 UTC by Bo-Yin Yang
Modified:	2014-06-12 06:30 UTC
CC List:	9 users
Fixed In Version:	ibus-table-1.8.2-1.fc20
Clone Of:
Environment:
Last Closed:	2014-06-12 06:30:12 UTC
Type:	Bug
Embargoed:
Dependent Products:

Description Bo-Yin Yang 2014-01-09 02:23:07 UTC

Description of problem:


Version-Release number of selected component (if applicable):
ibus-table-chinese-cangjie-1.4.6-3.fc20

How reproducible:
Every time

Steps to Reproduce:
1. remove ibus-table-chinese-cangjie
2. remove ~/.ibus/tables/cangjie5-user.db
3. reinstall ibus-table-chinese-cangjie
4. select ibus-table-chinese-cangjie5 input method
5. input bmr (月一口, the canonical encoding for the commonly used character 同)
   sufficiently many times.

Actual results: the preedit window
1.

Comment 1 Bo-Yin Yang 2014-01-09 02:27:05 UTC

Actual results: the preedit window
1.

Comment 2 Bo-Yin Yang 2014-01-09 02:32:47 UTC

The posted bug gets cut short due to an encoding problem.

Actual results: the preedit window
1.#(U+26688) 2.膈c 3.#c 4. .... 9. 冨w

with 1.#(U+26688) as the default character (here I put in a pound sign # in place of the Unicode characters that are not valid traditional Chinese characters, because those characters cause this page to be cut off).

I have to go to the next page to get the most common (by far) character 同
1.同

Expected results:
The only possible correct behavior is that a sequence in any cangjie Chinese input method such as cangjie5 gives the most commonly used Chinese character with that encoding (e.g., 丢 for the sequence hgi) as the first in the listing, and hitting the spacebar at this point should result in the input of that Chinese character.

in the preedit window,
1.同 should be the default character

Additional info:
This bug is similar to Bug#1045832

There are two critical, fail-safe devices which must be included with any such input methods, especially open-source ones.

1. Any character which has that precise key sequence as the encoding must be listed ahead of any other character. That is, if 同 and U+26688 are the two (and the only two) Chinese characters whose encodings are "bmr", then input of bmr must list 同 and U+26688 as the first two characters in the preedit window for choice of characters, no matter how unlikely to appear 同 and U+26688 might be deemed by the data table. That the encoding matches the input sequence must outweigh any other factor in the AI.

2. There must be included with the input method a manual override hot key that SELECTS THE CURRENT INPUT CHOICE AND MAKES THAT PERSISTENT FOR ALL FOLLOWING APPEARANCES. All AIs have bugs; manual overrides are the most important feature of any AI program or algorithm.

Cangjie, for those who do not know Chinese input methods, has as a critical advantage that it does not require a choice-of-input-characters. Each character has a single canonical sequence for which the typist need never ever look up to the screen to check if the correct character is selected. This item restores the single most important feature of the Cangjie input method.

Further, there is clearly an integer overflow bug in that sometimes the most important character associated with an input sequence goes to the bottom of the pile.

This issue is common and endemic to all Chinese input methods since Microsoft Zhuyin (known affectionately in Taiwan as "ㄅ半") and Microsoft new Zhuyin. However, with Microsoft input methods, after a few iterations the correct character or sequence "resurfaces" and bubbles up to the top of the input queue again as the default choice. In ibus-table-cangjie* the correct character simply cannot be "pulled up" to the top of the heap where it belongs. If Linux programmers don't improve input methods for Chinese, Linux will never be taken seriously in Mandarin speaking areas.

Comment 3 Bo-Yin Yang 2014-01-10 00:21:33 UTC

Please excuse my emotional words, I was annoyed at bugzilla eating my comments several times.

Comment 4 Jens Petersen 2014-03-19 06:29:04 UTC

BTW you might want to also try ibus-cangjie.

Comment 5 Jens Petersen 2014-03-19 06:49:31 UTC

Sorry I didn't read all three bugs carefully,
but are bug 1058029, bug 1045832, and this one all the same bug?

Comment 6 Mike FABIAN 2014-03-19 06:56:56 UTC

(In reply to Bo-Yin Yang from comment #3)
> Please excuse my emotional words, I was annoyed at bugzilla eating my
> comments several times.

I can understand that, it has hit me a few times as well that bugzilla
cuts comments short at the first occurrence of a character above the BMP
(Unicode character with a code point > U+FFFF).
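
A possible workaround, sketched in Python: replace every character above U+FFFF with its U+XXXX notation before pasting text into a comment. The helper name escape_non_bmp is made up for this illustration; it is not part of bugzilla or ibus-table.

```python
def escape_non_bmp(text):
    """Replace characters above the BMP (> U+FFFF) with 'U+XXXXX' notation."""
    return "".join(
        ch if ord(ch) <= 0xFFFF else f"U+{ord(ch):04X}"
        for ch in text
    )


# The lookup table line from this report, with the real character in place.
escaped = escape_non_bmp("1.\U00026688 2.膈")
print(escaped)
```

Characters inside the BMP, such as 膈, pass through unchanged.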

I am on vacation until 2014-03-19, I’ll have a closer look at this bug when I am back.

Comment 7 Mike FABIAN 2014-03-19 07:02:35 UTC

By the way, what version of ibus-table do you have installed?

Comment 8 Mike FABIAN 2014-03-19 08:34:42 UTC

The problem is reproducible both with

ibus-table-1.5.0.20130419-3.fc20.noarch

and

ibus-table-1.5.0.20140312-2.fc20.noarch

i.e. it is not a problem I introduced by porting ibus-table
to Python3 recently.

Comment 9 Fedora Update System 2014-05-27 06:47:08 UTC

ibus-table-1.5.0.20140527-1.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/ibus-table-1.5.0.20140527-1.fc20

Comment 10 Mike FABIAN 2014-05-27 06:52:35 UTC

(In reply to Fedora Update System from comment #9)
> ibus-table-1.5.0.20140527-1.fc20 has been submitted as an update for Fedora
> 20.
> https://admin.fedoraproject.org/updates/ibus-table-1.5.0.20140527-1.fc20

ibus-table-chinese and ibus-table-others need to be rebuilt against
ibus-table-1.5.0.20140527-1.fc20 because I changed the format of the
database.

Comment 11 Mike FABIAN 2014-05-27 08:04:39 UTC

(In reply to Mike FABIAN from comment #10)
> (In reply to Fedora Update System from comment #9)
> > ibus-table-1.5.0.20140527-1.fc20 has been submitted as an update for Fedora
> > 20.
> > https://admin.fedoraproject.org/updates/ibus-table-1.5.0.20140527-1.fc20
> 
> ibus-table-chinese and ibus-table-others need to be rebuild against
> ibus-table-1.5.0.20140527-1.fc20 because I changed the format of the
> database.

Here are all the updates which work together because ibus-table-chinese
and ibus-table-others have been rebuilt against the new ibus-table:

https://admin.fedoraproject.org/updates/ibus-table-1.5.0.20140527-1.fc20

https://admin.fedoraproject.org/updates/ibus-table-chinese-1.4.6-4.fc20

https://admin.fedoraproject.org/updates/ibus-table-others-1.3.0.20140512-2.fc20

Comment 12 Mike FABIAN 2014-05-27 18:04:46 UTC

I made a lot of changes and bug fixes to ibus-table and cleaned up
the code so that it can be maintained and fixed more easily in future.

I mentioned the new packages in comment#11:


https://admin.fedoraproject.org/updates/ibus-table-1.5.0.20140527-1.fc20

https://admin.fedoraproject.org/updates/ibus-table-chinese-1.4.6-4.fc20

https://admin.fedoraproject.org/updates/ibus-table-others-1.3.0.20140512-2.fc20

Now I’ll go through your report point for point to check
what has been improved and what is still broken:

(In reply to Bo-Yin Yang from comment #2)
> The posted bug gets cut short due to an encoding problem.
> 
> Actual results: the preedit window
> 1.#(U+26688) 2.膈c 3.#c 4. .... 9. 冨w
> 
> with 1.#(U+26688) as the expected default character (here I put in a pound
> sigh # to mean that the unicode  character are not valid traditional chinese
> characters which causes this page to be cut off).

U+26688 is a valid Chinese character, but unfortunately bugzilla cuts
comments at the first appearance of any character above U+FFFF.

> have to go to the next page to get the most common (by far) character 同
> 1.同 
> 
> Expected results: 
> The only possible correct behavior is that a sequence in any cangjie Chinese
> input method such as cangjie5 gives the most commonly used Chinese character
> with the encoding (e.g., 丢 for the sequence hgi) as the first in the
> listing, and hitting the spacebar at this point should result the input of
> that Chinese character.  
> 
> in the preedit window,
> 1.同 should be the default character

> Additional info:
> This bug is similar to Bug#1045832
> 
> There are two critical, fail-safe devices which must be included with any
> such input methods, especially open-source ones.
> 
> 1. Any character which has that precise key sequence as the encoding must be
> listed ahead of any other character. That is, if 同 and U+26688 are the two
> (and the only two) Chinese characters whose encodings are "bmr", then input
> of bmr must list 同 and U+26688 as the first two characters in the preedit
> window for choice of characters, no matter how unlikely to appear 同 and 
> might be deemed by the data table.

The new ibus-table behaves like that now, in case of typing “bmr”,
the only two exact matches 同 and U+26688 are now always shown on top.

> That the encoding matches the input sequence must outweigh any other
> factor in the AI.

I implemented this, exact matches are now *always* shown on top.

But in case of 同 and U+26688, 同 is not necessarily the first one
at the moment. In the source of the cangjie5.txt table, all characters
have the exactly same system frequency of 1000, i.e. the sorting by
the system frequency which ibus-table does, does not do anything for
cangjie5.

So which one comes first depends on which Chinese mode the user
chooses *and* which one the user has selected most often previously
(the number of times the user has selected a character is the “user
frequency”).

(Note: If you want to see the frequencies, which can be useful
for debugging purposes, you can remove the comment character # in front of 

    export IBUS_TABLE_DEBUG_LEVEL=1

in /usr/libexec/ibus-engine-table, then the system frequencies and
the user frequencies for the candidates are shown in the lookup table)

ibus-table has 5 Chinese modes:

1) Simplified Chinese
2) Traditional Chinese
3) Simplified Chinese before traditional
4) Traditional Chinese before simplified
5) All Chinese characters

同 U+540C is simplified Chinese. Unihan_Variants.txt
from http://www.unicode.org/Public/UCD/latest/ucd/Unihan.zip
contains:

    U+540C	kTraditionalVariant	U+8855
    U+8855	kSimplifiedVariant	U+540C

That means, 同 U+540C is a “used in SC only” character (see:
http://www.unicode.org/reports/tr38/index.html#SCTC ) whose
traditional variant is 衕 U+8855 (which has the cangjie5 code
“hobrn”).

A character which is used in both SC and TC like 台 U+53F0
is listed as both SC and TC in Unihan_Variants.txt:

    U+53F0	kSimplifiedVariant	U+53F0
    U+53F0	kTraditionalVariant	U+53F0 U+6AAF U+81FA U+98B1
    U+6AAF	kSimplifiedVariant	U+53F0
    U+81FA	kSimplifiedVariant	U+53F0
    U+98B1	kSimplifiedVariant	U+53F0

So 同 U+540C is a “used in SC only” character.

The other character with the cangjie5 code “bmr”, U+26688
(http://www.zdic.net/z/9e/js/26688.htm) is not listed in
Unihan_Variants.txt, which means it is not different between SC and
TC, i.e. it counts as “used in both SC and TC”.
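
The classification described above can be sketched in Python as follows. This is only an illustration, not the actual ibus-table code: the function names and the rule "a character counts as used in SC unless it lists a kSimplifiedVariant other than itself" (and likewise for TC) are my simplification of the cases in TR38, and the sample contains only the lines quoted in this comment.

```python
from collections import defaultdict

# Only the Unihan_Variants.txt lines quoted in this comment (tab separated).
SAMPLE = (
    "U+540C\tkTraditionalVariant\tU+8855\n"
    "U+8855\tkSimplifiedVariant\tU+540C\n"
    "U+53F0\tkSimplifiedVariant\tU+53F0\n"
    "U+53F0\tkTraditionalVariant\tU+53F0 U+6AAF U+81FA U+98B1\n"
)


def parse_variants(text):
    """Map code point -> {field name: set of variant code points}."""
    variants = defaultdict(dict)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        source, field, values = line.split("\t", 2)
        if field in ("kSimplifiedVariant", "kTraditionalVariant"):
            # Newer Unihan releases may append "<source" tags; drop them.
            variants[int(source[2:], 16)][field] = {
                int(value.split("<")[0][2:], 16) for value in values.split()
            }
    return variants


def classify(code_point, variants):
    """Return (used_in_sc, used_in_tc) under the simplified rule above."""
    entry = variants.get(code_point, {})
    simplified = entry.get("kSimplifiedVariant")
    traditional = entry.get("kTraditionalVariant")
    used_in_sc = simplified is None or code_point in simplified
    used_in_tc = traditional is None or code_point in traditional
    return used_in_sc, used_in_tc


variants = parse_variants(SAMPLE)
for cp in (0x540C, 0x8855, 0x53F0, 0x26688):
    print(f"U+{cp:04X}", classify(cp, variants))
```

With this rule, a character that is missing from the file (like U+26688) counts as used in both SC and TC, which matches the behaviour described above.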

So what happens when “bmr” is typed in the above Chinese modes 1)-5)?:

Chinese mode 1) simplified Chinese:
===================================

only characters which are used in simplified Chinese (and maybe in
traditional Chinese as well) are shown, i.e. characters which are
*only* used in traditional Chinese are not shown at all.

As both 同 U+540C *and* U+26688 are considered to be “used in SC”,
they are both shown. With “export IBUS_TABLE_DEBUG_LEVEL=1” the lookup
table shows:

    月一口 (1/9)
    1. U+26688 1000 0
    2. 同      1000 0
    3. 膈 月    1000 0
    ... more matches ...

(Side note: On top of the lookup table there is “月一口 (1/9)”.  For
tables which define prompt characters, like cangjie 5, I display the
prompt characters when the user types, i.e. I display “月一口”
instead of “bmr”. “(1/9)” means that the first candidate out of 9
is selected, 9 is the number of candidates in the first page. On page
down, more candidates are added and that number increases. In the item
3. in the lookup table, “膈 月” is shown. “膈” is the candidate.
“月” is the key which is still missing from the user input to make
this a complete match, i.e. the cangjie5 code for “膈” is “月一口月”
(= “bmrb”))

The lookup table is sorted by the following keys in that order,
if there is a tie, the next sort keys are tried:

   - exact matches on top
   - user frequency descending
   - system frequency descending
   - length of necessary user input for this character ascending

So U+26688 “happens” to be the first match as the system frequency data
for both is the same, the user frequency is the same (0, not yet used),
and the length of the user input is also the same.

After selecting 同 once, 1 will be added to its user frequency. Therefore,
when typing “bmr” again, 同 will be on top and the lookup table
will look like:

    月一口 (1/9)
    1. 同      1000 1
    2. U+26688 1000 0
    3. 膈 月    1000 0
    ... more matches ...
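
The sort order above can be sketched in Python like this. It is a minimal illustration, not the actual ibus-table code; the Candidate type and the function names are hypothetical. Python's sort is stable, so candidates which tie on all keys keep the order in which they came out of the table.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    char: str         # candidate character
    code: str         # full key sequence of the character in the table
    system_freq: int  # frequency from the table source
    user_freq: int    # how often the user has selected this character


def sort_key(cand, typed):
    return (
        cand.code != typed,  # exact matches on top (False sorts first)
        -cand.user_freq,     # user frequency descending
        -cand.system_freq,   # system frequency descending
        len(cand.code),      # length of necessary user input ascending
    )


def lookup(table, typed):
    matches = [c for c in table if c.code.startswith(typed)]
    return sorted(matches, key=lambda c: sort_key(c, typed))


table = [
    Candidate("\U00026688", "bmr", 1000, 0),
    Candidate("同", "bmr", 1000, 0),
    Candidate("膈", "bmrb", 1000, 0),
]
before = [c.char for c in lookup(table, "bmr")]

# Selecting 同 once adds 1 to its user frequency.
table[1] = table[1]._replace(user_freq=1)
after = [c.char for c in lookup(table, "bmr")]
print(before, after)
```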

Chinese mode 2), traditional Chinese:
=====================================

only characters which are used in traditional Chinese (and maybe in
simplified Chinese as well) are shown, i.e. characters which are
*only* used in simplified Chinese are not shown at all.

As 同 U+540C is “only used in SC” according to Unihan_Variants.txt,
同 U+540C is never shown in the lookup table in Chinese mode 2), only
U+26688 is shown (on top, of course) when typing “bmr” in Chinese mode
2).
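
The filtering in Chinese modes 1) and 2) amounts to something like the following Python sketch. The tuple layout and the mode names are made up for the illustration; the SC/TC flags for the two characters are the ones derived from Unihan_Variants.txt above.

```python
def filter_for_mode(candidates, mode):
    """candidates: list of (char, used_in_sc, used_in_tc) tuples."""
    if mode == "simplified":
        # Drop characters which are *only* used in traditional Chinese.
        return [c for c in candidates if c[1]]
    if mode == "traditional":
        # Drop characters which are *only* used in simplified Chinese.
        return [c for c in candidates if c[2]]
    return list(candidates)


exact_matches = [
    ("\U00026688", True, True),  # not in Unihan_Variants.txt: SC and TC
    ("同", True, False),          # classified as "used in SC only"
]
mode1 = [c[0] for c in filter_for_mode(exact_matches, "simplified")]
mode2 = [c[0] for c in filter_for_mode(exact_matches, "traditional")]
print(mode1, mode2)
```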

Chinese mode 3), Simplified Chinese before traditional:
=======================================================

After first sorting the candidates as described above, i.e.

   - exact matches on top
   - user frequency descending
   - system frequency descending
   - length of necessary user input for this character ascending

The result is split into 2 lists: “a) exact matches” + “b) incomplete matches”
and for both lists a) and b) the following is done:

The characters which are *only* used in traditional Chinese are
filtered out and then appended at the end of the
candidate list. So the final result is

    a1) exact matches used in SC (and maybe in TC as well)
    a2) exact matches used *only* in TC
    b1) incomplete matches used in SC (and maybe in TC as well)
    b2) incomplete matches used *only* in TC

As both 同 and U+26688 count as “used in SC”, Chinese mode 3)
behaves the same as Chinese mode 1) as far as the order of these two
characters is concerned.

Chinese mode 4), Traditional Chinese before simplified:
=======================================================

similar to Chinese mode 3), the resulting candidate list is:

    a1) exact matches used in TC (and maybe in SC as well)
    a2) exact matches used *only* in SC
    b1) incomplete matches used in TC (and maybe in SC as well)
    b2) incomplete matches used *only* in SC

Here, as 同 is considered a “used *only* in SC” character, it will
be in a2) whereas U+26688 is considered to be used in both and will be
in a1). I.e. in Chinese mode 4), U+26688 will *always* be shown before
同, no matter how often 同 is selected because the “used *only* in
SC” characters will always be in a2) in Chinese mode 4).
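
As a Python sketch (again only an illustration with hypothetical names, not the real code), the reordering in Chinese modes 3) and 4) is a stable partition applied separately to the exact and the incomplete matches. The flags for 膈 are an assumption made for the example.

```python
def reorder(sorted_candidates, typed, prefer):
    """Reorder an already sorted candidate list for mode 3) or 4).

    prefer is "simplified" (mode 3) or "traditional" (mode 4).
    """
    def preferred(cand):
        return cand["in_tc"] if prefer == "traditional" else cand["in_sc"]

    exact = [c for c in sorted_candidates if c["code"] == typed]
    incomplete = [c for c in sorted_candidates if c["code"] != typed]
    result = []
    for group in (exact, incomplete):
        # a1)/b1): characters used in the preferred variety, order kept
        result += [c for c in group if preferred(c)]
        # a2)/b2): characters used *only* in the other variety
        result += [c for c in group if not preferred(c)]
    return result


# 同 comes first after the normal sort because it was selected 17 times.
candidates = [
    {"char": "同", "code": "bmr", "in_sc": True, "in_tc": False},
    {"char": "\U00026688", "code": "bmr", "in_sc": True, "in_tc": True},
    {"char": "膈", "code": "bmrb", "in_sc": True, "in_tc": True},
]
mode3 = [c["char"] for c in reorder(candidates, "bmr", "simplified")]
mode4 = [c["char"] for c in reorder(candidates, "bmr", "traditional")]
print(mode3, mode4)
```

In mode 4) the user frequency of 同 does not help, because the partition is applied after the sort.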

Chinese mode 5) All Chinese characters:
=======================================

When building the candidate list, it does not matter at all whether a
character is SC or TC or both, characters are neither omitted from the
candidate list nor is the order changed depending on SC or TC.

I.e. all characters matching the input sequence from the cangjie5 table
are added to the candidate list and the sort order is only:

   - exact matches on top
   - user frequency descending
   - system frequency descending
   - length of necessary user input for this character ascending

In case of typing “bmr”, this means that both 同 U+540C *and*
U+26688 are shown and when neither of them has been selected before,
the lookup table will look like:

    月一口 (1/9)
    1. U+26688 1000 0
    2. 同      1000 0
    3. 膈 月    1000 0
    ... more matches ...
    
This happens to be the same as in Chinese mode 1) as far as these
two characters are concerned. Selecting 同 will increase its user
frequency by 1 and the next time “bmr” is typed, 同 will be on top.

> 2. There must be included with the input method a manual override hot key
> that SELECT THE CURRENT INPUT CHOICE AND MAKE THAT PERSISTENT FOR ALL
> FOLLOWING APPEARANCES.  All AIs have bugs, manual overrides are the most
> important features of any AI programs or algorithm.

I think the adding of 1 to the user frequency for each use does this.
If one uses 同 but never U+26688, 同 will be on top after the first use
already. If one uses both characters, the character which has been
used most often will be on top. That seems reasonable to me.

As you can see in my above sort order, user frequency is used as a
sort key before system frequency, i.e. system frequency only matters
if there is a tie in the user frequency.

There is also the possibility to remove that manual override again:

If the lookup table looks for example like

    月一口 (1/9)
    1. 同      1000 17
    2. U+26688 1000 5
    3. 膈 月    1000 0
    ... more matches ...

because 同 has been used 17 times and U+26688 has been used 5 times,
one can type Alt+Number to reset the user frequency data for that
character (*if* possible!). I.e. when typing Alt+1 in the above case,
the user frequency data for 同 will be deleted, which may cause 同 not
to be on top anymore when “bmr” is typed next.

If Alt+Number is typed for a character where there is no user data
yet, for example Alt+3 in the above example, nothing happens.  So it
never hurts trying to type Alt+Number if something shows up too high
in the lookup table, but it will only work when that candidate is at
that high position because it has been used already. Using this
Alt+Number, one can correct it if characters show on top only because
they have been selected by mistake.
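
A rough Python sketch of what Alt+Number does, assuming the user frequencies are kept in a plain dictionary (the real code stores them in the user database; the function name is hypothetical):

```python
def reset_user_frequency(lookup_table, user_freq, number):
    """Handle Alt+Number: forget the user frequency of one candidate.

    lookup_table: characters as shown, number: 1-based position.
    Returns True if user data was deleted, False if nothing happened.
    """
    if not 1 <= number <= len(lookup_table):
        return False
    char = lookup_table[number - 1]
    if user_freq.get(char, 0) == 0:
        # No user data for this candidate yet: nothing to remove.
        return False
    del user_freq[char]
    return True


shown = ["同", "\U00026688", "膈"]
user_freq = {"同": 17, "\U00026688": 5}

alt_3 = reset_user_frequency(shown, user_freq, 3)  # 膈 was never selected
alt_1 = reset_user_frequency(shown, user_freq, 1)  # deletes the data for 同
print(alt_3, alt_1, user_freq)
```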

> Cangjie, for those who do not know Chinese input methods, has as a critical
> advantage that it does not require a choice-of-input-characters.  Each
> character has a single canonical sequence for which the typist need never
> ever look up to the screen to check if the correct character is selected. 
> This item restores the single most important feature of the Cangjie input
> method.
> 
> Further, there is clearly an integer overflow bug in that sometimes the most
> important character associated with an input sequence goes to the bottom of
> the pile.  

I don’t think it had anything to do with integer overflow. But of
course the old code was broken in various ways, it did a very bad
job in sorting the candidate list and remembering what the user typed
often. I think my new code should be much better.

While writing this, I got one more idea:

My current sort keys for the lookup table

   - exact matches on top
   - user frequency descending
   - system frequency descending
   - length of necessary user input for this character ascending

cannot distinguish between 同 and U+26688 when they have never been
used because all of the above sort keys compare both characters the
same. Both are exact matches for “bmr”, both have the user frequency
0 if they have not been used, both have the system frequency 1000. The
last sort key “length of necessary user input” matters only for
incomplete matches.

The “obvious” and probably best way to have the most common of the
exact matches appear on top even when first used would be to fix the
system frequency data in the cangjie5.txt. I.e. change the lines

    bmr	同	1000
    bmr	U+26688	1000

in the source cangjie5.txt to

    bmr	同	1001
    bmr	U+26688	1000

But then somebody who really knows which characters are most common
needs to go through cangjie5.txt and edit these frequencies (I cannot
do this).

I wonder whether the order the characters are listed in cangjie5.txt
is just random or is significant. For example, there are 3 lines
in the following order:

    bmso	豚	1000
    bmso	冢	1000
    bmso	䐁	1000

Does this order in cangjie5.txt have any meaning? Is 豚 the most
common character among the characters typed “bmso”? And “冢” the
second most common one? In this case my guess is that this is indeed
the case.

If somebody can confirm that this is always the case in cangjie5.txt,
I can use the original order in the source to decide which character
should be on top if they are all exact matches and the system frequency
is the same.

Can anybody tell me whether the order of the characters with the same
input key sequence in the source of cangjie5.txt has anything to do
with the frequency of use of the characters or not? Here is the source:

https://github.com/definite/ibus-table-chinese/blob/master/tables/cangjie/cangjie5.txt
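
If the source order turns out to be meaningful, using it as a tie-breaker could look like this Python sketch. It assumes that the relevant lines of cangjie5.txt are tab separated as shown above; the field name source_index and the function names are made up for the illustration.

```python
# The three "bmso" lines quoted above, in source order.
SOURCE = "bmso\t豚\t1000\nbmso\t冢\t1000\nbmso\t䐁\t1000\n"


def load_entries(text):
    entries = []
    for index, line in enumerate(text.splitlines()):
        code, char, freq = line.split("\t")
        entries.append({
            "code": code,
            "char": char,
            "system_freq": int(freq),
            "source_index": index,  # position in the source file
        })
    return entries


def sort_key(entry, typed):
    return (
        entry["code"] != typed,  # exact matches on top
        -entry["system_freq"],   # system frequency descending
        entry["source_index"],   # earlier in the source wins a tie
    )


# Even if the entries arrive in another order (here: reversed),
# the source order is restored for entries with equal frequency.
entries = list(reversed(load_entries(SOURCE)))
ordered = [e["char"] for e in sorted(entries, key=lambda e: sort_key(e, "bmso"))]
print(ordered)
```

The user frequency is left out here to keep the sketch short; it would still come before the system frequency.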

*If* the order of the characters in that source file cangjie5.txt
is random and meaningless, then I have a last ditch idea:

I could add another sort key “Unicode code point” like this:

   - exact matches on top
   - user frequency descending
   - system frequency descending
   - Unicode code point
   - length of necessary user input for this character ascending

That would probably help in the cases where there is only one
character among the exact matches with a code point < U+FFFF because
the characters with code points > U+FFFF are probably rare characters.
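
In Python, that extra sort key would be something like the sketch below. I assume ascending code point order here, so that characters in the BMP come before characters above U+FFFF; the function name and argument layout are hypothetical.

```python
def sort_key(typed, char, code, user_freq, system_freq):
    return (
        code != typed,  # exact matches on top
        -user_freq,     # user frequency descending
        -system_freq,   # system frequency descending
        ord(char),      # Unicode code point ascending (assumption)
        len(code),      # length of necessary user input ascending
    )


# (char, code, user_freq, system_freq), nothing selected yet
candidates = [
    ("\U00026688", "bmr", 0, 1000),
    ("同", "bmr", 0, 1000),
    ("膈", "bmrb", 0, 1000),
]
ordered = [
    c[0]
    for c in sorted(candidates, key=lambda c: sort_key("bmr", c[0], c[1], c[2], c[3]))
]
print(ordered)
```

This puts 同 (U+540C) before U+26688 on first use, but it would not help to decide between two characters which are both in the BMP, like 皮 and 板.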

> This issue is common and endemic to all Chinese input methods since
> Microsoft Zhuyin (known affectionately in Taiwan as "ㄅ半") and Microsoft new
> Zhuyin.  However, with microsoft input methods, after a few iterations the
> correct character or sequence "resurfaces" and bobbles up to the top of the
> input queue again as the default choice.

Now this should work in ibus-table! After using a character, it should
be on top next time.

> In ibus-table-canjie* the correct character simply cannot be "pulled
> up" to the top of the heap where it belongs, If linux programmers
> don't improve input methods for Chinese, linux will never be taken
> seriously in Mandarin speaking areas.

I hope I could improve it a bit. Please tell me if you have further
ideas/wishes for improvement.

Comment 13 Mike FABIAN 2014-05-27 20:51:09 UTC

(In reply to Mike FABIAN from comment #12)

> I wonder whether the order the characters are listed in cangjie5.txt
> is just random or is significant. For example in there are 3 lines
> in the following order:
> 
>     bmso	豚	1000
>     bmso	冢	1000
>     bmso	䐁	1000
> 
> Does this order in cangjie5.txt have any meaning? Is 豚 the most
> common character among the characters typed “bmso”? And “冢” the
> second most common one? In this case my guess is that this is indeed
> the case.
> 
> If somebody can confirm that this is always the case in cangjie5.txt,
> I can use the original order in the source to decide which character
> should be on top if they are all exact matches and the system frequency
> is the same.

Maybe the order in cangjie5.txt has no meaning or maybe it
is sometimes wrong. Because in

https://bugzilla.redhat.com/show_bug.cgi?id=1058029#c1

you write:

Bo-Yin Yang> More errors:
Bo-Yin Yang> 板	DHE

So you want “dhe” to yield 板 as the first match.
But cangjie5.txt contains:

Line 7116> dhe	皮	1000
Line 7117> dhe	板	1000

So 板 comes second here.

And with my current update of ibus-table, 皮 is displayed as the first
match and 板 as the second (Chinese mode 5, “All Chinese
characters”)

Comment 14 Bo-Yin Yang 2014-05-28 01:37:43 UTC

Dear Mike (Fabian):

THANK YOU VERY MUCH.  I looked at your changes and want to thank you very much for your comments and extensive effort.

I think there are different kinds of issues.

(a) in some cases there is a bug in a database.
(b) in some cases there is a bug in a program.

I am not sure what the problem with 板 is; I think it is a case of (a), a database bug. But I am sure that 同 U+540C is a case of (a): here the Unihan_Variants.txt document is simply wrong.

I quote
===
同 U+540C is simplified Chinese. Unihan_Variants.txt
from http://www.unicode.org/Public/UCD/latest/ucd/Unihan.zip
contains:

    U+540C     	kTraditionalVariant	U+8855
    U+8855	kSimplifiedVariant	U+540C

That means, 同 U+540C is a “used in SC only” character (see:
http://www.unicode.org/reports/tr38/index.html#SCTC ) whose
traditional variant is 衕 U+8855 (which has the cangjie5 code
“hobrn”).
===
Except that is wrong.  Or rather, not quite right.

同 (U+540C, CJ5: bmr) is one of the most commonly used characters in Traditional or Simplified Chinese.  Its meaning is "Same".

衕 (U+8855, CJ5: hobrn) is a very rarely used Traditional character meaning "Alley, lane". Simplified Chinese simplifies 衕 U+8855 to 同 U+540C, but 同 remains a character in its own right in Traditional Chinese.

This is much like 後 (U+5F8C, CJ5: hovie, meaning behind, following, back), a Traditional Chinese character simplified to 后 U+540E in Simplified Chinese, while 后 (U+540E, CJ5: hmr, meaning queen) is its own character in Traditional Chinese.

A sanity check is the following: any character that appears in the Big5 encoding is always traditional Chinese.

I think that every case where a character's position never moves, no matter how many times I select it, is an error where the character is incorrectly assessed to be not Traditional Chinese. I.e. 板 is a similar problem to 同.
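
The Big5 sanity check can be tried out with a few lines of Python. This is only a sketch: it assumes that Python's built-in "big5" codec is a good enough stand-in for the Big5 character set, and the function name in_big5 is made up.

```python
def in_big5(char):
    """True if the character can be encoded in Big5 (Python's codec)."""
    try:
        char.encode("big5")
        return True
    except UnicodeEncodeError:
        return False


for char in ("同", "板", "\U00026688"):
    print(f"U+{ord(char):04X}", in_big5(char))
```

A character for which in_big5 returns True but which the database classifies as "used in SC only" is a candidate for a database bug.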

Comment 15 Fedora Update System 2014-05-28 02:57:45 UTC

Package ibus-table-1.5.0.20140527-1.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing ibus-table-1.5.0.20140527-1.fc20'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-6778/ibus-table-1.5.0.20140527-1.fc20
then log in and leave karma (feedback).

Comment 16 Mike FABIAN 2014-05-28 06:08:57 UTC

(In reply to Bo-Yin Yang from comment #14)

> I think that in each case where no matter how many times I select a
> character, the frequency never moves it is an error where a character is
> incorrectly assessed to be not traditional Chinese.  I.e. 板 is a similar
> problem to 同

I’ll investigate a bit more about the characters mentioned above later.

But independently from which characters are simplified and which are
traditional your statement above gives me another idea:

At the moment, in the two Chinese modes “Simplified Chinese first”
and “Traditional Chinese first” which show all Chinese characters
but change the order depending on whether a character is traditional,
simplified or both, this changing of the order is done as the *last*
step at the moment, thus overriding the user frequency.  I.e. if 同 is
classified (mistakenly?) as “*only* used in simplified Chinese”,
it will be lowered in the lookup table below U+26688 *always*
no matter how often the user selects 同.

Maybe I should change this to give the user frequency higher
priority?

I.e. change the order based on traditional versus simplified
*only* if the user frequency for these characters is the same.

That seems to make more sense to me.

If the user frequency gets higher priority, then “traditional versus
simplified” only determines the default order. Even when using
the Chinese mode “Traditional Chinese first”, if the user
selects a simplified character often, this character will be
preferred. If the user selects that character often, the user’s wish
should probably be respected and given higher priority than some database
listing which characters are traditional and which are simplified, as such
databases could contain bugs as well (as we have seen ...).
So the user’s selection should probably be more important.

Do you agree?

(Of course bugs in the database listing traditional and simplified
characters should be fixed nevertheless, but at least the
user can override it then in the 2 modes “Simplified Chinese first”
and “Traditional Chinese first”. In the modes which show
only “Simplified Chinese” or only “Traditional Chinese”, we still
have a serious problem because 同 will never show up at all
in the “Traditional Chinese” mode if it is classified as only used
in simplified Chinese. So of course we need to fix the database as well,
nevertheless it seems a good idea to give the user frequency higher
priority than the “simplified versus traditional”).
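
A Python sketch of the proposed order for the two “... first” modes: “traditional versus simplified” becomes one more sort key after the user frequency instead of a final reordering step. Whether it should come before or after the system frequency is not decided above; placing it before is an assumption of this sketch, and all names are hypothetical.

```python
def sort_key(cand, typed, prefer):
    preferred = cand["in_tc"] if prefer == "traditional" else cand["in_sc"]
    return (
        cand["code"] != typed,  # exact matches on top
        -cand["user_freq"],     # user frequency descending
        not preferred,          # preferred variety first, only on a tie
        -cand["system_freq"],   # system frequency descending
        len(cand["code"]),      # length of necessary user input ascending
    )


def make_candidates(user_freq_of_tong):
    return [
        {"char": "\U00026688", "code": "bmr", "system_freq": 1000,
         "user_freq": 0, "in_sc": True, "in_tc": True},
        {"char": "同", "code": "bmr", "system_freq": 1000,
         "user_freq": user_freq_of_tong, "in_sc": True, "in_tc": False},
    ]


def order(candidates, typed, prefer):
    ranked = sorted(candidates, key=lambda c: sort_key(c, typed, prefer))
    return [c["char"] for c in ranked]


default_order = order(make_candidates(0), "bmr", "traditional")
after_one_use = order(make_candidates(1), "bmr", "traditional")
print(default_order, after_one_use)
```

With this order, 同 moves to the top in “Traditional Chinese first” after being selected once, even while it is still classified as “used in SC only”.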

Comment 17 Fedora Update System 2014-05-28 21:12:27 UTC

ibus-table-1.5.0.20140528-1.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/ibus-table-1.5.0.20140528-1.fc20

Comment 18 Fedora Update System 2014-06-03 06:04:09 UTC

ibus-table-1.8.0-1.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/ibus-table-1.8.0-1.fc20

Comment 19 Fedora Update System 2014-06-04 15:56:29 UTC

ibus-table-1.8.1-1.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/ibus-table-1.8.1-1.fc20

Comment 20 Fedora Update System 2014-06-09 06:10:21 UTC

ibus-table-1.8.2-1.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/ibus-table-1.8.2-1.fc20

Comment 21 Fedora Update System 2014-06-12 06:30:12 UTC

ibus-table-1.8.2-1.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.
