Description of problem:

Version-Release number of selected component (if applicable):
ibus-table-chinese-cangjie-1.4.6-3.fc20

How reproducible:
Every time

Steps to Reproduce:
1. remove ibus-table-chinese-cangjie
2. remove ~/.ibus/tables/canjie5-user.db
3. reinstall ibus-table-chinese-cangjie
4. select the ibus-table-chinese-cangjie5 input method
5. input bmr (月一口, the canonical encoding for the common character 同) sufficiently many times.

Actual results: the preedit window 1.
The posted bug gets cut short due to an encoding problem.

Actual results: the preedit window
1.#(U+26688) 2.膈c 3.#c 4. .... 9. 冨w

with 1.#(U+26688) as the default character (here I put in a pound sign # in place of Unicode characters that are not valid Traditional Chinese characters and that cause this page to be cut off).

I have to go to the next page to get the most common (by far) character 同
1.同

Expected results:
The only possible correct behavior is that a sequence in any Cangjie Chinese input method such as cangjie5 gives the most commonly used Chinese character with that encoding (e.g., 丢 for the sequence hgi) as the first in the listing, and hitting the spacebar at this point should result in the input of that Chinese character.

In the preedit window, 1.同 should be the default character.

Additional info:
This bug is similar to Bug#1045832

There are two critical, fail-safe devices which must be included with any such input method, especially open-source ones.

1. Any character which has that precise key sequence as its encoding must be listed ahead of any other character. That is, if 同 and U+26688 are the two (and the only two) Chinese characters whose encodings are "bmr", then input of bmr must list 同 and U+26688 as the first two characters in the preedit window for choice of characters, no matter how unlikely to appear 同 and U+26688 might be deemed by the data table. That the encoding matches the input sequence must outweigh any other factor in the AI.

2. There must be included with the input method a manual override hot key that SELECTS THE CURRENT INPUT CHOICE AND MAKES THAT PERSISTENT FOR ALL FOLLOWING APPEARANCES. All AIs have bugs; manual overrides are the most important feature of any AI program or algorithm.

Cangjie, for those who do not know Chinese input methods, has as a critical advantage that it does not require a choice of input characters. Each character has a single canonical sequence, so the typist need never look up at the screen to check whether the correct character is selected. This item restores the single most important feature of the Cangjie input method.

Further, there is clearly an integer overflow bug in that sometimes the most important character associated with an input sequence goes to the bottom of the pile. This issue is common and endemic to all Chinese input methods since Microsoft Zhuyin (known affectionately in Taiwan as "ㄅ半") and Microsoft New Zhuyin. However, with Microsoft input methods, after a few iterations the correct character or sequence "resurfaces" and bubbles up to the top of the input queue again as the default choice. In ibus-table-canjie* the correct character simply cannot be "pulled up" to the top of the heap where it belongs. If Linux programmers don't improve input methods for Chinese, Linux will never be taken seriously in Mandarin-speaking areas.
Please excuse my emotional words, I was annoyed at bugzilla eating my comments several times.
BTW you might want to also try ibus-cangjie.
Sorry I didn't read all three bugs carefully, but are bug 1058029, bug 1045832, and this one all the same bug?
(In reply to Bo-Yin Yang from comment #3)
> Please excuse my emotional words, I was annoyed at bugzilla eating my
> comments several times.

I can understand that, it has hit me a few times as well that bugzilla cuts comments short at the first occurrence of a character above the BMP (Unicode character with a code point > U+FFFF).

I am on vacation until 2014-03-19, I’ll have a closer look at this bug when I am back.
By the way, what version of ibus-table do you have installed?
The problem is reproducible both with ibus-table-1.5.0.20130419-3.fc20.noarch and ibus-table-1.5.0.20140312-2.fc20.noarch i.e. it is not a problem I introduced by porting ibus-table to Python3 recently.
ibus-table-1.5.0.20140527-1.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/ibus-table-1.5.0.20140527-1.fc20
(In reply to Fedora Update System from comment #9)
> ibus-table-1.5.0.20140527-1.fc20 has been submitted as an update for Fedora
> 20.
> https://admin.fedoraproject.org/updates/ibus-table-1.5.0.20140527-1.fc20

ibus-table-chinese and ibus-table-others need to be rebuilt against ibus-table-1.5.0.20140527-1.fc20 because I changed the format of the database.
(In reply to Mike FABIAN from comment #10)
> (In reply to Fedora Update System from comment #9)
> > ibus-table-1.5.0.20140527-1.fc20 has been submitted as an update for Fedora
> > 20.
> > https://admin.fedoraproject.org/updates/ibus-table-1.5.0.20140527-1.fc20
>
> ibus-table-chinese and ibus-table-others need to be rebuild against
> ibus-table-1.5.0.20140527-1.fc20 because I changed the format of the
> database.

Here are all the updates which work together because ibus-table-chinese and ibus-table-others have been rebuilt against the new ibus-table:

https://admin.fedoraproject.org/updates/ibus-table-1.5.0.20140527-1.fc20
https://admin.fedoraproject.org/updates/ibus-table-chinese-1.4.6-4.fc20
https://admin.fedoraproject.org/updates/ibus-table-others-1.3.0.20140512-2.fc20
I made a lot of changes and bug fixes to ibus-table and cleaned up the code so that I can maintain and fix it better in future.

I mentioned the new packages in comment#11:

https://admin.fedoraproject.org/updates/ibus-table-1.5.0.20140527-1.fc20
https://admin.fedoraproject.org/updates/ibus-table-chinese-1.4.6-4.fc20
https://admin.fedoraproject.org/updates/ibus-table-others-1.3.0.20140512-2.fc20

Now I’ll go through your report point by point to check what has been improved and what is still broken:

(In reply to Bo-Yin Yang from comment #2)
> The posted bug gets cut short due to an encoding problem.
>
> Actual results: the preedit window
> 1.#(U+26688) 2.膈c 3.#c 4. .... 9. 冨w
>
> with 1.#(U+26688) as the expected default character (here I put in a pound
> sigh # to mean that the unicode character are not valid traditional chinese
> characters which causes this page to be cut off).

U+26688 is a valid Chinese character, but unfortunately bugzilla cuts comments at the first appearance of any character above U+FFFF.

> have to go to the next page to get the most common (by far) character 同
> 1.同
>
> Expected results:
> The only possible correct behavior is that a sequence in any cangjie Chinese
> input method such as cangjie5 gives the most commonly used Chinese character
> with the encoding (e.g., 丢 for the sequence hgi) as the first in the
> listing, and hitting the spacebar at this point should result the input of
> that Chinese character.
>
> in the preedit window,
> 1.同 should be the default character
> Additional info:
> This bug is similar to Bug#1045832
>
> There are two critical, fail-safe devices which must be included with any
> such input methods, especially open-source ones.
>
> 1. Any character which has that precise key sequence as the encoding must be
> listed ahead of any other character. That is, if 同 and U+26688 are the two
> (and the only two) Chinese characters whose encodings are "bmr", then input
> of bmr must list 同 and U+26688 as the first two characters in the preedit
> window for choice of characters, no matter how unlikely to appear 同 and
> might be deemed by the data table.

The new ibus-table behaves like that now: when typing “bmr”, the only two exact matches 同 and U+26688 are always shown on top.

> That the encoding matches the input sequence must outweigh any other
> factor in the AI.

I implemented this, exact matches are now *always* shown on top.

But in the case of 同 and U+26688, 同 is not necessarily the first one at the moment. In the source of the cangjie5.txt table, all characters have exactly the same system frequency of 1000, i.e. the sorting by system frequency which ibus-table does has no effect for cangjie5. So which one comes first depends on which Chinese mode the user chooses *and* which one the user has selected most often previously (the number of times the user has selected a character is the “user frequency”).

(Note: If you want to see the frequencies, which can be useful for debugging purposes, you can remove the comment character # in front of

    export IBUS_TABLE_DEBUG_LEVEL=1

in /usr/libexec/ibus-engine-table; then the system frequencies and the user frequencies for the candidates are shown in the lookup table.)

ibus-table has 5 Chinese modes:

1) Simplified Chinese
2) Traditional Chinese
3) Simplified Chinese before traditional
4) Traditional Chinese before simplified
5) All Chinese characters

同 U+540C is simplified Chinese. Unihan_Variants.txt from http://www.unicode.org/Public/UCD/latest/ucd/Unihan.zip contains:

U+540C kTraditionalVariant U+8855
U+8855 kSimplifiedVariant U+540C

That means, 同 U+540C is a “used in SC only” character (see: http://www.unicode.org/reports/tr38/index.html#SCTC ) whose traditional variant is 衕 U+8855 (which has the cangjie5 code “hobrn”).
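A minimal sketch of how Unihan_Variants.txt lines like the ones quoted in this comment lead to the “used in SC only” classification. The function names `parse_variants` and `classify` and the rule itself (only a traditional variant listed means SC only, only a simplified variant listed means TC only, both or neither means used in both) are assumptions based on the reading given here, not ibus-table's actual code.

```python
from collections import defaultdict

# Lines quoted in this comment (whitespace-separated here; the real file uses tabs).
SAMPLE = """\
U+540C kTraditionalVariant U+8855
U+8855 kSimplifiedVariant U+540C
U+53F0 kSimplifiedVariant U+53F0
U+53F0 kTraditionalVariant U+53F0 U+6AAF U+81FA U+98B1
"""


def parse_variants(text):
    """Return {code point: set of variant field names present for it}."""
    fields = defaultdict(set)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        code_point, field, *_values = line.split()
        fields[int(code_point[2:], 16)].add(field)
    return fields


def classify(code_point, fields):
    """Hypothetical rule: classify by which variant fields are present."""
    present = fields.get(code_point, set())
    has_trad = "kTraditionalVariant" in present
    has_simp = "kSimplifiedVariant" in present
    if has_trad and not has_simp:
        return "SC only"
    if has_simp and not has_trad:
        return "TC only"
    return "both"  # listed with both fields, or not listed at all


FIELDS = parse_variants(SAMPLE)
print(classify(0x540C, FIELDS))   # 同
print(classify(0x26688, FIELDS))  # not listed in the sample
```

Under this rule a data error in Unihan_Variants.txt translates directly into a wrong classification, which is why the choice of sort and filter steps built on top of it matters.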
A character which is used in both SC and TC, like 台 U+53F0, is listed as both SC and TC in Unihan_Variants.txt:

U+53F0 kSimplifiedVariant U+53F0
U+53F0 kTraditionalVariant U+53F0 U+6AAF U+81FA U+98B1
U+6AAF kSimplifiedVariant U+53F0
U+81FA kSimplifiedVariant U+53F0
U+98B1 kSimplifiedVariant U+53F0

So 同 U+540C is a “used in SC only” character. The other character with the cangjie5 code “bmr”, U+26688 (http://www.zdic.net/z/9e/js/26688.htm), is not listed in Unihan_Variants.txt, which means it is not different between SC and TC, i.e. it counts as “used in both SC and TC”.

So what happens when “bmr” is typed in the above Chinese modes 1)-5)?

Chinese mode 1) simplified Chinese:
===================================

Only characters which are used in simplified Chinese (and maybe in traditional Chinese as well) are shown, i.e. characters which are *only* used in traditional Chinese are not shown at all. As both 同 U+540C *and* U+26688 are considered to be “used in SC”, they are both shown. With “export IBUS_TABLE_DEBUG_LEVEL=1” the lookup table shows:

月一口 (1/9)
1. U+26688 1000 0
2. 同 1000 0
3. 膈 月 1000 0
... more matches ...

(Side note: On top of the lookup table there is “月一口 (1/9)”. For tables which define prompt characters, like cangjie5, I display the prompt characters when the user types, i.e. I display “月一口” instead of “bmr”. “(1/9)” means that the first candidate out of 9 is selected, 9 is the number of candidates in the first page. On page down, more candidates are added and that number increases. In item 3. of the lookup table, “膈 月” is shown. “膈” is the candidate. “月” is the key which is still missing from the user input to make this a complete match, i.e. the cangjie5 code for “膈” is “月一口月” (= “bmrb”))

The lookup table is sorted by the following keys in that order; if there is a tie, the next sort keys are tried:

- exact matches on top
- user frequency descending
- system frequency descending
- length of necessary user input for this character ascending

So U+26688 “happens” to be the first match as the system frequency data for both is the same, the user frequency is the same (0, not yet used), and the length of the user input is also the same.

After selecting 同 once, 1 will be added to its user frequency. Therefore, when typing “bmr” again, 同 will be on top and the lookup table will look like:

月一口 (1/9)
1. 同 1000 1
2. U+26688 1000 0
3. 膈 月 1000 0
... more matches ...

Chinese mode 2), traditional Chinese:
=====================================

Only characters which are used in traditional Chinese (and maybe in simplified Chinese as well) are shown, i.e. characters which are *only* used in simplified Chinese are not shown at all. As 同 U+540C is “only used in SC” according to Unihan_Variants.txt, 同 U+540C is never shown in the lookup table in Chinese mode 2); only U+26688 is shown (on top of course) when typing “bmr” in Chinese mode 2).

Chinese mode 3), Simplified Chinese before traditional:
=======================================================

After first sorting the candidates as described above, i.e.

- exact matches on top
- user frequency descending
- system frequency descending
- length of necessary user input for this character ascending

the result is split into 2 lists: “a) exact matches” + “b) incomplete matches”, and for both lists a) and b) the following is done: the characters which are *only* used in traditional Chinese are filtered out and then appended at the end of the candidate list.
So the final result is

a1) exact matches used in SC (and maybe in TC as well)
a2) exact matches used *only* in TC
b1) incomplete matches used in SC (and maybe in TC as well)
b2) incomplete matches used *only* in TC

As both 同 and U+26688 count as “used in SC”, Chinese mode 3) behaves the same as Chinese mode 1) as far as the order of these two characters is concerned.

Chinese mode 4), Traditional Chinese before simplified:
=======================================================

Similar to Chinese mode 3), the resulting candidate list is:

a1) exact matches used in TC (and maybe in SC as well)
a2) exact matches used *only* in SC
b1) incomplete matches used in TC (and maybe in SC as well)
b2) incomplete matches used *only* in SC

Here, as 同 is considered a “used *only* in SC” character, it will be in a2) whereas U+26688 is considered to be used in both and will be in a1). I.e. in Chinese mode 4), U+26688 will *always* be shown before 同, no matter how often 同 is selected, because the “used *only* in SC” characters will always be in a2) in Chinese mode 4).

Chinese mode 5) All Chinese characters:
=======================================

When building the candidate list, it does not matter at all whether a character is SC or TC or both; characters are neither omitted from the candidate list nor is the order changed depending on SC or TC. I.e. all characters matching the input sequence from the cangjie5 table are added to the candidate list and the sort order is only:

- exact matches on top
- user frequency descending
- system frequency descending
- length of necessary user input for this character ascending

In the case of typing “bmr”, this means that both 同 U+540C *and* U+26688 are shown, and when neither of them has been selected before, the lookup table will look like:

月一口 (1/9)
1. U+26688 1000 0
2. 同 1000 0
3. 膈 月 1000 0
... more matches ...

Which happens to be the same as in Chinese mode 1) as far as these two characters are concerned.
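The five modes described above can be sketched roughly as follows. This is an illustration under stated assumptions, not ibus-table's real implementation: the `Entry` type, the `candidates` function, the `script` labels and the sample rows (including their order and the classification of 膈) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    code: str       # full key sequence, e.g. "bmr"
    char: str
    sys_freq: int
    user_freq: int
    script: str     # "sc" = SC only, "tc" = TC only, "both"


def sort_key(typed):
    # Exact matches first, then user frequency, system frequency, code length.
    def key(e):
        return (e.code != typed, -e.user_freq, -e.sys_freq, len(e.code))
    return key


def candidates(typed, entries, mode):
    matches = sorted((e for e in entries if e.code.startswith(typed)),
                     key=sort_key(typed))
    if mode == 1:                       # simplified only
        return [e for e in matches if e.script != "tc"]
    if mode == 2:                       # traditional only
        return [e for e in matches if e.script != "sc"]
    if mode in (3, 4):                  # one script before the other
        demoted = "tc" if mode == 3 else "sc"
        result = []
        for exact in (True, False):     # a) exact, then b) incomplete matches
            group = [e for e in matches if (e.code == typed) == exact]
            result += [e for e in group if e.script != demoted]
            result += [e for e in group if e.script == demoted]
        return result
    return matches                      # mode 5: all characters


def chars(entries):
    return [e.char for e in entries]


# Sample rows; order chosen to reproduce the lookup tables quoted above.
TABLE = [
    Entry("bmr", "\U00026688", 1000, 0, "both"),
    Entry("bmr", "同", 1000, 0, "sc"),
    Entry("bmrb", "膈", 1000, 0, "both"),
]
print(chars(candidates("bmr", TABLE, 2)))
```

Python's `sorted` is stable, so entries that tie on every key keep their order in the input list, which is why U+26688 stays ahead of 同 here until 同 gains user frequency.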
Selecting 同 will increase its user frequency by 1, and the next time “bmr” is typed, 同 will be on top.

> 2. There must be included with the input method a manual override hot key
> that SELECT THE CURRENT INPUT CHOICE AND MAKE THAT PERSISTENT FOR ALL
> FOLLOWING APPEARANCES. All AIs have bugs, manual overrides are the most
> important features of any AI programs or algorithm.

I think the adding of 1 to the user frequency for each use does this. If one uses 同 but never U+26688, 同 will be on top after the first use already. If one uses both characters, the character which has been used most often will be on top. That seems reasonable to me. As you can see in my above sort order, user frequency is used as a sort key before system frequency, i.e. system frequency only matters if there is a tie in the user frequency.

There is also the possibility to remove that manual override again. If the lookup table looks for example like

月一口 (1/9)
1. 同 1000 17
2. U+26688 1000 5
3. 膈 月 1000 0
... more matches ...

because 同 has been used 17 times and U+26688 has been used 5 times, one can type Alt+Number to reset the user frequency data for that character (*if* possible!). I.e. when typing Alt+1 in the above case, the user frequency data for 同 will be deleted, which may cause 同 not to be on top anymore when “bmr” is typed next.

If Alt+Number is typed for a character where there is no user data yet, for example Alt+3 in the above example, nothing happens. So it never hurts to try Alt+Number if something shows up too high in the lookup table, but it will only work when that candidate is at that high position because it has been used already. Using this Alt+Number, one can correct the order if characters show on top only because they have been selected by mistake.

> Cangjie, for those who do not know Chinese input methods, has as a critical
> advantage that it does not require a choice-of-input-characters. Each
> character has a single canonical sequence for which the typist need never
> ever look up to the screen to check if the correct character is selected.
> This item restores the single most important feature of the Cangjie input
> method.
>
> Further, there is clearly an integer overflow bug in that sometimes the most
> important character associated with an input sequence goes to the bottom of
> the pile.

I don’t think it had anything to do with integer overflow. But of course the old code was broken in various ways; it did a very bad job of sorting the candidate list and remembering what the user typed often. I think my new code should be much better.

While writing this, I got one more idea. My current sort keys for the lookup table

- exact matches on top
- user frequency descending
- system frequency descending
- length of necessary user input for this character ascending

cannot distinguish between 同 and U+26688 when they have never been used, because both characters compare equal on all of the above sort keys. Both are exact matches for “bmr”, both have the user frequency 0 if they have not been used, both have the system frequency 1000. The last sort key “length of necessary user input” matters only for incomplete matches.

The “obvious” and probably best way to have the most common of the exact matches appear on top even when first used would be to fix the system frequency data in cangjie5.txt. I.e. change the lines

bmr 同 1000
bmr U+26688 1000

in the source cangjie5.txt to

bmr 同 1001
bmr U+26688 1000

But then somebody who really knows which characters are most common needs to go through cangjie5.txt and edit these frequencies (I cannot do this).

I wonder whether the order in which the characters are listed in cangjie5.txt is just random or is significant. For example, there are 3 lines in the following order:

bmso 豚 1000
bmso 冢 1000
bmso 䐁 1000

Does this order in cangjie5.txt have any meaning?
Is 豚 the most common character among the characters typed “bmso”? And “冢” the second most common one? In this case my guess is that this is indeed the case.

If somebody can confirm that this is always the case in cangjie5.txt, I can use the original order in the source to decide which character should be on top if they are all exact matches and the system frequency is the same.

Can anybody tell me whether the order of the characters with the same input key sequence in the source of cangjie5.txt has anything to do with the frequency of use of the characters or not?

Here is the source:

https://github.com/definite/ibus-table-chinese/blob/master/tables/cangjie/cangjie5.txt

*If* the order of the characters in that source file cangjie5.txt is random and meaningless, then I have a last-ditch idea: I could add another sort key “Unicode code point” like this:

- exact matches on top
- user frequency descending
- system frequency descending
- Unicode code point
- length of necessary user input for this character ascending

That would probably help in the cases where there is only one character among the exact matches with a code point < U+FFFF, because the characters with code points > U+FFFF are probably rare characters.

> This issue is common and endemic to all Chinese input methods since
> Microsoft Zhuyin (known affectionately in Taiwan as "ㄅ半") and Microsoft new
> Zhuyin. However, with microsoft input methods, after a few iterations the
> correct character or sequence "resurfaces" and bobbles up to the top of the
> input queue again as the default choice.

Now this should work in ibus-table! After using a character, it should be on top next time.

> In ibus-table-canjie* the correct character simply cannot be "pulled
> up" to the top of the heap where it belongs, If linux programmers
> don't improve input methods for Chinese, linux will never be taken
> seriously in Mandarin speaking areas.

I hope I could improve it a bit.
Please tell me if you have further ideas/wishes for improvement.
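The last-ditch “Unicode code point” tie-breaker proposed in this comment could look roughly like the sketch below. The `Candidate` type, the `proposed_key` function and the sample rows are hypothetical names for illustration; ascending code point order is assumed, since the stated goal is to put characters below U+FFFF first.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    code: str       # full key sequence from the table
    char: str
    sys_freq: int
    user_freq: int


def proposed_key(typed):
    # Same keys as before, with the code point inserted before the length key.
    def key(c):
        return (c.code != typed,   # exact matches on top
                -c.user_freq,      # user frequency descending
                -c.sys_freq,       # system frequency descending
                ord(c.char),       # Unicode code point ascending (assumed)
                len(c.code))       # shorter remaining input first
    return key


ROWS = [
    Candidate("bmr", "\U00026688", 1000, 0),
    Candidate("bmr", "同", 1000, 0),
    Candidate("bmrb", "膈", 1000, 0),
]
RANKED = [c.char for c in sorted(ROWS, key=proposed_key("bmr"))]
print(RANKED)
```

With equal frequencies, 同 (U+540C) now sorts ahead of U+26688 before either has been used. Code point order is only a heuristic: it separates BMP characters from the probably rarer ones above U+FFFF, but says nothing reliable about frequency among BMP characters.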
(In reply to Mike FABIAN from comment #12)
> I wonder whether the order the characters are listed in cangjie5.txt
> is just random or is significant. For example in there are 3 lines
> in the following order:
>
> bmso 豚 1000
> bmso 冢 1000
> bmso 䐁 1000
>
> Does this order in cangjie5.txt have any meaning? Is 豚 the most
> common character among the characters typed “bmso”? And “冢” the
> second most common one? In this case my guess is that this is indeed
> the case.
>
> If somebody can confirm that this is always the case in cangjie5.txt,
> I can use the original order in the source to decide which character
> should be on top if they are all exact matches and the system frequency
> is the same.

Maybe the order in cangjie5.txt has no meaning, or maybe it is sometimes wrong. Because in

https://bugzilla.redhat.com/show_bug.cgi?id=1058029#c1

you write:

Bo-Yin Yang> More errors:
Bo-Yin Yang> 板 DHE

So you want “dhe” to yield 板 as the first match. But cangjie5.txt contains:

Line 7116> dhe 皮 1000
Line 7117> dhe 板 1000

So 板 comes second here. And with my current update of ibus-table, 皮 is displayed as the first match and 板 as the second (Chinese mode 5, “All Chinese characters”).
Dear Mike (Fabian): THANK YOU VERY MUCH.

I looked at your changes and want to thank you very much for your comments and extensive effort.

I think there are different kinds of issues: (a) in some cases there is a bug in a database; (b) in some cases there is a bug in a program.

I am not sure what the problem with 板 is; I think it is a case of (a) -- database bug. But I am sure that 同 U+540C is a case of (a) -- database bug. In particular, here the Unihan_Variants.txt document is simply wrong. I quote

===
同 U+540C is simplified Chinese. Unihan_Variants.txt from http://www.unicode.org/Public/UCD/latest/ucd/Unihan.zip contains:

U+540C kTraditionalVariant U+8855
U+8855 kSimplifiedVariant U+540C

That means, 同 U+540C is a “used in SC only” character (see: http://www.unicode.org/reports/tr38/index.html#SCTC ) whose traditional variant is 衕 U+8855 (which has the cangjie5 code “hobrn”).
===

Except that is wrong. Or rather, not quite right.

同 (U+540C, CJ5: bmr) is one of the most commonly used characters in both Traditional and Simplified Chinese. Its meaning is "same".

衕 (U+8855, CJ5: hobrn) is a very rarely used Traditional character meaning "alley, lane".

Simplified Chinese simplifies 衕 U+8855 to 同 U+540C, much like 後 (U+5F8C, CJ5: hovie, meaning behind, following, back) is a Traditional Chinese character simplified to 后 U+540E in Simplified Chinese, while 后 (U+540E, CJ5: hmr, meaning queen) is its own character in Traditional Chinese.

A sanity check is the following: any character that appears in the Big5 encoding is always Traditional Chinese.

I think that in each case where the frequency never moves a character up, no matter how many times I select it, the error is that the character is incorrectly assessed to be not Traditional Chinese. I.e. 板 is a similar problem to 同.
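A minimal sketch of the Big5 sanity check suggested above, using Python's built-in `big5` codec as a stand-in for the Big5 character set. The helper name `in_big5` is made up for illustration, and whether the codec's repertoire matches the intended Big5 variant exactly is an assumption.

```python
def in_big5(ch: str) -> bool:
    """Return True if the character can be encoded with Python's big5 codec."""
    try:
        ch.encode("big5")
    except UnicodeEncodeError:
        return False
    return True


# Characters discussed in this bug, plus U+26688 for comparison.
for ch in ("同", "板", "后", "\U00026688"):
    print(f"U+{ord(ch):04X}", in_big5(ch))
```

A character that passes this check but is classified as “used *only* in SC” would be a candidate for a classification error of the kind described for 同.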
Package ibus-table-1.5.0.20140527-1.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.

Update it with:
# su -c 'yum update --enablerepo=updates-testing ibus-table-1.5.0.20140527-1.fc20'
as soon as you are able to.

Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-6778/ibus-table-1.5.0.20140527-1.fc20
then log in and leave karma (feedback).
(In reply to Bo-Yin Yang from comment #14)
> I think that in each case where no matter how many times I select a
> character, the frequency never moves it is an error where a character is
> incorrectly assessed to be not traditional Chinese. I.e. 板 is a similar
> problem to 同

I’ll investigate a bit more about the characters mentioned above later.

But independently of which characters are simplified and which are traditional, your statement above gives me another idea:

At the moment, in the two Chinese modes “Simplified Chinese first” and “Traditional Chinese first”, which show all Chinese characters but change the order depending on whether a character is traditional, simplified or both, this changing of the order is done as the *last* step, thus overriding the user frequency. I.e. if 同 is classified (mistakenly?) as “*only* used in simplified Chinese”, it will *always* be lowered in the lookup table below U+26688, no matter how often the user selects 同.

Maybe I should change this to give the user frequency higher priority?? I.e. change the order based on traditional versus simplified *only* if the user frequency for these characters is the same. That seems to make more sense to me. If the user frequency gets higher priority, then “traditional versus simplified” only determines the default order. But even when using the Chinese mode “Traditional Chinese first”, if the user selects a simplified character often, this character will be preferred. If the user selects that character often, the user's wish should probably be respected and given higher priority over some database listing which characters are traditional and which are simplified, as such databases could contain bugs as well (as we have seen ...). So the user's selection should probably be more important.

Do you agree?

(Of course bugs in the database listing traditional and simplified characters should be fixed nevertheless, but at least the user can then override it in the 2 modes “Simplified Chinese first” and “Traditional Chinese first”. In the modes which show only “Simplified Chinese” or only “Traditional Chinese”, we still have a serious problem because 同 will never show up at all in the “Traditional Chinese” mode if it is classified as only used in simplified Chinese. So of course we need to fix the database as well; nevertheless it seems a good idea to give the user frequency higher priority than “simplified versus traditional”.)
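The proposed change for the “Traditional Chinese first” mode can be sketched as a single sort key in which the script preference only breaks ties in user frequency. The `Candidate` type, the `rank_tc_first` function and the `sc_only` flag are hypothetical, and placing the script preference ahead of system frequency is an assumption the comment does not settle.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    code: str
    char: str
    sys_freq: int
    user_freq: int
    sc_only: bool   # True if classified as used *only* in simplified Chinese


def rank_tc_first(typed, rows):
    def key(c):
        return (c.code != typed,   # exact matches on top
                -c.user_freq,      # the user's own choices come next
                c.sc_only,         # SC-only characters lose ties only
                -c.sys_freq,
                len(c.code))
    return [c.char for c in sorted(rows, key=key)]


UNUSED = [Candidate("bmr", "同", 1000, 0, True),
          Candidate("bmr", "\U00026688", 1000, 0, False)]
USED_ONCE = [Candidate("bmr", "同", 1000, 1, True),
             Candidate("bmr", "\U00026688", 1000, 0, False)]
print(rank_tc_first("bmr", UNUSED))
print(rank_tc_first("bmr", USED_ONCE))
```

With this key the classification still decides the default order, but a single selection of 同 is enough to move it above U+26688, even if the classification is wrong.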
ibus-table-1.5.0.20140528-1.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/ibus-table-1.5.0.20140528-1.fc20
ibus-table-1.8.0-1.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/ibus-table-1.8.0-1.fc20
ibus-table-1.8.1-1.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/ibus-table-1.8.1-1.fc20
ibus-table-1.8.2-1.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/ibus-table-1.8.2-1.fc20
ibus-table-1.8.2-1.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.