857967 – simplified/traditional Chinese detection in ibus-table does not work well

Summary: simplified/traditional Chinese detection in ibus-table does not work well

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	ibus-table
Sub Component:
Version:	18
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Mike FABIAN
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2012-09-17 15:37 UTC by Mike FABIAN
Modified:	2013-02-13 04:29 UTC
CC List:	8 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-02-13 04:27:23 UTC
Type:	Bug
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Debian BTS	679546	0	None	None	None	2012-09-17 16:30:32 UTC

Description Mike FABIAN 2012-09-17 15:37:50 UTC

See also:

http://code.google.com/p/ibus/issues/detail?id=1492

“unable to write 晞 with any of the two Wubi input methods”

晞 (wubi code = JQDH) cannot be written.

Comment 1 Mike FABIAN 2012-09-17 16:35:03 UTC

Discussion with upstream author of ibus-table how to fix the problem:

1:00 PM me: Hi Yuwei!
 Yuwei YU: hi
  hi
1:01 PM me: http://code.google.com/p/ibus/issues/detail?id=1492 <- have you seen this?
  晞 cannot be typed. with wubi.
  I think the reason is that it is misdetected as traditional Chinese only.
1:02 PM Because that character is not in gb2312 but in big5hkscs.
  So the test conversion to gb2312 fails but the test conversion to big5hkscs succeeds in tabsqlitedb.py
1:03 PM
  # first whether in gb2312
  try:
      tmp_phrase.encode('gb2312')
      category |= 1
  except:
      if '〇'.decode('utf8') in tmp_phrase:
          # we add '〇' into SC as well
          category |= 1
  # second check big5-hkscs
  try:
      tmp_phrase.encode('big5hkscs')
      category |= 1 << 1
  ...
1:04 PM But 晞 is actually the same in simplified and traditional Chinese.
  So detecting it as traditional Chinese only seems wrong.
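
The misdetection described above can be reproduced with a minimal sketch. It is written for Python 3 (the quoted snippet is Python 2), and the names `can_encode` and `legacy_category` are hypothetical helpers, not functions from tabsqlitedb.py. The bit values follow the quoted code: bit 1 for simplified, bit 2 for traditional.

```python
SIMPLIFIED = 1          # same bit as `category |= 1` in the quoted code
TRADITIONAL = 1 << 1    # same bit as `category |= 1 << 1`


def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def legacy_category(phrase, simplified_codec='gb2312'):
    """Classify a phrase by test conversion to legacy encodings."""
    category = 0
    # '〇' is treated as simplified Chinese as well, as in the quoted code.
    if can_encode(phrase, simplified_codec) or '〇' in phrase:
        category |= SIMPLIFIED
    if can_encode(phrase, 'big5hkscs'):
        category |= TRADITIONAL
    return category


# 晞 is not in gb2312, so it ends up with the traditional bit only.
print(legacy_category('晞'))
# With gbk as the simplified test, 晞 gets both bits because gbk contains it.
print(legacy_category('晞', simplified_codec='gbk'))
```

The sketch only shows that the result depends on which legacy character sets happen to contain a character, not on whether the character is actually simplified or traditional.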
1:05 PM I think to fix this, the detection of simplified and traditional Chinese needs to be improved.
  I thought about using the Unihan database for this.
1:06 PM For example: http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E6%9D%B1
  東 contains:
  kSimplifiedVariant U+4E1C 东
  and the entry for 东 contains:
  kTraditionalVariant U+6771 東
1:07 PM The entry for 晞, http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E6%99%9E contains neither kSimplifiedVariant nor kTraditionalVariant.
1:08 PM My idea is that this can be used for a better detection of simplified and traditional Characters.
1:09 PM I.e. parse this data out of the plain text version of the Unihan database during build time of ibus-table and generate some hash table in python which contains the information whether a Chinese character is simplified, traditional or both.
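
A possible shape for that build-time step, as a sketch only. It assumes the tab-separated layout of the Unihan variants file (code point, field name, value, with `#` comment lines). The function name `build_variant_table` and the rule "has a kSimplifiedVariant means traditional, has a kTraditionalVariant means simplified" are assumptions made for illustration, not the code that later went into ibus-table.

```python
SIMPLIFIED = 1
TRADITIONAL = 2


def build_variant_table(lines):
    """Build {character: bitmask} from Unihan variant lines.

    Characters that never appear in the result carry no variant data
    and can be treated as both simplified and traditional.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        fields = line.split('\t')
        if len(fields) != 3:
            continue  # not a "codepoint, field, value" record
        codepoint, field, _value = fields
        char = chr(int(codepoint[2:], 16))  # 'U+6771' -> '東'
        if field == 'kSimplifiedVariant':
            # A simplified form exists, so this character is traditional.
            table[char] = table.get(char, 0) | TRADITIONAL
        elif field == 'kTraditionalVariant':
            # A traditional form exists, so this character is simplified.
            table[char] = table.get(char, 0) | SIMPLIFIED
    return table


# The two records quoted in the discussion above.
sample = [
    '# sample records',
    'U+6771\tkSimplifiedVariant\tU+4E1C',
    'U+4E1C\tkTraditionalVariant\tU+6771',
]
variant_table = build_variant_table(sample)
```

A character that has both fields ends up with both bits set. Whether that is the right treatment for every such character would need checking against the real data.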
 Yuwei YU: ok, a new method of encode detection is needed here
1:10 PM me: What do you think of the idea of using the data available in the Unihan database for this?
  The Unihan database seems to contain the necessary information.
 Yuwei YU: yes, we can try it
1:11 PM me: I can try to implement this.
  Do you think generating a hashtable from the Unihan database at build time of ibus-table is a good idea? I think I could probably do that.
1:12 PM Keys of the hash table would be all possible Chinese characters, values would contain whether it is simplified, traditional or both.
  The current detection code could then be replaced by iterating over the phrase and looking up all Chinese characters in the phrase in that hash table.
1:14 PM If the phrase contains only characters which are both simplified and traditional, then the phrase gets both bit 1 and bit 2 set.
  If a character is there which is only simplified, then the phrase gets only bit 1 set.
  If a character is there which is only traditional, the phrase gets bit 2 set.
1:15 PM If both a character which is exclusively simplified and a character which is exclusively traditional is in the same phrase, that is kind of weird but probably then the phrase would have to be classified as both as well, i.e. set bit 1 and bit 2.
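
The four rules above fit in a few lines. The following is a sketch under stated assumptions: `VARIANT_TABLE` is a tiny hand-written stand-in for the table that would be generated from the Unihan database, and `classify_phrase` is a hypothetical name.

```python
SIMPLIFIED = 1    # bit 1
TRADITIONAL = 2   # bit 2
BOTH = SIMPLIFIED | TRADITIONAL

# Stand-in for the generated table; characters that are missing
# (such as 晞) count as both simplified and traditional.
VARIANT_TABLE = {'东': SIMPLIFIED, '東': TRADITIONAL}


def classify_phrase(phrase, table=VARIANT_TABLE):
    """Return the category bitmask of a phrase by per-character lookup."""
    category = BOTH
    for char in phrase:
        # Each exclusive character clears the bit it does not belong to.
        category &= table.get(char, BOTH)
    # A mix of exclusively simplified and exclusively traditional
    # characters clears both bits; classify such a phrase as both.
    return category or BOTH


for phrase in ('晞', '东晞', '東晞', '东東'):
    print(phrase, classify_phrase(phrase))
```

Starting from both bits and masking them per character keeps the loop free of special cases. Only the mixed phrase needs the final fallback.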
	6 minutes
1:21 PM Yuwei YU: that's a good idea, but the phrases are not only single characters. At this point, you need to use loop to go through each character, and this would be much slower
1:22 PM maybe we can try to use gbk instead of gb2312
1:23 PM me: That won't really help because with gbk we would get the opposite problem.
1:24 PM For example, the character 晞 is in gbk and just replacing gb2312 with gbk in the current detection code would classify that character as simplified Chinese and one would not be able to type it anymore in traditional Chinese mode.
1:26 PM Yes, I thought about looping over all characters in the phrase.
1:27 PM That should be fast enough because phrases are not very long, lookup in such a simple hash table is fast and this detection is only done when creating the .db file, i.e. at build time and not while the user is typing. So this does not need to be extremely fast.
1:28 PM Yuwei YU: detection would be use in user's phrase as well
  it need speed
1:29 PM me: user’s phrase is what is in ~/.ibus/tables/ ?
 Yuwei YU: If I remember correctly
1:30 PM me: 东 for example is convertible to both gbk and big5hkscs:
  $ echo -n 东 | iconv -f utf-8 -t big5hkscs
mfabian@ari:~
$ echo -n 东 | iconv -f utf-8 -t gbk
¶«mfabian@ari:~
$
  So that doesn't really tell you that the character is simplified Chinese.
1:31 PM I feel that trying to detect whether characters are simplified or traditional Chinese by test converting to legacy encodings cannot really work well.
1:33 PM Yuwei YU: yes, I see. so the method need improvement.
 me: My guess at the moment is that iterating over the characters in a phrase and checking a hash table should be plenty fast.
1:34 PM But I can be sure only when I really do it and benchmark it ...
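
One way to run such a benchmark is sketched below, using `timeit` from the Python standard library. The functions `by_encoding` and `by_table` are simplified stand-ins for the two approaches, and `VARIANT_TABLE` is a tiny hypothetical table. Real numbers would need the full Unihan-derived table and a realistic phrase list.

```python
import timeit

# Tiny stand-in table: 1 = simplified only, 2 = traditional only.
VARIANT_TABLE = {'东': 1, '東': 2}


def by_encoding(phrase):
    """Detection by test conversion to legacy encodings."""
    category = 0
    for bit, codec in ((1, 'gb2312'), (2, 'big5hkscs')):
        try:
            phrase.encode(codec)
            category |= bit
        except UnicodeEncodeError:
            pass
    return category


def by_table(phrase):
    """Detection by looking up every character in a hash table."""
    category = 3
    for char in phrase:
        category &= VARIANT_TABLE.get(char, 3)
    return category or 3


def benchmark(phrases, repeat=1000):
    """Return the seconds each approach needs to classify all phrases."""
    def run(func):
        return timeit.timeit(lambda: [func(p) for p in phrases],
                             number=repeat)
    return {'encoding': run(by_encoding), 'table': run(by_table)}


timings = benchmark(['东', '東', '晞', '东晞'], repeat=100)
print(timings)
```

No timing result is claimed here. The sketch only shows how the two approaches could be compared on the same input.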
1:35 PM Yuwei YU: it can't be faster than encode, unless you write it in C
1:39 PM me: About the user phrase thing, when is this used?
1:40 PM For example, if I delete all .db tables in ~/.ibus/tables, then restart ibus, then type something with wubi, then restart ibus again, then ~/.ibus/tables/wubi-jidian86-user.db contains the phrases I just typed.
  And when I type the same phrases again, they are preferred.
1:41 PM Are the user phrases you mentioned above (“detection would be use in user's phrase as well”) the phrases which get inserted into ~/.ibus/tables/wubi-jidian86-user.db?
1:42 PM Yuwei YU: yes, and user's new phrases are stored as well
 me: Then I wonder why I do not see debug messages from this simplified/traditional detection when phrases get inserted into the user db.
1:43 PM I added 2 print statements:
  mfabian@ari:/usr/share/ibus-table/engine
$ diff -u tabsqlitedb.py.~1~ tabsqlitedb.py
--- tabsqlitedb.py.~1~ 2012-09-13 15:51:30.000000000 +0200
+++ tabsqlitedb.py 2012-09-17 13:39:05.504724080 +0200
@@ -475,6 +475,7 @@
 user_freq = 0
 # now we will set the category bits if this is chinese
 if self._is_chinese:
+    print "mike simplified traditional detection"
     # this is the bitmask we will use,
     # from low to high, 1st bit is simplify Chinese,
     # 2nd bit is traditional Chinese,
@@ -676,6 +677,7 @@
 This method is called in table.py by passing UserInput held data
 Return result[:]
 '''
+print "mike in select_words"
 # firstly, we make sure the len we used is equal or less than the max key length
 _len = min( len(tabkeys),self._mlen )
 _condition = ''
mfabian@ari:/usr/share/ibus-table/engine
$
 Yuwei YU: new phrase, not the already known
1:44 PM me: Then deleted all user .db files, restarted ibus, typed something, restarted ibus.
1:45 PM Yuwei YU: you type something, but use left shift to form a new phrase
 me: The phrases I typed are in the user .db after doing that.
  Ah, and I saw the debug messages now.
  So you are right, this detection code is run when inserting phrases into the user .db.
1:49 PM I probably missed the debug messages because I retyped phrases which already were in the user .db.
1:50 PM Yuwei YU: yes
	5 minutes
1:56 PM me: http://libunihan.sourceforge.net/ is interesting.
1:58 PM libunihan was apparently created to solve https://bugzilla.redhat.com/show_bug.cgi?id=227792 which looks similar to our problem.
2:03 PM Yuwei YU: yes
 me: Thinking about using libunihan ...
2:04 PM Yuwei YU: a good idea

Comment 2 Fedora Update System 2013-01-28 21:08:49 UTC

ibus-table-1.5.0-1.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/ibus-table-1.5.0-1.fc17

Comment 3 Fedora Update System 2013-01-28 21:33:18 UTC

ibus-table-1.5.0-1.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/ibus-table-1.5.0-1.fc18

Comment 4 Fedora Update System 2013-01-30 00:38:54 UTC

Package ibus-table-1.5.0-1.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing ibus-table-1.5.0-1.fc18'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-1596/ibus-table-1.5.0-1.fc18
then log in and leave karma (feedback).

Comment 5 Fedora Update System 2013-02-13 04:27:26 UTC

ibus-table-1.5.0-1.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 6 Fedora Update System 2013-02-13 04:29:49 UTC

ibus-table-1.5.0-1.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.
