Bug 488203

Summary: Multi-segment input: -てください ("-te kudasai") form incorrectly split for godan verbs.
Product: [Fedora] Fedora
Reporter: Peter Gordon <peter>
Component: anthy
Assignee: Akira TAGOH <tagoh>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 10
CC: i18n-bugs, tagoh
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-03-03 07:40:25 UTC

Description Peter Gordon 2009-03-03 06:32:14 UTC
[I wasn't quite sure where to file this, as I don't know whether it's an issue with SCIM's handling of the romaji input or with Anthy's kana parsing. Please reassign it if appropriate.]

Anthy does not treat the てください form of a verb as one (or two) segments. With multi-segment input, it parses the form as three separate pieces: the verb root, てくだ, and さい.

However, it seems to keep the form correctly as one big segment if a sentence-final particle (such as よ or ね) is appended.

For example, 頑張(がんば)ってください is parsed as がんばっ・てくだ・さい, but if you add a particle such as ね, it is properly seen as one big phrase: "頑張ってくださいね". (Here, "・" marks the separations that Anthy inserts automatically.)

I've noticed that this also happens with other verbs, but only if they are godan verbs. Ichidan verbs, such as 食べる(たべる), 見る(みる), and 上げる(あげる), behave normally here.

For example, 話(はな)してください is seen as はなし・てくだ・さい instead of はなしてください (or perhaps 話して・ください), but 話してくださいよ, again with that final particle, is properly seen as one larger segment, "はなしてください".

---

NEVRAs of related packages:
scim-libs-1.4.7-35.fc10.x86_64
scim-anthy-1.2.7-1.fc10.x86_64
scim-lang-japanese-1.4.7-35.fc10.x86_64
scim-bridge-gtk-0.4.15-8.fc10.x86_64
scim-bridge-0.4.15-8.fc10.x86_64
scim-tomoe-0.6.0-5.fc10.x86_64
scim-gtk-1.4.7-35.fc10.x86_64
scim-1.4.7-35.fc10.x86_64
anthy-9100h-1.fc10.x86_64

Comment 1 Akira TAGOH 2009-03-03 07:40:25 UTC
The above examples work for me on a fresh install. I guess you might have some bad learning data. Try again after disabling the IM with im-chooser, removing $HOME/.anthy, and then enabling the IM again.
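
A minimal sketch of the reset step described above, in Python. It moves the learning data aside instead of deleting it, so the old data can be restored if needed. Only the $HOME/.anthy location comes from this comment; the function name backup_anthy_data and the ".anthy.bak-" naming are made up for illustration. The IM still has to be disabled before running this and enabled again afterwards.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def backup_anthy_data(home: Path) -> Optional[Path]:
    """Move home/.anthy to a timestamped backup directory.

    Returns the backup path, or None if there is no .anthy directory.
    (Hypothetical helper; not part of Anthy or scim-anthy.)
    """
    src = home / ".anthy"
    if not src.exists():
        return None

    # Pick a backup name that does not collide with an earlier backup.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = home / f".anthy.bak-{stamp}"
    n = 1
    while dest.exists():
        dest = home / f".anthy.bak-{stamp}-{n}"
        n += 1

    shutil.move(str(src), str(dest))
    return dest


# Example call (not executed here): backup_anthy_data(Path.home())
```

Anthy should then start from clean learning data the next time the IM is enabled.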

Generally, a conversion result can break in any IME, commercial ones included, if it learns an input or a segmentation that isn't commonly used. As a result, IMEs usually have a feature to turn off automatic learning and learn only when the user asks them to. scim-anthy has this feature too.
Aside from that, I could patch the corpus data to give a specific segmentation priority for you, but that might affect other users if the problem doesn't commonly happen, i.e. on Anthy with a clean dictionary. If that's the case, I'm afraid I can't fix it.

Otherwise, please feel free to file a bug of this kind.

Comment 2 Peter Gordon 2009-03-04 08:44:45 UTC
Akira-san, I just removed the ~/.anthy directory as you suggested and that did indeed fix the problem for me, which means it was probably something weird that I had inadvertently trained it with.

Thanks very much, and many apologies for the bug-spam. :)