Bug 1373404

Summary: unmunch produces weird results for some dictionaries
Product: [Fedora] Fedora
Reporter: Mike FABIAN <mfabian>
Component: hunspell
Assignee: Caolan McNamara <caolanm>
Status: CLOSED EOL
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 25
CC: bjoern, caolanm, mfabian
Last Closed: 2017-12-12 10:17:58 UTC
Type: Bug

Description Mike FABIAN 2016-09-06 08:07:35 UTC
Tested on Fedora-Workstation-netinst-x86_64-25-Alpha-1.2.iso

[mfabian@Fedora-Workstation-netinst-x86_6 ~]$ rpm -q hunspell
hunspell-1.4.1-1.fc25.x86_64
[mfabian@Fedora-Workstation-netinst-x86_6 ~]$ rpm -q hunspell-de
hunspell-de-0.20160407-1.fc25.noarch
[mfabian@Fedora-Workstation-netinst-x86_6 ~]$ cat /etc/fedora-release 
Fedora release 25 (Twenty Five)
[mfabian@Fedora-Workstation-netinst-x86_6 ~]$ 

https://github.com/hunspell/hunspell/blob/master/README says:

    “unmunch: list all recognized words of a MySpell dictionary”

When using unmunch with the German dictionary, I get stuff like:

[mfabian@Fedora-Workstation-netinst-x86_6 ~]$ unmunch /usr/share/myspell/de_DE.dic /usr/share/myspell/de_DE.aff 2>/dev/null | grep -a '^Agent$' -A 40
Agent
Agentin
Agentinnen
AgentIn
AgentInnen
Agenten
-agentin
-agentinnen
-agentIn
-agentInnen
-agenten
-agent
Agenten
Agenten0/xoc|
Agenten-/zocf|
Agenten-/cz|
Agentinnen/xyoc|
AgentInnen/xyoc|
Agentinnen/xyocf|
AgentInnen/xyocf|
Agentinnen-/cz|
AgentInnen-/cz|
-/coyf|Agenten0/xoc|
-/coyf|Agenten-/zocf|
-/coyf|Agenten-/cz|
-/coyf|Agentinnen/xyoc|
-/coyf|AgentInnen/xyoc|
-/coyf|Agentinnen/xyocf|
-/coyf|AgentInnen/xyocf|
-/coyf|Agentinnen-/cz|
-/coyf|AgentInnen-/cz|
-/coyf|Agenten
Agentur
Agenturen
Agentur0/xoc|
Agentur-/zocf|
Agentur-/cz|
-/coyf|Agenturen
-agenturen
-/coyf|Agentur0/xoc|
-agentur0/xoc|
[mfabian@Fedora-Workstation-netinst-x86_6 ~]$

This looks OK:

Agent
Agentin
Agentinnen
AgentIn
AgentInnen
Agenten

I am not sure about this:

-agentin
-agentinnen
-agentIn
-agentInnen
-agenten
-agent

And what is this?

Agenten0/xoc|
Agenten-/zocf|
Agenten-/cz|
Agentinnen/xyoc|
AgentInnen/xyoc|
Agentinnen/xyocf|
AgentInnen/xyocf|
Agentinnen-/cz|
AgentInnen-/cz|
-/coyf|Agenten0/xoc|
-/coyf|Agenten-/zocf|
-/coyf|Agenten-/cz|
-/coyf|Agentinnen/xyoc|
-/coyf|AgentInnen/xyoc|
-/coyf|Agentinnen/xyocf|
-/coyf|AgentInnen/xyocf|
-/coyf|Agentinnen-/cz|
-/coyf|AgentInnen-/cz|
-/coyf|Agenten

These do not look like “recognized words of a MySpell dictionary”.

The original de_DE.dic contains:

Agenten/ghij

And looking into de_DE.aff for the “i” and “j” flags, I find:

PFX i Y 1
PFX i 0 -/coyf .

SFX j Y 3
SFX j 0 0/xoc .
SFX j 0 -/zocf .
SFX j 0 -/cz .

I don’t understand these prefix rules for “i” and suffix rules for “j”.

So is this a bug in “unmunch” or is it a bug in the de_DE.{dic,aff} files?
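
For illustration, here is a minimal Python sketch of what happens if the fourth field of each rule is attached to the word as literal text, without treating “/” specially. This is only an assumption about why the output looks the way it does, not a description of how unmunch is implemented; the helper name apply_literally is hypothetical, and the sketch does not account for the “|” characters in the real output.

```python
# Rule lines copied from the de_DE.aff excerpt in this report.
RULES = [
    "PFX i 0 -/coyf .",
    "SFX j 0 0/xoc .",
    "SFX j 0 -/zocf .",
    "SFX j 0 -/cz .",
]


def apply_literally(word, rule):
    """Attach the affix field of a rule to the word without interpreting '/'.

    Hypothetical helper: it treats the whole fourth field as plain text.
    """
    kind, _flag, strip, affix, _condition = rule.split()
    # A strip field of "0" means nothing is removed from the word.
    if strip != "0":
        raise ValueError("this sketch only handles rules that strip nothing")
    # Prefix rules go in front of the word, suffix rules behind it.
    return affix + word if kind == "PFX" else word + affix


if __name__ == "__main__":
    for rule in RULES:
        print(apply_literally("Agenten", rule))
```

Strings such as “Agenten0/xoc” and “Agenten-/zocf” come out of this literal reading, which matches the shape of the entries questioned above.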

Comment 1 Mike FABIAN 2016-09-06 08:18:43 UTC
https://github.com/hunspell/hunspell/blob/master/README also says:

    “wordforms: word generation (Hunspell version of unmunch)”

For comparison, here is what “wordforms” produces for “Agent” using
de_DE.aff and de_DE.dic:

[mfabian@Fedora-Workstation-netinst-x86_6 ~]$ cd /usr/share/myspell/
[mfabian@Fedora-Workstation-netinst-x86_6 myspell]$ wordforms de_DE.aff de_DE.dic Agent
Agent
-agent
Agentinnen
Agenten
Agentin
AgentIn
AgentInnen
Agent
Agent
Agenten
Agentinnen
Agenten
Agentin
AgentIn
AgentInnen
Agent
Agent
Agenten
-agentinnen
-agenten
-agentin
-agentIn
-agentInnen
-agent
-agent
-agenten
[mfabian@Fedora-Workstation-netinst-x86_6 myspell]$ pwd
/usr/share/myspell
[mfabian@Fedora-Workstation-netinst-x86_6 myspell]$ 

This looks better.

Comment 2 Mike FABIAN 2016-09-06 08:20:53 UTC
By the way, “wordforms” seems to work only when the current
directory is the directory where the dictionaries are. 
In comment#1, “wordforms” was executed in /usr/share/myspell/.

Trying to execute it in other directories fails:

[mfabian@Fedora-Workstation-netinst-x86_6 ~]$ pwd
/home/mfabian
[mfabian@Fedora-Workstation-netinst-x86_6 ~]$ wordforms /usr/share/myspell/de_DE.aff /usr/share/myspell/de_DE.dic Agent
awk: fatal: cannot open file `/tmp/wordforms.aff' for reading (No such file or directory)
Can't open affix or dictionary files for dictionary named "/tmp/wordforms".
[mfabian@Fedora-Workstation-netinst-x86_6 ~]$
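
As a workaround, the call can be wrapped so that it always runs from the dictionary directory with bare file names, which is the only combination shown to work above. The following Python sketch assumes that “wordforms” is on PATH and that both files live in the same directory; the function names build_wordforms_call and run_wordforms are hypothetical.

```python
import subprocess
from pathlib import Path


def build_wordforms_call(aff_path, dic_path, word):
    """Return (argv, cwd) for running wordforms from the dictionary directory."""
    aff = Path(aff_path)
    dic = Path(dic_path)
    # The working invocation above used bare file names in one directory,
    # so this sketch refuses anything else.
    if aff.parent != dic.parent:
        raise ValueError("the .aff and .dic files must be in the same directory")
    return ["wordforms", aff.name, dic.name, word], aff.parent


def run_wordforms(aff_path, dic_path, word):
    """Run wordforms with the dictionary directory as the working directory."""
    argv, cwd = build_wordforms_call(aff_path, dic_path, word)
    result = subprocess.run(argv, cwd=cwd, capture_output=True, check=True)
    # Output is returned as bytes because its encoding depends on the dictionary.
    return result.stdout.splitlines()


if __name__ == "__main__":
    for line in run_wordforms(
        "/usr/share/myspell/de_DE.aff", "/usr/share/myspell/de_DE.dic", "Agent"
    ):
        print(line)
```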

Comment 3 Björn Jacke 2016-09-14 08:55:07 UTC
If the description of unmunch is right and it is meant for MySpell, then the broken results are to be expected, because the i and j flags in the hunspell dictionary rely on a hunspell-only feature. To avoid confusion, Debian, for example, installs hunspell dictionaries in /usr/share/hunspell/ rather than in /usr/share/myspell/.

The result of wordforms from comment #1 looks reasonable.

Comment 4 Mike FABIAN 2016-09-14 09:20:12 UTC
(In reply to Björn Jacke from comment #3)
> If the description of unmunch is right and it is meant for MySpell, then the
> broken results are to be expected, because the i and j flags in the hunspell
> dictionary rely on a hunspell-only feature. To avoid confusion, Debian, for
> example, installs hunspell dictionaries in /usr/share/hunspell/ rather than
> in /usr/share/myspell/.
> 
> The result of wordforms from comment #1 looks reasonable.

But unmunch is now distributed with hunspell, and we don’t even have
myspell in Fedora.

If unmunch is now part of hunspell, it should work correctly for hunspell
dictionaries, shouldn’t it? Otherwise it is quite useless.

Comment 5 Mike FABIAN 2016-09-14 09:21:30 UTC
From: Björn Jacke <bjoern>
Subject: Re: [Bug 1373404] unmunch produces weird results for some dictionaries
To: bugzilla, mfabian
Date: Tue, 13 Sep 2016 18:40:58 +0200

hi mike, the i and j flags are for leading and trailing dashes on composable words.

required for example in "Versicherungsberater und -agenten"

cheers
björn
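
To make the explanation above concrete, here is a small Python sketch of how the “i” and “j” rules can be read under the hunspell affix file conventions, assuming (per the hunspell affix format documentation, as far as I understand it) that the affix field may carry continuation flags after a slash and that “0” stands for an empty string. The class AffixRule and the function parse_rule are hypothetical names, and what the individual continuation flags mean is defined elsewhere in de_DE.aff and is not interpreted here.

```python
from dataclasses import dataclass


@dataclass
class AffixRule:
    kind: str          # "PFX" or "SFX"
    flag: str          # flag that selects the rule, e.g. "i" or "j"
    strip: str         # characters removed from the word ("" for none)
    affix: str         # characters added to the word ("" for none)
    continuation: str  # flags attached to the affix after the slash
    condition: str     # condition field, "." meaning unconditional


def parse_rule(line):
    """Split one PFX/SFX rule line into its parts (hunspell-style reading)."""
    kind, flag, strip, affix_field, condition = line.split()
    # Everything after the first "/" is treated as continuation flags.
    affix, _, continuation = affix_field.partition("/")
    return AffixRule(
        kind=kind,
        flag=flag,
        strip="" if strip == "0" else strip,
        affix="" if affix == "0" else affix,
        continuation=continuation,
        condition=condition,
    )


if __name__ == "__main__":
    for line in ("PFX i 0 -/coyf .", "SFX j 0 0/xoc .", "SFX j 0 -/zocf ."):
        print(parse_rule(line))
```

Read this way, the prefix rule adds only a leading “-” and the suffix rules add either nothing or a trailing “-”, which fits the leading and trailing dashes described above; the letters after the slash never become part of the word.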

Comment 6 Fedora End Of Life 2017-11-16 19:23:33 UTC
This message is a reminder that Fedora 25 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 25. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '25'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 25 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 7 Fedora End Of Life 2017-12-12 10:17:58 UTC
Fedora 25 changed to end-of-life (EOL) status on 2017-12-12. Fedora 25 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.