Bug 212984 - Review Request: hunspell-ar - Arabic word list/dictionaries for OpenOffice
Summary: Review Request: hunspell-ar - Arabic word list/dictionaries for OpenOffice
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: Package Review
Version: rawhide
Hardware: All
OS: Linux
medium
medium
Target Milestone: ---
Assignee: Kevin Fenzi
QA Contact: Fedora Package Reviews List
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2006-10-30 14:41 UTC by Dan Kenigsberg
Modified: 2007-11-30 22:11 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-06-20 19:00:50 UTC
Type: ---
Embargoed:
kevin: fedora-review+
kevin: fedora-cvs+


Attachments (Terms of Use)
sample hunspell-ar src.rpm (3.18 MB, application/x-rpm)
2007-06-06 11:19 UTC, Caolan McNamara
no flags Details
sample hunspell-ar spec (1.32 KB, text/plain)
2007-06-06 11:20 UTC, Caolan McNamara
no flags Details
convert Buckwalter's Arabic encoding to UTF-8 (150 bytes, text/plain)
2007-06-06 14:17 UTC, Dan Kenigsberg
no flags Details
convert Buckwalter's Arabic encoding to UTF-8 (150 bytes, text/plain)
2007-06-06 14:20 UTC, Dan Kenigsberg
no flags Details
demo text, first line pasted from ar.wikipedia, the 2nd deliberatly wrong (32 bytes, text/plain)
2007-06-06 18:36 UTC, Caolan McNamara
no flags Details
new src.rpm (3.19 MB, application/x-rpm)
2007-06-06 19:13 UTC, Caolan McNamara
no flags Details
replacement hunspell-ar.spec (1.63 KB, text/plain)
2007-06-06 19:18 UTC, Caolan McNamara
no flags Details

Description Dan Kenigsberg 2006-10-30 14:41:40 UTC
Spec URL: http://ivrix.org.il/redhat/openoffice.org-langpack-ar-dict.spec
SRPM URL: http://ivrix.org.il/redhat/openoffice.org-langpack-ar-dict-1.2-1.src.rpm
Description: 
Provides the Arabic word list/dictionaries for OpenOffice.

Warning: this package is very big and slow to load.

Comment 1 Caolan McNamara 2006-11-06 10:10:58 UTC
nothing wrong with doing this, there's a cs_CZ one in extras already

Comment 2 Caolan McNamara 2007-05-16 12:42:49 UTC
please reconsider and do a hunspell-ar package instead of a
openoffice.org-langpack-ar-dict package. You can find many good examples, e.g.
the hunspell-es one is a good fit for a hunspell-ar package

Comment 3 Jason Tibbitts 2007-06-06 00:00:50 UTC
Any progress here?  I'm sure an Arabic dictionary for hunspell would be much
appreciated, but it probably would need to go in a separate review ticket.

Comment 4 Caolan McNamara 2007-06-06 07:04:35 UTC
Well, the purpose is to provide arabic dictionaries for OOo, so I think we
should do a hunspell-ar, not a openoffice.org-langpack-ar-dict because when
firefox moves to hunspell then it gets arabic spellchecking out of the box, and
anyway because now OOo isn't in /usr/lib/openoffice.org2.0 but in
%{_libdir}/openoffice.org (i.e. on /usr/lib64 on 64bit, and without 2.0 on all
platforms) so the above spec won't work anyway.

The other thing is that there is an Arabic dictionary listed at
http://wiki.services.openoffice.org/wiki/Dictionaries#Arabic_.28North_Africa_and_Middle_East.29
if it's better than the aspell conversion done here (Is it, I have no idea ?)
then it should be used instead, if the aspell conversion is better then it would
be best to submit a snapshot of the output to dnaber at openoffice org to sync
it with the wiki

Comment 5 Dan Kenigsberg 2007-06-06 09:35:09 UTC
http://hunspell.sourceforge.net/buckwalter_morphan_1_hunspell_patch.tar.gz
listed from the wiki has a potentioal of being more compact that the conversion
done here. However, I did not yet understand how to make it work (my hunspell
segfaults when run on that dictionary).

So, I agree that using hunspell everywhere is the Right Thing. I simply don't
know how to do it.

Comment 6 Caolan McNamara 2007-06-06 11:17:49 UTC
hmm, indeed ka-boom. A problem in hunspell I believe, so I have built
hunspell-1.1.5.3-4.fc8 which should resolve that problem.

Comment 7 Caolan McNamara 2007-06-06 11:19:42 UTC
Created attachment 156334 [details]
sample hunspell-ar src.rpm

Comment 8 Caolan McNamara 2007-06-06 11:20:32 UTC
Created attachment 156335 [details]
sample hunspell-ar spec

Comment 9 Caolan McNamara 2007-06-06 11:21:14 UTC
so, with hunspell-1.1.5.3-4.fc8 would the above hunspell-ar fit our needs here ?

Comment 10 Dan Kenigsberg 2007-06-06 14:17:58 UTC
Created attachment 156353 [details]
convert Buckwalter's Arabic encoding to UTF-8

Wow, you're as quick now as you've been with Hebrew...

It seems very nice, however it uses Buckwalter's idiosyncratic encoding! Which
is not very usefull as-is... I attach ad-hoc scripts that I wrote to recode it
all into UTF-8 Arabic. Please excuse me for having troubles with tr and UTF8,
and please make the script more presentable before publishing them.

I have one other problem with hunspell-ar: why are you linking the dictionary
into all Arabic-speaking nations? Shouldn't ar_EG be kept for future
Egypt-specific dictionary, etc?

PS, I envy Arabic. Hunspell loads it so quickly comparing to Hebrew. I should
learn Hunspell's affix compression systax, and try to use it for Hebrew, too.

Comment 11 Dan Kenigsberg 2007-06-06 14:20:31 UTC
Created attachment 156354 [details]
convert Buckwalter's Arabic encoding to UTF-8

Attachment 156353 [details] converts the wordlist to UTF8. The affix file is abit more
complex, and I used:

cat ar1.aff |grep -v '^.FX' >nonFX
cat ar1.aff |grep '^.FX'| cut -d ' ' -f 1-3 >FX123
cat ar1.aff |grep '^.FX'| cut -d ' ' -f 4 | ./btr >FX4
cat ar1.aff |grep '^.FX'| cut -d ' ' -f 5- >FX5
paste -d ' ' FX123 FX4 FX5 >FX
cat nonFX FX >ar.aff

Comment 12 Caolan McNamara 2007-06-06 14:52:57 UTC
Well, we need to link the dictionary to all the Arabic locales otherwise OOo
won't be able to know what dictionaries to use for a given Arabic locale when it
goes looking for a suitable dictionary. 

We do the same stuff for e.g English on the basis that English spoken in Ireland
and most other former British Empire colonies is the same as in the UK so
there's a single en_GB.dic and a link from en_IE.dic to it. We could very easily
just leave this one as arabic.dic/ar.dic and link the various arabic locales to
it, including ar_EG. I just decided to pick one of the locales at random to copy
the .dic to so as to avoid a "ugly" non-locale .dic name.

Perhaps selecting ar_LB.dic or ar_SD.dic as the output file, and linking the
others to it on the basis that the "Current popular corpora consist mainly of ..
the archives of the daily paper Al-Hayat" which wikipedia tells me originally
started in the Lebanon and is now Saudi-owned might make *slightly* more sense.

Do you want to update the hunspell-ar.spec with the required encoding
modification, and select whatever you prefer as the canonical name for the
dictionary, either an existing locale or just e.g. ar.dic

Comment 13 Dan Kenigsberg 2007-06-06 15:47:46 UTC
Keeping ar_EG and its gazillion aliases is fine, if that's what OO wants.

I don't really "want" to update the spec; I kinda hoped that you'd take the
scripts I attached and incorporate them nicely. But I guess I can do it (on my
own time scale).

Comment 14 Caolan McNamara 2007-06-06 18:36:59 UTC
Created attachment 156377 [details]
demo text, first line pasted from ar.wikipedia, the 2nd deliberatly wrong

Comment 15 Caolan McNamara 2007-06-06 19:13:44 UTC
Created attachment 156381 [details]
new src.rpm

Comment 16 Caolan McNamara 2007-06-06 19:18:18 UTC
Created attachment 156383 [details]
replacement hunspell-ar.spec

Looking at doc/readme where the encoding used is described I see that it
doesn't seem to actually fit directly into either iso-8859-6 or windows-1256,
so it would ideally need to be converted directly into utf-8, which is really
tricky with tr and friends in a little shell script.

On the other hand I see that Buckwalter is famous enough for this GPLed
http://www.andy-roberts.net/software/buckwalter2unicode to exist. So the above
.spec uses this and afterwards the demo arabic attachment spell checks correct
for the first line and incorrect for the second line as it should. 

Does the output now pass muster ?

Comment 17 Dan Kenigsberg 2007-06-06 19:57:59 UTC
Beautifully (but note that save few words, I cannot really understand Arabic).

Few notes: 
the sample src.rpm is based on a buggy spec file; the spec of attachement 156383
is fine.

It would be nice if IGNORE could be used to ignore Arabic diacritics (and I'd
say, Latin letters). I believe that the IGNORE directive was added for that
reason, but I failed to use it correctly when you told me about it.

Please remind me (not that it really matters): what in buckwalter encoding
cannot be expressed in iso-8859-6? does it actually appear in the lexicon?

Your choice of ar_LB instead of ar_EG amuses me slightly... But I won't enter
the minefield of Arab politics. As a command-line user, I'd vote for having a
simple ar.dic. But it does not really matter that much.

Does it all "just work" within OO? I cannot really check, having an old OO version.

Comment 18 Caolan McNamara 2007-06-07 07:58:07 UTC
U+0670 ARABIC LETTER SUPERSCRIPT ALEF
U+0671 ARABIC LETTER ALEF WASLA
U+06A4 ARABIC LETTER VEH
and a couple of others

but it does seem that he hasn't used those letters in the corpus. Still it looks
like the safest approach to use the conversion script.

Yup, it appears to "just work" in OOo, the above little two lines of Arabic are
considered respectively correct and in-correct in OOo.

So we do we go for here, would you like to maintain such a proposed hunspell-ar
package, or should I do it.

Comment 19 Caolan McNamara 2007-06-07 15:08:19 UTC
so, new rpmlint output, all other guidelines also adhered to.

Comment 20 Dan Kenigsberg 2007-06-09 20:32:34 UTC
Thanks for the great work.
I would not mind at all if you completely take over hunspell-ar. In fact, I even
prefer it that way (though it would be nice if you mention me in the spec).
However, I'm willing to be the one checking hunspell-ar in if this is required.

Comment 21 Caolan McNamara 2007-06-10 18:32:54 UTC
I'll take ownership then of the post-import package, add add the appropriate OOo
Requires to the langpack-ar and the additional kudus mentions.

Comment 22 Caolan McNamara 2007-06-17 11:59:12 UTC
New Package CVS Request
=======================
Package Name: hunspell-ar
Short Description: Arabic Hunspell dictionaries
Owners: caolanm
Branches: devel

Comment 23 Kevin Fenzi 2007-06-18 04:54:10 UTC
Although I am not sure if there is a formal guideline (perhaps there should be),
we typically don't allow someone to approve a package that they are then going
to maintain. 
Could you get someone else to do a quick review of this and approve it?
(I might be able to in the next few days if no one else is able...)

Comment 24 Dan Kenigsberg 2007-06-18 05:29:46 UTC
If I count, I already reviewed the package. And if it saves some beaurocracy I
volunteered to maintain it in the future.

Comment 25 Caolan McNamara 2007-06-18 07:35:10 UTC
ok, lost interest.

Comment 26 Kevin Fenzi 2007-06-19 03:56:48 UTC
Sorry to cause trouble here. It was not my intent.
This is a confusing review. The orig package that was submitted changed into an
entirely different one, and who is reviewing and who is submitting seems to have
changed in mid-review. 

So, in an effort to move things forward: Dan: would you like to submit and
maintain this package? If so, I would be happy to review this. 
Let me know. Can you repost the current spec and src.rpm? 

Comment 27 Dan Kenigsberg 2007-06-19 04:58:11 UTC
Sure. They sit at http://ivrix.org.il/redhat/hunspell-ar.spec and
http://ivrix.org.il/redhat/hunspell-ar-0.20060208-1.src.rpm respectively.

Comment 28 Kevin Fenzi 2007-06-19 22:25:42 UTC
ok, will try and get to a review later tonight.

One thing that stands out from a quick look is the version. 
It does match how all the other hunspell-* packages are versioned, but not
entirely sure it's correct according to the guidelines. Is it a snapshot? a
pre-release? post-release?

Will look more later tonight hopefully... 

Comment 29 Kevin Fenzi 2007-06-20 04:26:09 UTC
OK - Package meets naming and packaging guidelines
See below - Spec file matches base package name.
OK - Spec has consistant macro usage.
OK - Meets Packaging Guidelines.
OK - License (GPL)
OK - License field in spec matches
OK - License file included in package
OK - Spec in American English
OK - Spec is legible.
OK - Sources match upstream md5sum:
28c3b7dce962051d55b23ac5e51f264b  ./buckwalter_morphan_1_hunspell_patch.tar.gz
28c3b7dce962051d55b23ac5e51f264b  ./buckwalter_morphan_1_hunspell_patch.tar.gz.1
b3ccdcacb32e57e2653f3e16de707a3e  ./buckwalter2unicode.py
b3ccdcacb32e57e2653f3e16de707a3e  ./buckwalter2unicode.py.1
OK - BuildRequires correct
OK - Package has %defattr and permissions on files is good.
OK - Package has a correct %clean section.
OK - Package has correct buildroot
OK - Package is code or permissible content.
OK - Packages %doc files don't affect runtime.
OK - Package has rm -rf RPM_BUILD_ROOT at top of %install

OK - Package compiles and builds on at least one arch.
OK - Package has no duplicate files in %files.
OK - Package doesn't own any directories other packages own.
OK - Package owns all the directories it creates.
OK - No rpmlint output.
OK - final provides and requires are sane:

SHOULD Items:

OK - Should build in mock.
OK - Should build on all supported archs
OK - Should have dist tag
OK - Should package latest version

Issues:

1. The base source file doesn't match this package name, but it matches up with the
other hunspell dicts, so this is not an issue.

2. I was thinking the version didn't look right, but after checking the possible
options I think this will be ok, at least I can't see any better option with the
way upstream names releases.

So, no blockers I can see, so this is re-APPROVED.
Sorry for the delay and detour here.


Comment 30 Dan Kenigsberg 2007-06-20 04:45:20 UTC
Ok, so let me repost:

New Package CVS Request
=======================
Package Name: hunspell-ar
Short Description: Arabic Hunspell dictionaries
Owners: danken.ac.il
Branches: devel

Comment 31 Kevin Fenzi 2007-06-20 05:03:34 UTC
cvs done.


Note You need to log in before you can comment on or make changes to this bug.