Spec URL: http://ivrix.org.il/redhat/openoffice.org-langpack-ar-dict.spec SRPM URL: http://ivrix.org.il/redhat/openoffice.org-langpack-ar-dict-1.2-1.src.rpm Description: Provides the Arabic word list/dictionaries for OpenOffice. Warning: this package is very big and slow to load.
nothing wrong with doing this, there's a cs_CZ one in extras already
please reconsider and do a hunspell-ar package instead of a openoffice.org-langpack-ar-dict package. You can find many good examples, e.g. the hunspell-es one is a good fit for a hunspell-ar package
Any progress here? I'm sure an Arabic dictionary for hunspell would be much appreciated, but it probably would need to go in a separate review ticket.
Well, the purpose is to provide arabic dictionaries for OOo, so I think we should do a hunspell-ar, not a openoffice.org-langpack-ar-dict because when firefox moves to hunspell then it gets arabic spellchecking out of the box, and anyway because now OOo isn't in /usr/lib/openoffice.org2.0 but in %{_libdir}/openoffice.org (i.e. on /usr/lib64 on 64bit, and without 2.0 on all platforms) so the above spec won't work anyway. The other thing is that there is an Arabic dictionary listed at http://wiki.services.openoffice.org/wiki/Dictionaries#Arabic_.28North_Africa_and_Middle_East.29 if it's better than the aspell conversion done here (Is it, I have no idea ?) then it should be used instead, if the aspell conversion is better then it would be best to submit a snapshot of the output to dnaber at openoffice org to sync it with the wiki
http://hunspell.sourceforge.net/buckwalter_morphan_1_hunspell_patch.tar.gz listed from the wiki has a potentioal of being more compact that the conversion done here. However, I did not yet understand how to make it work (my hunspell segfaults when run on that dictionary). So, I agree that using hunspell everywhere is the Right Thing. I simply don't know how to do it.
hmm, indeed ka-boom. A problem in hunspell I believe, so I have built hunspell-1.1.5.3-4.fc8 which should resolve that problem.
Created attachment 156334 [details] sample hunspell-ar src.rpm
Created attachment 156335 [details] sample hunspell-ar spec
so, with hunspell-1.1.5.3-4.fc8 would the above hunspell-ar fit our needs here ?
Created attachment 156353 [details] convert Buckwalter's Arabic encoding to UTF-8 Wow, you're as quick now as you've been with Hebrew... It seems very nice, however it uses Buckwalter's idiosyncratic encoding! Which is not very usefull as-is... I attach ad-hoc scripts that I wrote to recode it all into UTF-8 Arabic. Please excuse me for having troubles with tr and UTF8, and please make the script more presentable before publishing them. I have one other problem with hunspell-ar: why are you linking the dictionary into all Arabic-speaking nations? Shouldn't ar_EG be kept for future Egypt-specific dictionary, etc? PS, I envy Arabic. Hunspell loads it so quickly comparing to Hebrew. I should learn Hunspell's affix compression systax, and try to use it for Hebrew, too.
Created attachment 156354 [details] convert Buckwalter's Arabic encoding to UTF-8 Attachment 156353 [details] converts the wordlist to UTF8. The affix file is abit more complex, and I used: cat ar1.aff |grep -v '^.FX' >nonFX cat ar1.aff |grep '^.FX'| cut -d ' ' -f 1-3 >FX123 cat ar1.aff |grep '^.FX'| cut -d ' ' -f 4 | ./btr >FX4 cat ar1.aff |grep '^.FX'| cut -d ' ' -f 5- >FX5 paste -d ' ' FX123 FX4 FX5 >FX cat nonFX FX >ar.aff
Well, we need to link the dictionary to all the Arabic locales otherwise OOo won't be able to know what dictionaries to use for a given Arabic locale when it goes looking for a suitable dictionary. We do the same stuff for e.g English on the basis that English spoken in Ireland and most other former British Empire colonies is the same as in the UK so there's a single en_GB.dic and a link from en_IE.dic to it. We could very easily just leave this one as arabic.dic/ar.dic and link the various arabic locales to it, including ar_EG. I just decided to pick one of the locales at random to copy the .dic to so as to avoid a "ugly" non-locale .dic name. Perhaps selecting ar_LB.dic or ar_SD.dic as the output file, and linking the others to it on the basis that the "Current popular corpora consist mainly of .. the archives of the daily paper Al-Hayat" which wikipedia tells me originally started in the Lebanon and is now Saudi-owned might make *slightly* more sense. Do you want to update the hunspell-ar.spec with the required encoding modification, and select whatever you prefer as the canonical name for the dictionary, either an existing locale or just e.g. ar.dic
Keeping ar_EG and its gazillion aliases is fine, if that's what OO wants. I don't really "want" to update the spec; I kinda hoped that you'd take the scripts I attached and incorporate them nicely. But I guess I can do it (on my own time scale).
Created attachment 156377 [details] demo text, first line pasted from ar.wikipedia, the 2nd deliberatly wrong
Created attachment 156381 [details] new src.rpm
Created attachment 156383 [details] replacement hunspell-ar.spec Looking at doc/readme where the encoding used is described I see that it doesn't seem to actually fit directly into either iso-8859-6 or windows-1256, so it would ideally need to be converted directly into utf-8, which is really tricky with tr and friends in a little shell script. On the other hand I see that Buckwalter is famous enough for this GPLed http://www.andy-roberts.net/software/buckwalter2unicode to exist. So the above .spec uses this and afterwards the demo arabic attachment spell checks correct for the first line and incorrect for the second line as it should. Does the output now pass muster ?
Beautifully (but note that save few words, I cannot really understand Arabic). Few notes: the sample src.rpm is based on a buggy spec file; the spec of attachement 156383 is fine. It would be nice if IGNORE could be used to ignore Arabic diacritics (and I'd say, Latin letters). I believe that the IGNORE directive was added for that reason, but I failed to use it correctly when you told me about it. Please remind me (not that it really matters): what in buckwalter encoding cannot be expressed in iso-8859-6? does it actually appear in the lexicon? Your choice of ar_LB instead of ar_EG amuses me slightly... But I won't enter the minefield of Arab politics. As a command-line user, I'd vote for having a simple ar.dic. But it does not really matter that much. Does it all "just work" within OO? I cannot really check, having an old OO version.
U+0670 ARABIC LETTER SUPERSCRIPT ALEF U+0671 ARABIC LETTER ALEF WASLA U+06A4 ARABIC LETTER VEH and a couple of others but it does seem that he hasn't used those letters in the corpus. Still it looks like the safest approach to use the conversion script. Yup, it appears to "just work" in OOo, the above little two lines of Arabic are considered respectively correct and in-correct in OOo. So we do we go for here, would you like to maintain such a proposed hunspell-ar package, or should I do it.
so, new rpmlint output, all other guidelines also adhered to.
Thanks for the great work. I would not mind at all if you completely take over hunspell-ar. In fact, I even prefer it that way (though it would be nice if you mention me in the spec). However, I'm willing to be the one checking hunspell-ar in if this is required.
I'll take ownership then of the post-import package, add add the appropriate OOo Requires to the langpack-ar and the additional kudus mentions.
New Package CVS Request ======================= Package Name: hunspell-ar Short Description: Arabic Hunspell dictionaries Owners: caolanm Branches: devel
Although I am not sure if there is a formal guideline (perhaps there should be), we typically don't allow someone to approve a package that they are then going to maintain. Could you get someone else to do a quick review of this and approve it? (I might be able to in the next few days if no one else is able...)
If I count, I already reviewed the package. And if it saves some beaurocracy I volunteered to maintain it in the future.
ok, lost interest.
Sorry to cause trouble here. It was not my intent. This is a confusing review. The orig package that was submitted changed into an entirely different one, and who is reviewing and who is submitting seems to have changed in mid-review. So, in an effort to move things forward: Dan: would you like to submit and maintain this package? If so, I would be happy to review this. Let me know. Can you repost the current spec and src.rpm?
Sure. They sit at http://ivrix.org.il/redhat/hunspell-ar.spec and http://ivrix.org.il/redhat/hunspell-ar-0.20060208-1.src.rpm respectively.
ok, will try and get to a review later tonight. One thing that stands out from a quick look is the version. It does match how all the other hunspell-* packages are versioned, but not entirely sure it's correct according to the guidelines. Is it a snapshot? a pre-release? post-release? Will look more later tonight hopefully...
OK - Package meets naming and packaging guidelines See below - Spec file matches base package name. OK - Spec has consistant macro usage. OK - Meets Packaging Guidelines. OK - License (GPL) OK - License field in spec matches OK - License file included in package OK - Spec in American English OK - Spec is legible. OK - Sources match upstream md5sum: 28c3b7dce962051d55b23ac5e51f264b ./buckwalter_morphan_1_hunspell_patch.tar.gz 28c3b7dce962051d55b23ac5e51f264b ./buckwalter_morphan_1_hunspell_patch.tar.gz.1 b3ccdcacb32e57e2653f3e16de707a3e ./buckwalter2unicode.py b3ccdcacb32e57e2653f3e16de707a3e ./buckwalter2unicode.py.1 OK - BuildRequires correct OK - Package has %defattr and permissions on files is good. OK - Package has a correct %clean section. OK - Package has correct buildroot OK - Package is code or permissible content. OK - Packages %doc files don't affect runtime. OK - Package has rm -rf RPM_BUILD_ROOT at top of %install OK - Package compiles and builds on at least one arch. OK - Package has no duplicate files in %files. OK - Package doesn't own any directories other packages own. OK - Package owns all the directories it creates. OK - No rpmlint output. OK - final provides and requires are sane: SHOULD Items: OK - Should build in mock. OK - Should build on all supported archs OK - Should have dist tag OK - Should package latest version Issues: 1. The base source file doesn't match this package name, but it matches up with the other hunspell dicts, so this is not an issue. 2. I was thinking the version didn't look right, but after checking the possible options I think this will be ok, at least I can't see any better option with the way upstream names releases. So, no blockers I can see, so this is re-APPROVED. Sorry for the delay and detour here.
Ok, so let me repost: New Package CVS Request ======================= Package Name: hunspell-ar Short Description: Arabic Hunspell dictionaries Owners: danken.ac.il Branches: devel
cvs done.