567285 – Abusive spell-checker

Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	rpmlint
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Ville Skyttä
QA Contact:	Fedora Extras Quality Assurance

Reported:	2010-02-22 15:07 UTC by Nicolas Mailhot
Modified:	2010-03-12 04:27 UTC
CC List:	3 users
Fixed In Version:	rpmlint-0.95-2.fc13
Doc Type:	Bug Fix
Last Closed:	2010-03-03 20:49:45 UTC
Type:	---

Description Nicolas Mailhot 2010-02-22 15:07:14 UTC

The new spell-checking test is too dumb to ignore names (capitalized words not at the beginning of a sentence). Therefore, it always triggers on packages that include upstream name(s) in the description. (yes I know of en_US camelcase sentence conventions and no, they are not international conventions and people who use them should not make other packagers suffer)

Since it is also too dumb to consolidate multiple occurrences of the same spelling warning, those warnings tend to drown all other rpmlint messages.

It makes rpmlint a lot less effective.
rpmlint-0.94-1.fc13.noarch

Comment 1 Ville Skyttä 2010-02-22 17:56:31 UTC

(In reply to comment #0)
> Therefore, it always triggers on packages that
> include upstream name(s) in the description.

It shouldn't always trigger on upstream names: among other things, it tries to avoid warning about "components" of the name of the package being checked.  But sure, it will warn about words it doesn't know and considers misspelled.

> (yes I know of en_US camelcase
> sentence conventions and no, they are not international conventions and people
> who use them should not make other packagers suffer)

http://fedoraproject.org/wiki/Packaging:Guidelines#Summary_and_description
"Please put personal preferences aside and use American English spelling in the summary and description."

> Since it is also too dumb to consolidate multiple occurrences of the same
> spelling warning, those warnings tend to drown all other rpmlint messages.

It has code to avoid warning multiple times about the same word when the word occurs multiple times in the same tag's value.  Filtering across different tags would be misleading IMO.
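
The per-tag behaviour described above can be illustrated with a minimal sketch. This is not rpmlint's actual code; the names `unique_misspellings`, `check_tags` and the tiny `known` word list are hypothetical stand-ins for the real checker and dictionary.

```python
def unique_misspellings(words, is_known):
    """Return each unknown word once, in first-seen order."""
    seen = set()
    flagged = []
    for word in words:
        if word in seen or is_known(word):
            continue
        seen.add(word)
        flagged.append(word)
    return flagged


def check_tags(tags, is_known):
    """Check every tag separately; no 'seen' state is shared across tags."""
    return {
        tag: unique_misspellings(text.split(), is_known)
        for tag, text in tags.items()
    }


# Hypothetical dictionary and package tags, for illustration only.
known = {"a", "font", "by", "the", "and"}
tags = {
    "Summary": "a font by Didot",
    "%description": "the Didot font and the Bodoni font by Didot",
}
result = check_tags(tags, lambda w: w.lower() in known)
```

Under these assumptions "Didot" is reported once per tag (once for Summary, once for %description), even though it occurs twice in the description.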

Do you have a reproducer where these features don't work, or concrete ideas how to improve them?

You can filter out the spell checker messages altogether if you don't like them, or disable the Enchant spell checker.  With Enchant disabled, rpmlint falls back to its internal (very basic, not far from useless) spell checker, which generates much less output.  See /usr/share/doc/rpmlint-*/config.example

Comment 2 Nicolas Mailhot 2010-02-22 18:39:35 UTC

(In reply to comment #1)
> (In reply to comment #0)
> > Therefore, it always triggers on packages that
> > include upstream name(s) in the description.
> 
> It shouldn't always trigger on upstream names, it tries to avoid warning about
> "components" of the name of the package being checked among other things.  But
> sure, it will warn about things it doesn't know about and finds misspelled.

rpmlint /tmp/*rpm |grep spelling |sort |uniq

gfs-goschen-fonts.src: W: spelling-error %description -l en_US Bodoni -> Bodkin, Bordon, Bodice
gfs-goschen-fonts.src: W: spelling-error %description -l en_US Didot -> Dido, Di dot, Di-dot
gfs-goschen-fonts.src: W: spelling-error %description -l en_US Georg -> George, Ge org, Ge-org
gfs-goschen-fonts.src: W: spelling-error %description -l en_US Göschen -> Gretchen, Gaucheness, Schelling
gfs-goschen-fonts.src: W: spelling-error %description -l en_US Göschensche -> Schenectady, Nonscheduled, Gaucheness
gfs-goschen-fonts.src: W: spelling-error %description -l en_US Griesbach -> Grievance, Grievous, Gorbachev
gfs-goschen-fonts.src: W: spelling-error %description -l en_US Jakob -> Jacob, Jake, Jakarta
gfs-goschen-fonts.src: W: spelling-error %description -l en_US Joachim -> Poaching, Joaquin, Machismo
gfs-goschen-fonts.src: W: spelling-error %description -l en_US Matthiopoulos -> Matthias
gfs-goschen-fonts.src: W: spelling-error %description -l en_US Prillwitz -> Priscilla, Prioritize, Primarily
gfs-goschen-fonts.src: W: spelling-error %description -l en_US Verlagsbuchhandlung 
gfs-goschen-fonts.src: W: spelling-error Summary(en_US) th -> ht, Th, t
mplus-fonts.src: W: spelling-error %description -l en_US combinations -> combination, combination's, combination s
mplus-fonts.src: W: spelling-error %description -l en_US fullwidth -> full width, full-width, Fullerton
mplus-fonts.src: W: spelling-error %description -l en_US halfwidth -> half width, half-width, halfwit
mplus-fonts.src: W: spelling-error Summary(en_US) Coji -> Colic, Coir, Coin
mplus-fonts.src: W: spelling-error Summary(en_US) Morishita -> Morison, Moorish, Morita
mplus-fonts.src: W: spelling-error Summary(en_US) superfamily -> super family, super-family, superficially
paktype-nashk-basic-fonts.src: W: spelling-error %description -l en_US Lateef -> Latest, Latent, Lateral
paktype-nashk-basic-fonts.src: W: spelling-error %description -l en_US naskh -> Nash, nasal, nasty
paktype-nashk-basic-fonts.src: W: spelling-error %description -l en_US Sagar -> Saar, Agar, Sagan
paratype-pt-sans-fonts.src: W: spelling-error %description -l en_US Korolkova -> Tsiolkovsky, Tereshkova, Walkover
paratype-pt-sans-fonts.src: W: spelling-error %description -l en_US libre -> lire, lib re, lib-re
paratype-pt-sans-fonts.src: W: spelling-error %description -l en_US th -> ht, Th, t
paratype-pt-sans-fonts.src: W: spelling-error %description -l en_US Umpeleva -> Relevant, Elevator, Elevate
paratype-pt-sans-fonts.src: W: spelling-error %description -l en_US Yefimov -> Asimov, Immovable, Immovably
ubuntutitle-fonts.src: W: spelling-error %description -l en_US Fitzsimon -> Fitzroy, Fitzpatrick, Fitzgerald

(random selection, culling all the duplicates)

> > (yes I know of en_US camelcase
> > sentence conventions and no, they are not international conventions and people
> > who use them should not make other packagers suffer)
> 
> http://fedoraproject.org/wiki/Packaging:Guidelines#Summary_and_description
> "Please put personal preferences aside and use American English spelling in the
> summary and description."

This is not about spelling, and in fact even the over-anal spellchecker rpmlint uses now does not require camelcase sentences. American English can thankfully be written with normal casing conventions that help identify names that should not be spellchecked.

> > Since it is also too dumb to consolidate multiple occurrences of the same
> > spelling warning, those warnings tend to drown all other rpmlint messages.
> 
> It has code to avoid warning multiple times about the same word when the word
> occurs multiple times in the same tag's value.  Filtering across different tags
> would be misleading IMO.

What's misleading today is that there is so much noise about spellings that are not worth checking that people miss the actual warnings they should do something about.

> Do you have a reproducer where these features don't work, or concrete ideas how
> to improve them?

Do not check capitalized words that are not at the beginning of a sentence.

> You can filter out the spell checker messages altogether if you don't like
> them, or disable the Enchant spell checker which results in the internal (very
> basic, not far from useless) spell checker being used which generates much less
> output.  See /usr/share/doc/rpmlint-*/config.example

Already done that (got sick of missing warnings), but it does not help when reviewing other people's packages that include problems that should have been detected at rpmlint time.

Comment 3 Ville Skyttä 2010-02-22 19:17:30 UTC

(In reply to comment #2)
> rpmlint /tmp/*rpm |grep spelling |sort |uniq
[...]

None of these seem to be reports of a "component" of the name of the package in question being flagged as a misspelling.  But never mind.

> (random selection, culling all the duplicates)

It's a bug if exact duplicate messages are emitted.  Could you check your random selection if there are any?

> Do not check capitalized words not at the beginning of a sentence

I don't think this can be done using the python-enchant API, but we can skip all capitalized words.  I'll experiment with this.

Comment 4 Nicolas Mailhot 2010-02-22 19:52:07 UTC

(In reply to comment #3)
> (In reply to comment #2)
> > rpmlint /tmp/*rpm |grep spelling |sort |uniq
> [...]
> 
> None of these seem to be reports of a "component" of the name of the package in
> question being flagged as a misspelling.  But never mind.

It is still upstream names (mainly the author names, not the component names, though that was the case for Göschen) that seem to be caught most often.

> > (random selection, culling all the duplicates)
> 
> It's a bug if exact duplicate messages are emitted.  Could you check your
> random selection if there are any?

Will do as soon as the multi-hour rpmlint-using script which is running today is finished. Otherwise it will wedge the results.

Without being able to test, I think that what happens most often is that packages with multiple subpackages all share parts of the same description, so rpmlint *rpm in the build dir will report the same bogus spellcheck errors many times over, and hide other messages in the mass.

Comment 5 Ville Skyttä 2010-02-23 17:24:01 UTC

(In reply to comment #3)
> (In reply to comment #2)
> 
> > Do not check capitalized words not at the beginning of a sentence
> 
> I don't think this can be done using the python-enchant API, but we can skip
> all  capitalized words.  I'll experiment with this.

I was wrong: it can be done with the python-enchant API and, as expected, it does reduce noise significantly, at the expense of missing a few positives that should be flagged and were flagged before.  But I think it's an improvement overall, and it is committed upstream now.
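
For illustration, here is a minimal sketch of the heuristic itself: skip capitalized words unless they start a sentence. It uses plain regular expressions rather than the python-enchant API mentioned above, so it is not the upstream change; `words_to_check` and the naive sentence splitting are assumptions made for this example.

```python
import re

# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
# Runs of letters only (Unicode-aware, so "Göschen" is one word)
WORD = re.compile(r"[^\W\d_]+")


def words_to_check(text):
    """Return the words that would still be passed to the spell checker."""
    selected = []
    for sentence in SENTENCE_SPLIT.split(text.strip()):
        for index, word in enumerate(WORD.findall(sentence)):
            # A capitalized word in the middle of a sentence is assumed
            # to be a name and is skipped.
            if index > 0 and word[0].isupper():
                continue
            selected.append(word)
    return selected


sample = ("This font was designed by Georg Joachim Göschen. "
          "Didot inspired the tipeface.")
checked = words_to_check(sample)
```

With this sample, "Georg", "Joachim" and "Göschen" are skipped, while the sentence-initial "Didot" and the genuine typo "tipeface" are still checked. The trade-off noted above also shows up: a misspelled capitalized word in the middle of a sentence would be skipped as well.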

(In reply to comment #4)
> Without being able to test, I think that what happens most often is that
> packages with multiple subpackages all share parts of the same description, so
> rpmlint *rpm in the build dir will report the same bogus spellcheck errors many
> times over

The scope of dupe avoidance could in theory be extended to a common SRPM when multiple packages are being checked, but I'm not quite convinced that it's necessarily a good thing or worth the trouble.
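
As a rough sketch of what such consolidation could look like outside rpmlint, the following post-processes rpmlint output and groups identical spelling warnings across packages. It assumes the message format shown in comment 2; the `consolidate` helper and the subpackage names in the sample are hypothetical.

```python
import re
from collections import defaultdict

# Matches e.g. "pkg.src: W: spelling-error %description -l en_US word -> a, b"
LINE_RE = re.compile(r"^(?P<pkg>\S+): W: spelling-error (?P<rest>.+)$")


def consolidate(lines):
    """Group spelling warnings by (tag, word); list the packages for each."""
    grouped = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # leave every non-spelling message alone
        # Everything before " -> " is the tag followed by the flagged word.
        head = match.group("rest").split(" -> ", 1)[0].split()
        tag, word = " ".join(head[:-1]), head[-1]
        if match.group("pkg") not in grouped[(tag, word)]:
            grouped[(tag, word)].append(match.group("pkg"))
    return dict(grouped)


# Hypothetical subpackages sharing one description.
sample = [
    "mplus-1c-fonts.noarch: W: spelling-error %description -l en_US fullwidth -> full width, full-width, Fullerton",
    "mplus-2c-fonts.noarch: W: spelling-error %description -l en_US fullwidth -> full width, full-width, Fullerton",
    "mplus-1c-fonts.noarch: W: spelling-error Summary(en_US) superfamily -> super family, super-family, superficially",
    "mplus-1c-fonts.noarch: W: no-documentation",
]
summary = consolidate(sample)
```

Grouping by both tag and word keeps warnings from different tags separate, in line with the concern that filtering across tags would be misleading.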

Comment 6 Nicolas Mailhot 2010-02-23 18:20:30 UTC

As long as the noise is reduced to reasonable levels, I agree it's not worth the trouble. To reduce it further, the en_US dictionary needs to be corrected, and I have not the faintest idea where such correction requests can be sent.

Comment 7 Ville Skyttä 2010-02-23 19:53:59 UTC

"man enchant ; cat /usr/share/enchant/enchant.ordering" gives some hints.  AFAIU the "myspell" they talk about means hunspell in Fedora.

Comment 8 Nicolas Mailhot 2010-03-03 21:38:31 UTC

Thanks a lot

Comment 9 Fedora Update System 2010-03-06 20:14:55 UTC

rpmlint-0.95-2.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/rpmlint-0.95-2.fc13

Comment 10 Fedora Update System 2010-03-06 20:16:15 UTC

rpmlint-0.95-2.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/rpmlint-0.95-2.fc12

Comment 11 Fedora Update System 2010-03-11 13:31:12 UTC

rpmlint-0.95-2.fc13 has been pushed to the Fedora 13 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 12 Fedora Update System 2010-03-12 04:27:58 UTC

rpmlint-0.95-2.fc12 has been pushed to the Fedora 12 stable repository.  If problems still persist, please make note of it in this bug report.
