Bug 109200 - Need better handling of stopwords and wildcard expansion for Intermedia search
Status: CLOSED WONTFIX
Product: Red Hat Web Application Framework
Classification: Retired
Component: other
Version: nightly
Hardware: All Linux
Priority: medium Severity: medium
Assigned To: ccm-bugs-list
QA Contact: Jon Orris
Reported: 2003-11-05 11:24 EST by Scott Seago
Modified: 2007-04-18 12:59 EDT (History)
Doc Type: Bug Fix
Last Closed: 2006-03-09 10:35:50 EST
Description Scott Seago 2003-11-05 11:24:04 EST
Description of problem:
Need better handling of stopwords and wildcard expansion for
Intermedia search.

This is only an issue if wildcard expansion of query words is used.
For the London 5.2 branch, we added % to the end of query words. It
looks like the current implementation doesn't do the wildcard part,
but since the Rickshaw implementation will eventually do this, you'll
probably need to do something similar.

From the London 5.2 checkin comment:

Change 37698 by scott@sseago-london-camden on 2003/11/05 11:03:27

Strange things happen when we try to do wildcard searches on
stopwords. A search including the term "at" will simply ignore the
term. But with "at%", the expansion will occur, and "at" itself
will be removed. So if the user searches for "foo at" and we use
wildcards (and thus "foo% AND at%"), then "foo attention" will match,
but "foo at" will fail. We need to add the full default stoplist to
the list of words to escape (and not apply wildcards to). Strictly
speaking, we don't need to escape stopwords which aren't keywords,
but there's no harm in doing so, and we don't need to keep two
different lists. I've included the full default English stoplist in
the array:
http://download-west.oracle.com/docs/cd/B10501_01/text.920/a96518/astopsup.htm#43324
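
For illustration only, the approach described in the checkin comment
might look roughly like the sketch below. It is not the London 5.2
code. The function name, the abbreviated stopword sample, and the use
of curly braces as the escape syntax are all assumptions made for this
example; the real array holds the full default English stoplist.

```python
# Hypothetical sketch, not the actual London 5.2 implementation.
# SAMPLE_STOPWORDS is a small illustrative sample, not the full default
# English stoplist referenced above.
SAMPLE_STOPWORDS = {"and", "at", "it", "the"}


def build_contains_query(user_input, stopwords=SAMPLE_STOPWORDS):
    """Build an AND query, adding a trailing % to every non-stopword."""
    terms = []
    for word in user_input.split():
        if word.lower() in stopwords:
            # Stopword: escape it (assumed here to be {word}) and do NOT
            # append a wildcard, so no expansion takes place.
            terms.append("{" + word + "}")
        else:
            # Ordinary word: append the wildcard as London 5.2 does.
            terms.append(word + "%")
    return " AND ".join(terms)


print(build_contains_query("foo at"))       # foo% AND {at}
print(build_contains_query("IT Policies"))  # {IT} AND Policies%
```

The sketch does not handle words that themselves contain a closing
brace, and it inherits the limitation noted below: the list is fixed
and English-only.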


In reality, this is not an ideal approach. It only works for an
English database, as the default stoplist is different depending on
the language settings. In addition, stopwords can be added or removed
from the stoplist. Ideally we'd be able to query Oracle for the
currently active stoplist, although I don't know if this is possible.
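
One possible direction, offered as an unverified sketch: Oracle Text
documents a CTX_STOPWORDS data dictionary view, and reading it might
replace the hard-coded array. The column names (SPW_STOPLIST, SPW_WORD),
the privileges needed to read the view, and the way the stoplist name
for a given index is determined are assumptions that need checking
against the Oracle release in use. The function name is hypothetical.

```python
def load_stopwords(cursor, stoplist_name):
    """Return the words of one stoplist as a lowercase set.

    `cursor` is any DB-API 2.0 cursor that accepts named bind variables
    passed as a dict. The view and column names below are assumptions
    to verify against the installed Oracle Text version.
    """
    cursor.execute(
        "SELECT spw_word FROM ctx_stopwords WHERE spw_stoplist = :name",
        {"name": stoplist_name},
    )
    # Skip NULL words defensively and normalize case for lookups.
    return {row[0].lower() for row in cursor.fetchall() if row[0]}
```

The returned set could then be used in place of a fixed English list
when deciding which query words to escape.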


Version-Release number of selected component (if applicable):
London 5.2; will also apply to Rickshaw (and possibly Troika) if
wildcards are added.

How reproducible:
always, if wildcards are active

Steps to Reproduce:
1. Implement wildcard query expansion (if not on London 5.2)
2. Create an item with a title of "IT Policies". Make sure there are
no other words in this item beginning with "it..."
3. Create another item "Fear itself"
4. After the index update, a search for "IT Policies" will not find
anything. A search for "Policies" will find the "IT Policies" document.
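
The steps above can be mimicked with a toy model. This is not Oracle
Text and makes no claim about its internals; it only encodes the
behavior described in this report (stopwords are not indexed, a plain
stopword in a query is ignored, and a wildcard term expands against
indexed words only). All names and the two-word stopword sample are
assumptions.

```python
STOPWORDS = {"at", "it"}  # illustrative sample only


def index_docs(docs):
    """Map each non-stopword token to the ids of documents containing it."""
    index = {}
    for doc_id, title in docs.items():
        for token in title.lower().split():
            if token not in STOPWORDS:
                index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query, all_ids):
    """AND the terms together; a trailing % expands against indexed tokens."""
    result = set(all_ids)
    for term in query.lower().split():
        if term.endswith("%"):
            prefix = term[:-1]
            matches = set()
            for token, ids in index.items():
                if token.startswith(prefix):
                    matches |= ids
        elif term in STOPWORDS:
            continue  # a plain stopword is simply ignored
        else:
            matches = index.get(term, set())
        result &= matches
    return result


docs = {1: "IT Policies", 2: "Fear itself"}
index = index_docs(docs)
print(search(index, "it% policies%", docs))  # set(): it% expands only to "itself"
print(search(index, "policies%", docs))      # {1}
print(search(index, "it policies%", docs))   # {1}: stopword left unexpanded
```

In this model the wildcard on "it" is what makes the query fail, which
matches the observation that leaving stopwords unexpanded avoids the
problem.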
