Bug 109200 - Need better handling of stopwords and wildcard expansion for Intermedia search
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Web Application Framework
Classification: Retired
Component: other
Version: nightly
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: ccm-bugs-list
QA Contact: Jon Orris
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2003-11-05 16:24 UTC by Scott Seago
Modified: 2007-04-18 16:59 UTC (History)
0 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2006-03-09 15:35:50 UTC
Embargoed:



Description Scott Seago 2003-11-05 16:24:04 UTC
Description of problem:
Need better handling of stopwords and wildcard expansion for
Intermedia search.

This is only an issue if wildcard expansion of query words is used.
For the London 5.2 branch, we added % to the end of query words. It
looks like the current implementation doesn't do the wildcard part,
but since the Rickshaw implementation will eventually do this, you'll
probably need to do something similar.

From the London 5.2 checkin comment:

Change 37698 by scott@sseago-london-camden on 2003/11/05 11:03:27

        Strange things happen when we try to do wildcard searches on
stopwords. A search including the term "at" will simply ignore the
term. But with "at%", the expansion will occur, and "at"
will be removed. So if the user searches for "foo at", if we use
wildcards (and thus "foo% AND at%"), then "foo attention" will match,
but "foo at" will fail. We need to add the full default stoplist to
the list of words to escape (and not apply wildcards). Strictly
speaking, we don't need to escape stopwords which aren't keywords,
but there's no harm in doing so, and we don't need to keep two
different lists. I've included the full default English stoplist in
the array:
http://download-west.oracle.com/docs/cd/B10501_01/text.920/a96518/astopsup.htm#43324
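
For illustration only, the approach in the checkin comment might look
roughly like the sketch below. It is not the code from change 37698:
the function name build_contains_query, the abbreviated STOPWORDS set,
and the use of Oracle Text brace escaping ({word}) are all assumptions
made for this example.

```python
import re

# Illustrative subset only; the checkin used the full default
# English stoplist from the Oracle documentation linked above.
STOPWORDS = {"a", "an", "and", "at", "in", "it", "of", "the", "to"}


def build_contains_query(user_input, stopwords=STOPWORDS):
    """Build an AND query, adding % only to non-stopword terms.

    Hypothetical helper: stopwords are wrapped in braces (escaped)
    and never receive a wildcard, so they are not expanded.
    """
    terms = []
    for word in re.findall(r"\w+", user_input):
        if word.lower() in stopwords:
            # Escape the stopword and skip wildcard expansion.
            terms.append("{" + word + "}")
        else:
            # Ordinary query word: append the trailing wildcard.
            terms.append(word + "%")
    return " AND ".join(terms)


print(build_contains_query("foo at"))
print(build_contains_query("IT Policies"))
```

Keeping a single list for both escaping and wildcard suppression
matches the comment's point that two separate lists are unnecessary.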


In reality, this is not an ideal approach. It only works for an
English database, as the default stoplist is different depending on
the language settings. In addition, stopwords can be added or removed
from the stoplist. Ideally we'd be able to query Oracle for the
currently active stoplist, although I don't know if this is possible.


Version-Release number of selected component (if applicable):
5.2; will be applicable to Rickshaw (and possibly Troika) if wildcards
are added.

How reproducible:
always, if wildcards are active

Steps to Reproduce:
1. Implement wildcard query expansion (if not on London 5.2)
2. Create an item with a title of "IT Policies". Make sure there are
no other words in this item beginning with "it..."
3. Create another item with a title of "Fear itself"
4. After index update, a search for "IT Policies" will not find
anything. A search for "Policies" will find the "IT Policies" document.
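
The steps above can be mimicked with a small toy model. This is not
Intermedia itself, only a sketch of the behaviour described in this
report: stopwords are assumed never to be indexed, a plain query
ignores stopword terms, and a wildcard term matches only indexed
words with that prefix. The names STOPWORDS, index_tokens and search
are made up for the example.

```python
STOPWORDS = {"at", "it"}  # tiny illustrative stoplist


def index_tokens(title):
    # Model assumption: stopwords never make it into the index.
    return {w for w in title.lower().split() if w not in STOPWORDS}


def search(query, titles, wildcard):
    """Return titles matching every query word (AND semantics)."""
    hits = []
    for title in titles:
        tokens = index_tokens(title)
        matched = True
        for word in query.lower().split():
            if wildcard:
                # "word%" matches only indexed words with this prefix;
                # the bare stopword is not among them.
                if not any(t.startswith(word) for t in tokens):
                    matched = False
            elif word in STOPWORDS:
                # Plain query: a stopword term is simply ignored.
                continue
            elif word not in tokens:
                matched = False
        if matched:
            hits.append(title)
    return hits


items = ["IT Policies", "Fear itself"]
print(search("IT Policies", items, wildcard=True))
print(search("Policies", items, wildcard=True))
print(search("IT Policies", items, wildcard=False))
```

In this model the wildcard search for "IT Policies" returns nothing,
while "Policies" alone, or the same query without wildcards, finds
the "IT Policies" item.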

