Bug 460613 - Approximate Search '~=' Returns unexpected result
Summary: Approximate Search '~=' Returns unexpected result
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: 389
Classification: Retired
Component: Search Engine
Version: 1.0.4
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Noriko Hosoi
QA Contact: Chandrasekar Kannan
URL:
Whiteboard:
Depends On:
Blocks: 249650 FDS1.2.0
Reported: 2008-08-29 01:14 UTC by Jie
Modified: 2015-01-04 23:33 UTC (History)

Fixed In Version: 8.1
Clone Of:
Environment:
Last Closed: 2009-04-29 23:06:19 UTC
Embargoed:


Attachments:
Sample data for re-producing bug (3.70 KB, text/plain), 2008-08-29 01:16 UTC, Jie
cvs diff ldapserver/ldap/servers/plugins/syntaxes/phonetic.c (614 bytes, patch), 2009-01-12 19:09 UTC, Noriko Hosoi
cvs commit message (887 bytes, text/plain), 2009-01-12 19:19 UTC, Noriko Hosoi

Description Jie 2008-08-29 01:14:29 UTC
Description of problem:

Approximate Search '~=' Returns unexpected result

Version-Release number of selected component (if applicable):

Fedora Directory Server v1.0.4-1.RHEL4.x86_64 as well as v1.0.4-1.RHEL4.i386


How reproducible:


Steps to Reproduce:

1. Import the attached test.ldif file into the directory server.
2. Perform the following LDAP query:

Search DN:  ou=Supplier,dc=evalua,dc=com,dc=au
Filter:     description ~= queensland

Actual results:

Two objects are returned by the search:

 1: ou=B Incorrect Match,ou=Supplier,dc=evalua,dc=com,dc=au
 2: ou=C Correct Match,ou=Supplier,dc=evalua,dc=com,dc=au


Expected results:

Only the second object, "ou=C Correct Match,ou=Supplier,dc=evalua,dc=com,dc=au", should have been returned, because its description is "Queensland".

Additional info:

The test.ldif file contains three "Supplier" records, each of which has a different description attribute.  The descriptions are: "Foo Bar", "Consulting", and "Queensland".

It appears that the approximate filter is equating "consulting" and "queensland" as being approximately the same when they clearly are not.

Note that the reverse also holds: an approximate search for "consulting" matches "queensland".

Comment 1 Jie 2008-08-29 01:16:11 UTC
Created attachment 315316
Sample data for re-producing bug

Comment 2 Noriko Hosoi 2009-01-09 23:55:59 UTC
The analysis for the approximate filter is done in the "phonetic" function implemented in ldapserver/ldap/servers/plugins/syntaxes/phonetic.c.  The file contains 2 implementations; the one currently in use is METAPHONE.  With that function, "queensland" is converted to "KNSL", and so is "consulting".  Since the converted strings match, the two words are considered to "sound like" each other.

As you pointed out, the algorithm has room for improvement.  But it is useful, for instance, for catching typos like these:
california ==> KLFR
californa ==> KLFR
califrnia ==> KLFR
callifornia ==> KLFR

Comment 3 Noriko Hosoi 2009-01-12 18:41:39 UTC
After discussing this with the team, we concluded that the maximum length of the "phonetic" string to be examined should be increased.  Reopening this bug.

Rich Megginson wrote:
> Noriko Hosoi wrote:
>> I investigated this bug: [Bug 460613] Approximate Search '~=' Returns unexpected result, and concluded it's not a bug.  The approximate filter is working as designed.  So, I closed the bug.
>>
>> After changing the status to "NOTABUG", I ran one more test allowing a longer "approximate" string.  As an experiment, I increased the length from the original 4 to 6.
>>
>> #ifndef MAXPHONEMELEN
>> #define MAXPHONEMELEN        6    /* was 4 */
>> #endif
>>
>> As expected, the 2 sample strings given by the reporter were then converted into different "phonetic" strings.
>>
>> $ phonetic queensland
>> queensland ==> KNSLNT
>> $ phonetic consulting
>> consulting ==> KNSLTN
>>
>> Now I'm wondering whether the length 4 is too short to cover longer words, or whether it's good enough for the current purpose.
> Seems like it should be higher - seems to me that the words you would want to use it on would be longer words
> If you change it to 6, does it still work correctly on short words?  Does it break anything to make it longer (e.g. do we have static buffers lurking anywhere)?

It does not affect the results for short words.

Comment 4 Noriko Hosoi 2009-01-12 19:09:11 UTC
Created attachment 328770
cvs diff ldapserver/ldap/servers/plugins/syntaxes/phonetic.c

Change description: increase the maximum length of the "phonetic" string from 4 to 6.  A length of 4 is sometimes too short to distinguish long words.  For instance, with no length limit the sample string Queensland is converted to KNSLNT and Consulting to KNSLTNK.  Cut off after the 4th character, both become KNSL, so the 2 strings are considered to sound like each other.

By increasing the length, we can distinguish "consulting" from "queensland".
$ ldapsearch -b "dc=example,dc=com"  "givenname~=Consulting" givenname
dn: uid=CHopper9999, ou=Product Testing, dc=example,dc=com
givenname: Consulting

$ ldapsearch -b "dc=example,dc=com"  "givenname~=Queensland" givenname
dn: uid=QLabrador9998, ou=Payroll, dc=example,dc=com
givenname: Queensland

Theoretically, a longer maximum length is better, but if an approximate search is run against attributes that have no approximate index, the "phonetic" function is run for every single attribute value in the server, which is extremely expensive.  Weighing the quality of the approximate search against performance, a maximum length of 6 should be reasonable.

Comment 5 Noriko Hosoi 2009-01-12 19:19:34 UTC
Created attachment 328772
cvs commit message

Reviewed by Rich (Thank you!!)

Checked into CVS HEAD.

Comment 6 Jenny Severance 2009-03-20 17:44:32 UTC
Fix verified on RHEL 4 - DS 8.1

-bash-3.00# /usr/lib64/mozldap6/ldapsearch -h `hostname` -p 389 -D "cn=Directory Manager" -w Secret123 -b "ou=Supplier,dc=example,dc=com" "(description~=queensland)" 
version: 1
dn: ou=C Correct Match,ou=Supplier,dc=example,dc=com
description: Queensland
objectClass: top
objectClass: organizationalunit
ou: C Correct Match

Comment 7 Chandrasekar Kannan 2009-04-29 23:06:19 UTC
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0455.html

