Bug 460613

Summary: Approximate Search '~=' Returns unexpected result
Product: [Retired] 389
Reporter: Jie <suke04>
Component: Search Engine
Assignee: Noriko Hosoi <nhosoi>
Status: CLOSED CURRENTRELEASE
QA Contact: Chandrasekar Kannan <ckannan>
Severity: medium
Priority: medium
Version: 1.0.4
CC: benl, jgalipea, nhosoi, nkinder, rmeggins
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version: 8.1
Doc Type: Bug Fix
Last Closed: 2009-04-29 23:06:19 UTC
Bug Blocks: 249650, 493682    
Attachments:
Sample data for re-producing bug (flags: none)
cvs diff ldapserver/ldap/servers/plugins/syntaxes/phonetic.c (flags: none)
cvs commit message (flags: none)

Description Jie 2008-08-29 01:14:29 UTC
Description of problem:

Approximate Search '~=' Returns unexpected result

Version-Release number of selected component (if applicable):

Fedora Directory Server v1.0.4-1.RHEL4.x86_64 as well as v1.0.4-1.RHEL4.i386


How reproducible:


Steps to Reproduce:

1. import the attached test.ldif file into directory server
2. perform the following LDAP query:

Search DN:  ou=Supplier,dc=evalua,dc=com,dc=au
Filter:     description ~= queensland

Actual results:

Two objects are returned by the search:

 1: ou=B Incorrect Match,ou=Supplier,dc=evalua,dc=com,dc=au
 2: ou=C Correct Match,ou=Supplier,dc=evalua,dc=com,dc=au


Expected results:

Only the second object, "ou=C Correct Match,ou=Supplier,dc=evalua,dc=com,dc=au", should have been returned, since its description is "Queensland".

Additional info:

The test.ldif file contains three "Supplier" records, each of which has a different description attribute.  The descriptions are: "Foo Bar", "Consulting", and "Queensland".

It appears that the approximate filter treats "consulting" and "queensland" as approximately the same when they clearly are not.

Note that the reverse also holds: an approximate search for "consulting" matches "queensland".

Comment 1 Jie 2008-08-29 01:16:11 UTC
Created attachment 315316 [details]
Sample data for re-producing bug

Comment 2 Noriko Hosoi 2009-01-09 23:55:59 UTC
The analysis for the approximate filter is done in a function implemented in ldapserver/ldap/servers/plugins/syntaxes/phonetic.c.  There are 2 implementations in the file; currently the METAPHONE one is being used.  With that function, "queensland" is converted to "KNSL", and so is "consulting".  Since the codes match, the two words are considered to "sound like" each other.

As you pointed out, the algorithm has room for improvement.  But it is useful, for instance, for catching typos like these:
california ==> KLFR
californa ==> KLFR
califrnia ==> KLFR
callifornia ==> KLFR

Comment 3 Noriko Hosoi 2009-01-12 18:41:39 UTC
I had a discussion with the team and concluded we should increase the "phonetic" length to be examined.  Reopening this bug.

Rich Megginson wrote:
> Noriko Hosoi wrote:
>> I investigated this bug: [Bug 460613] Approximate Search '~=' Returns unexpected result, and concluded it's not a bug.  The approximate filter is working as designed.  So, I closed the bug.
>>
>> After changing the status to "NOTABUG", I ran one more test allowing a longer "approximate" string.  I experimentally increased the maximum length from the original 4 to 6.
>>
>> #ifndef MAXPHONEMELEN
>> #define MAXPHONEMELEN        4 ==> 6
>> #endif
>>
>> Then, as expected, the 2 sample strings given by the reporter were converted into different "phonetic" strings.
>>
>> $ phonetic queensland
>> queensland ==> KNSLNT
>> $ phonetic consulting
>> consulting ==> KNSLTN
>>
>> Now I'm wondering whether the length 4 is too short to cover words, or whether it's good enough for the current purpose?
> Seems like it should be higher - seems to me that the words you would want to use it on would be longer words
> If you change it to 6, does it still work correctly on short words?  Does it break anything to make it longer (e.g. do we have static buffers lurking anywhere)?

It does not affect the results for the short words.

Comment 4 Noriko Hosoi 2009-01-12 19:09:11 UTC
Created attachment 328770 [details]
cvs diff ldapserver/ldap/servers/plugins/syntaxes/phonetic.c

Change description: increasing the maximum length of the "phonetic" string from 4 to 6.  A length of 4 is sometimes too short to distinguish long words.  For instance, with no length limit the sample string Queensland is converted to KNSLNT and Consulting to KNSLTNK.  Cutting both off after the 4th character leaves KNSL in each case, so the 2 strings are considered to sound like each other.

By increasing the length, we can distinguish "consulting" from "queensland".
$ ldapsearch -b "dc=example,dc=com"  "givenname~=Consulting" givenname
dn: uid=CHopper9999, ou=Product Testing, dc=example,dc=com
givenname: Consulting

$ ldapsearch -b "dc=example,dc=com"  "givenname~=Queensland" givenname
dn: uid=QLabrador9998, ou=Payroll, dc=example,dc=com
givenname: Queensland

Theoretically, a longer maximum length is better, but if an approximate search is run against attributes that have no approximate index, the "phonetic" function is run for every single attribute value in the server, which is extremely expensive.  Weighing the quality of the approximate search against the performance, a maximum length of 6 should be reasonable.
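
To illustrate the effect of the length limit described in this comment, here is a minimal sketch in C. It is not the code from phonetic.c: codes_match is a hypothetical helper, and it assumes the phonetic codes have already been computed without any limit (KNSLNT and KNSLTNK, as quoted above). It only models the truncation step.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (an assumption for illustration, not part of
 * phonetic.c): reports whether two phonetic codes are equal once each
 * is cut off after at most max_len characters. */
static bool codes_match(const char *a, const char *b, size_t max_len)
{
    assert(a != NULL && b != NULL);
    /* strncmp compares at most max_len characters and stops at the first
     * NUL, which is equivalent to comparing the two truncated codes. */
    return strncmp(a, b, max_len) == 0;
}
```

With the codes quoted above, codes_match("KNSLNT", "KNSLTNK", 4) is true because both truncate to KNSL, while codes_match("KNSLNT", "KNSLTNK", 6) is false because KNSLNT and KNSLTN differ at the 5th character.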

Comment 5 Noriko Hosoi 2009-01-12 19:19:34 UTC
Created attachment 328772 [details]
cvs commit message

Reviewed by Rich (Thank you!!)

Checked into CVS HEAD.

Comment 6 Jenny Severance 2009-03-20 17:44:32 UTC
Fix verified on RHEL 4 - DS 8.1

-bash-3.00# /usr/lib64/mozldap6/ldapsearch -h `hostname` -p 389 -D "cn=Directory Manager" -w Secret123 -b "ou=Supplier,dc=example,dc=com" "(description~=queensland)" 
version: 1
dn: ou=C Correct Match,ou=Supplier,dc=example,dc=com
description: Queensland
objectClass: top
objectClass: organizationalunit
ou: C Correct Match

Comment 7 Chandrasekar Kannan 2009-04-29 23:06:19 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0455.html