Description of problem:
Approximate Search '~=' Returns unexpected result
Version-Release number of selected component (if applicable):
Fedora Directory Server v1.0.4-1.RHEL4.x86_64 as well as v1.0.4-1.RHEL4.i386
Steps to Reproduce:
1. Import the attached test.ldif file into the directory server
2. Perform the following LDAP query:
Search DN: ou=Supplier,dc=evalua,dc=com,dc=au
Filter: description ~= queensland
Two objects are returned by the search:
1: ou=B Incorrect Match,ou=Supplier,dc=evalua,dc=com,dc=au
2: ou=C Correct Match,ou=Supplier,dc=evalua,dc=com,dc=au
Only the second object, "ou=C Correct Match,ou=Supplier,dc=evalua,dc=com,dc=au", should have been returned, as its description is "queensland".
The test.ldif file contains three "Supplier" records, each of which has a different description attribute. The descriptions are: "Foo Bar", "Consulting", and "Queensland".
It appears that the approximate filter is treating "consulting" and "queensland" as approximately the same when they clearly are not.
Note that the reverse also holds: an approximate search for "consulting" matches "queensland".
Created attachment 315316
Sample data for re-producing bug
The analysis for the approximate filter is done in the function implemented in ldapserver/ldap/servers/plugins/syntaxes/phonetic.c. The file contains 2 implementations; currently, one of them, METAPHONE, is being used. With that function, "queensland" is converted to "KNSL", and so is "consulting". Since the two codes match, the words are considered to "sound like" each other.
As you pointed out, the algorithm has some room for improvement. But it is useful, for instance, for catching typos like these:
california ==> KLFR
californa ==> KLFR
califrnia ==> KLFR
callifornia ==> KLFR
After a discussion with the team, we concluded that we should increase the maximum length of the "phonetic" string that is examined. Reopening this bug.
Rich Megginson wrote:
> Noriko Hosoi wrote:
>> I investigated this bug: [Bug 460613] Approximate Search '~=' Returns unexpected result, and concluded it's not a bug. The approximate filter is working as designed. So, I closed the bug.
>> After changing the status to "NOTABUG", I ran one more test allowing a longer "approximate" string. As an experiment, I increased the length from the original 4 to 6.
>> #ifndef MAXPHONEMELEN
>> #define MAXPHONEMELEN 4 ==> 6
>> #endif
>> Then, sure enough, the 2 sample strings given by the reporter were converted into different "phonetic" strings.
>> $ phonetic queensland
>> queensland ==> KNSLNT
>> $ phonetic consulting
>> consulting ==> KNSLTN
>> Now I'm wondering if the length 4 is too short to cover words? Or it's good enough for the current purpose?
> Seems like it should be higher - seems to me that the words you would want to use it on would be longer words
> If you change it to 6, does it still work correctly on short words? Does it break anything to make it longer (e.g. do we have static buffers lurking anywhere)?
It does not affect the results for the short words.
Created attachment 328770
cvs diff ldapserver/ldap/servers/plugins/syntaxes/phonetic.c
Change description: increase the maximum length of the "phonetic" string from 4 to 6. A length of 4 is sometimes too short to distinguish long words. For instance, with no length limit, the sample string Queensland is converted to KNSLNT and Consulting to KNSLTNK. Cut off after the 4th character, both become KNSL, so the 2 strings are considered to sound like each other.
By increasing the length, we can distinguish "consulting" from "queensland".
$ ldapsearch -b "dc=example,dc=com" "givenname~=Consulting" givenname
dn: uid=CHopper9999, ou=Product Testing, dc=example,dc=com
$ ldapsearch -b "dc=example,dc=com" "givenname~=Queensland" givenname
dn: uid=QLabrador9998, ou=Payroll, dc=example,dc=com
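The effect of the length limit can be sketched in a few lines of C. This is only an illustration, not the code in phonetic.c: the helper names truncate_code and sounds_alike are hypothetical, and the full codes KNSLNT and KNSLTNK are simply the values quoted in the change description, not recomputed here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from phonetic.c): copy at most maxlen
 * characters of an already computed phonetic code into out.
 */
void truncate_code(const char *full, size_t maxlen, char *out, size_t outsize)
{
    size_t n = strlen(full);

    assert(outsize > 0);            /* need room for the terminator */
    if (n > maxlen)
        n = maxlen;                 /* apply the phonetic length limit */
    if (n > outsize - 1)
        n = outsize - 1;            /* never overrun the output buffer */
    memcpy(out, full, n);
    out[n] = '\0';
}

/*
 * Hypothetical helper: two words "sound alike" when their codes are
 * identical after truncation to maxlen characters.
 */
int sounds_alike(const char *code_a, const char *code_b, size_t maxlen)
{
    char a[32], b[32];

    truncate_code(code_a, maxlen, a, sizeof a);
    truncate_code(code_b, maxlen, b, sizeof b);
    return strcmp(a, b) == 0;
}
```

With these helpers, sounds_alike("KNSLNT", "KNSLTNK", 4) compares KNSL with KNSL and reports a match, while sounds_alike("KNSLNT", "KNSLTNK", 6) compares KNSLNT with KNSLTN and does not.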
Theoretically, a longer maximum length is better, but if an approximate search is run against attributes that have no approximate index, the "phonetic" function is run for every single attribute value in the server, which is extremely expensive. Weighing the quality of the approximate search against performance, a maximum length of 6 should be reasonable.
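To make the cost argument concrete, here is a rough sketch of what an unindexed scan amounts to: one encoder call for the filter value plus one per candidate value. Everything in it is an assumption for illustration. toy_encode is a deliberately trivial stand-in (it just drops vowels and non-letters) and is NOT the metaphone code in phonetic.c, and TOY_MAXLEN, encode_calls and scan_unindexed are made-up names, not server APIs.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAXLEN 6    /* mirrors the proposed maximum length of 6 */

/* Counts encoder invocations; exists only for this illustration. */
unsigned long encode_calls = 0;

/*
 * Toy stand-in for a phonetic encoder. NOT metaphone: it keeps the
 * uppercased consonants of the word, up to TOY_MAXLEN characters.
 */
void toy_encode(const char *word, char out[TOY_MAXLEN + 1])
{
    size_t n = 0;

    encode_calls++;
    for (; *word != '\0' && n < TOY_MAXLEN; word++) {
        unsigned char c = (unsigned char)*word;

        if (!isalpha(c))
            continue;               /* skip spaces, digits, punctuation */
        c = (unsigned char)toupper(c);
        if (strchr("AEIOU", c) != NULL)
            continue;               /* skip vowels */
        out[n++] = (char)c;
    }
    out[n] = '\0';
}

/*
 * Unindexed scan: encode the filter value once, then encode and
 * compare every candidate value. Returns the number of matches.
 */
size_t scan_unindexed(const char *filter, const char *const *values,
                      size_t nvalues)
{
    char want[TOY_MAXLEN + 1], got[TOY_MAXLEN + 1];
    size_t hits = 0;

    assert(filter != NULL);
    toy_encode(filter, want);
    for (size_t i = 0; i < nvalues; i++) {
        toy_encode(values[i], got);
        if (strcmp(want, got) == 0)
            hits++;
    }
    return hits;
}
```

The encoder runs once per value scanned, and each call does work proportional to the code length it has to produce, which is why the maximum length is a trade-off rather than "longer is always better".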
Created attachment 328772
cvs commit message
Reviewed by Rich (Thank you!!)
Checked into CVS HEAD.
Fix verified on RHEL 4 - DS 8.1
-bash-3.00# /usr/lib64/mozldap6/ldapsearch -h `hostname` -p 389 -D "cn=Directory Manager" -w Secret123 -b "ou=Supplier,dc=example,dc=com" "(description~=queensland)"
dn: ou=C Correct Match,ou=Supplier,dc=example,dc=com
ou: C Correct Match
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.