Bug 460613
Summary: Approximate Search '~=' Returns unexpected result
Product: [Retired] 389
Component: Search Engine
Version: 1.0.4
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: Jie <suke04>
Assignee: Noriko Hosoi <nhosoi>
QA Contact: Chandrasekar Kannan <ckannan>
CC: benl, jgalipea, nhosoi, nkinder, rmeggins
Keywords: Reopened
Fixed In Version: 8.1
Doc Type: Bug Fix
Last Closed: 2009-04-29 23:06:19 UTC
Bug Blocks: 249650, 493682
Description
Jie
2008-08-29 01:14:29 UTC
Created attachment 315316 [details]
Sample data for re-producing bug
The analysis for the approximate filter is done in the function implemented in ldapserver/ldap/servers/plugins/syntaxes/phonetic.c. There are 2 implementations in the file. Currently, one of them, METAPHONE, is being used. Using the function, "queensland" is converted to "KNSL", and so is "consulting". Since the results match, the two words are considered to "sound like" each other. As you pointed out, the algorithm has room for improvement. But it's useful, for instance, for catching typos like these:
california ==> KLFR
californa ==> KLFR
califrnia ==> KLFR
callifornia ==> KLFR
I had a discussion with the team, and we concluded that we should increase the "phonetic" length to be examined. Reopening this bug.
Rich Megginson wrote:
> Noriko Hosoi wrote:
>> I investigated this bug: [Bug 460613] Approximate Search '~=' Returns unexpected result, and concluded it's not a bug. The approximate filter is working as designed. So, I closed the bug.
>>
>> After changing the status to "NOTABUG", I ran one more test allowing longer "approximate" string. I experimentally increased it to 6 from the original length 4.
>>
>> #ifndef MAXPHONEMELEN
>> #define MAXPHONEMELEN 4 ==> 6
>> #endif
>>
>> Then, surely, the 2 sample strings given by the reporter were converted into the different "phonetic" strings.
>>
>> $ phonetic queensland
>> queensland ==> KNSLNT
>> $ phonetic consulting
>> consulting ==> KNSLTN
>>
>> Now I'm wondering if the length 4 is too short to cover words? Or it's good enough for the current purpose?
> Seems like it should be higher - seems to me that the words you would want to use it on would be longer words
> If you change it to 6, does it still work correctly on short words? Does it break anything to make it longer (e.g. do we have static buffers lurking anywhere)?
It does not affect the results for the short words.
Created attachment 328770 [details]
cvs diff ldapserver/ldap/servers/plugins/syntaxes/phonetic.c
Change description: Increase the maximum length of the "phonetic" string from 4 to 6. The length 4 is sometimes too short to distinguish long words. For instance, with no length limit the sample string Queensland is converted to KNSLNT and Consulting to KNSLTNK. Truncated to 4 characters, both become KNSL, so the 2 strings are considered to sound like each other.
By increasing the length, we can distinguish "consulting" from "queensland".
$ ldapsearch -b "dc=example,dc=com" "givenname~=Consulting" givenname
dn: uid=CHopper9999, ou=Product Testing, dc=example,dc=com
givenname: Consulting
$ ldapsearch -b "dc=example,dc=com" "givenname~=Queensland" givenname
dn: uid=QLabrador9998, ou=Payroll, dc=example,dc=com
givenname: Queensland
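The effect of the length limit can be sketched in a few lines of C. This is only an illustration of the truncation step, not the code in phonetic.c: the function names truncate_code and sounds_alike are hypothetical, and the full phonetic codes are passed in as plain strings (taken from this report) rather than computed by Metaphone.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy at most max_len characters of a full phonetic code into out.
 * This mimics only the length limit; it does not implement Metaphone. */
void truncate_code(const char *full_code, size_t max_len,
                   char *out, size_t out_size)
{
    size_t n = strlen(full_code);

    assert(out_size > 0);
    if (n > max_len)
        n = max_len;
    if (n > out_size - 1)
        n = out_size - 1;   /* never overrun the caller's buffer */
    memcpy(out, full_code, n);
    out[n] = '\0';
}

/* Return 1 if the two codes are equal after truncation to max_len, else 0. */
int sounds_alike(const char *code_a, const char *code_b, size_t max_len)
{
    char a[32], b[32];

    truncate_code(code_a, max_len, a, sizeof a);
    truncate_code(code_b, max_len, b, sizeof b);
    return strcmp(a, b) == 0;
}
```

With the codes quoted above, sounds_alike("KNSLNT", "KNSLTNK", 4) returns 1 because both truncate to KNSL, while sounds_alike("KNSLNT", "KNSLTNK", 6) returns 0 because KNSLNT and KNSLTN differ. A code that is already shorter than the limit is copied unchanged, which is consistent with short words not being affected.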
Theoretically, a longer maximum length is better, but if an approximate search is run against attributes that have no approximate index, the "phonetic" function is run for every single attribute value in the server, which is extremely expensive. Weighing the quality of the approximate search against the performance, a maximum length of 6 should be reasonable.
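To make the cost argument concrete, here is a rough sketch of an unindexed approximate scan. Everything in it is an assumption for illustration: toy_key is a hypothetical stand-in that keeps up to TOY_MAXLEN uppercase consonants (it is not Metaphone), and scan_unindexed and key_calls are made-up names, not server APIs. The point is only that one key is computed per stored value, so the work grows with the number of values and with the key length.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAXLEN 6            /* mirrors the new limit discussed here */

unsigned long key_calls = 0;    /* counts how many keys were computed */

/* Hypothetical stand-in for the phonetic function: keeps up to TOY_MAXLEN
 * uppercase consonants of the word. It is NOT Metaphone. */
void toy_key(const char *word, char out[TOY_MAXLEN + 1])
{
    size_t n = 0;

    key_calls++;
    for (; *word != '\0' && n < TOY_MAXLEN; word++) {
        int c = toupper((unsigned char)*word);
        if (isalpha(c) && strchr("AEIOU", c) == NULL)
            out[n++] = (char)c;
    }
    out[n] = '\0';
}

/* Unindexed search: compute one key for the filter, then one per value. */
size_t scan_unindexed(const char *filter,
                      const char *const *values, size_t count)
{
    char want[TOY_MAXLEN + 1], have[TOY_MAXLEN + 1];
    size_t matches = 0;

    toy_key(filter, want);
    for (size_t i = 0; i < count; i++) {
        toy_key(values[i], have);
        if (strcmp(want, have) == 0)
            matches++;
    }
    return matches;
}
```

Scanning N values costs N + 1 key computations in this sketch. With an approximate index the keys for stored values would already exist, so only the filter's key would need to be computed at search time.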
Created attachment 328772 [details]
cvs commit message
Reviewed by Rich (Thank you!!)
Checked into CVS HEAD.
Fix verified on RHEL 4 - DS 8.1.
-bash-3.00# /usr/lib64/mozldap6/ldapsearch -h `hostname` -p 389 -D "cn=Directory Manager" -w Secret123 -b "ou=Supplier,dc=example,dc=com" "(description~=queensland)"
version: 1
dn: ou=C Correct Match,ou=Supplier,dc=example,dc=com
description: Queensland
objectClass: top
objectClass: organizationalunit
ou: C Correct Match
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHEA-2009-0455.html