Bug 825202
Summary: | RFE: Translation Memory suggestions should not consider a portion of word. | ||
---|---|---|---|
Product: | [Retired] Zanata | Reporter: | Manoj Kumar Giri <mgiri> |
Component: | Usability | Assignee: | Sean Flanigan <sflaniga> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Ding-Yi Chen <dchen> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | ||
Version: | 1.6-SNAPSHOT | CC: | ankit, dchen, sflaniga, zanata-bugs |
Target Milestone: | --- | ||
Target Release: | 1.7 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Enhancement | |
Doc Text: |
Feature:
Translation memory scoring based on whole words
Reason:
Strings with partial word matches appear higher than expected in TM results because they are based on sequences of similar characters (trigrams), rather than entire words as is natural for human beings.
Result (if any):
Whole word matches will generally be shown before partial word matches. Note that partial word matches may still appear in translation memory results, pending RFE bug 845898.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2012-09-11 05:11:19 UTC | Type: | Bug |
Description
Manoj Kumar Giri
2012-05-25 11:04:20 UTC
We do need to improve the similarity scoring algorithm, and making it word-based should help (as long as we still have a way of fuzzy-matching similar words).

https://github.com/zanata/zanata/commit/4d1acb815875c8ca0bcf0d3b56dfd9d9a295bc77 implements word-based similarity scoring (excluding stop words). It doesn't try to deal with stemming or fuzzy word matching, but I don't think it would improve results enough to be worth the complexity.

Tested with Zanata version 1.8-SNAPSHOT (20120731-0025) and 1.7.1-SNAPSHOT (20120731-0013).
On 1.8, the search results are irrelevant to the search terms.
On 1.7, with any of fuzzy, phrase, or lucene: if you type "fault", "default" is also returned, with similarity 0.
Should we remove the results with similarity 0?

> On 1.8, the search results are irrelevant to the search terms.

I don't understand.

> On 1.7, with any of fuzzy, phrase, or lucene: if you type "fault", "default"
> is also returned, with similarity 0.
> Should we remove the results with similarity 0?

I don't think so, it's just an artefact of different scoring. Lucene is using fuzzy trigram similarity to return results, whereas the scoring is based on words (excluding stop words). 0 doesn't really mean there is no similarity, it just means the word score can't measure it.

(In reply to comment #4)
> > On 1.8, the search results are irrelevant to the search terms.
>
> I don't understand.

For example, when typing 'default' into the translation memory search, the 2nd and 3rd results do not contain "default".

Tested with Zanata version 1.8-SNAPSHOT (20120802-0024): reindexing does not help.

REASSIGNED

Note that this RFE is only about changing the scoring algorithm. It doesn't change which TM results are returned, only their sort order and the similarity score shown against them.

Ding, could you please put in a separate bug for the problem you've found with TM results?

(In reply to comment #8)
> Note that this RFE is only about changing the scoring algorithm. It doesn't
> change which TM results are returned, only their sort order and the
> similarity score shown against them.

I think what the reporter actually means is to have a mode that searches only for the exact word.

> Ding, could you please put in a separate bug for the problem you've found
> with TM results?

I think what I found in the TM results is still within this bug, as the behavior does not actually change even after you delete the index file, restart the server and rebuild the index.

BTW, I have found the same symptom with MySQL-based servers as well.
However, I do think we should have a bug that says we should reduce the need for reindexing.

(In reply to comment #12)
> (In reply to comment #8)
> > Note that this RFE is only about changing the scoring algorithm. It doesn't
> > change which TM results are returned, only their sort order and the
> > similarity score shown against them.
>
> I think what the reporter actually means is to have a mode that searches only
> for the exact word.

I think what's implemented now will cover that, since the word-based matches will sort to the top. Any matches which are trigram matches only will have low similarities. However, if we want to change our indexing to be word-based rather than trigram-based, I think that should be another RFE.

> > Ding, could you please put in a separate bug for the problem you've found
> > with TM results?
>
> I think what I found in the TM results is still within this bug, as the
> behavior does not actually change even after you delete the index file,
> restart the server and rebuild the index.

I was referring to the TM returning matches which had no trigrams in common, which appears to have been resolved by rebuilding the index. So no need for a bug to be filed.

> BTW, I have found the same symptom with MySQL-based servers as well.
> However, I do think we should have a bug that says we should reduce the need
> for reindexing.

Agreed. I'm working on bug 845896.

Well, in this case, I have filed the 'true' word-based index as bug 945898.

This bug can then be VERIFIED with:
1.8.0-SNAPSHOT (20120806-0025)
1.7.2-SNAPSHOT (20120806-0012)

Thanks. Going by the See Also list, I think you mean bug 845898, not 945898.

*** Bug 845898 has been marked as a duplicate of this bug. ***
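
For readers following the thread, the split between how results are retrieved (trigram similarity) and how they are scored (whole words, excluding stop words) can be sketched roughly as below. This is a minimal Python illustration and not Zanata's code: the Jaccard formula, the stop-word list, and the function names `word_score`, `trigram_score` and `rank` are all assumptions made for the example; the linked commit may compute the score differently.

```python
import re

# Hypothetical stop-word list for illustration only; not Zanata's actual list.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})


def words(text):
    """Return the set of lowercased words in text, with stop words removed."""
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS}


def trigrams(text):
    """Return the set of character trigrams of the lowercased text."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def jaccard(a, b):
    """Jaccard overlap of two sets (assumed formula, chosen for simplicity)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def word_score(query, candidate):
    """Similarity based on whole words only."""
    return jaccard(words(query), words(candidate))


def trigram_score(query, candidate):
    """Similarity based on shared character trigrams."""
    return jaccard(trigrams(query), trigrams(candidate))


def rank(query, candidates):
    """Keep every candidate sharing a trigram with the query, then sort by
    word score first and trigram score second, highest first."""
    hits = [c for c in candidates if trigram_score(query, c) > 0]
    return sorted(
        hits,
        key=lambda c: (word_score(query, c), trigram_score(query, c)),
        reverse=True,
    )


if __name__ == "__main__":
    print(word_score("fault", "default"))     # no whole word in common
    print(trigram_score("fault", "default"))  # trigrams "fau", "aul", "ult" are shared
    print(rank("fault", ["default value", "page fault", "hello"]))
```

In this sketch "default" is still returned for the query "fault" because the two share trigrams, but its word score is 0, so a whole-word match such as "page fault" sorts above it. That matches the behaviour described in the thread: the result set is unchanged, only the order and the displayed similarity differ.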