Bug 825202

Summary: RFE: Translation Memory suggestions should not consider a portion of word.
Product: [Retired] Zanata
Reporter: Manoj Kumar Giri <mgiri>
Component: Usability
Assignee: Sean Flanigan <sflaniga>
Status: CLOSED CURRENTRELEASE
QA Contact: Ding-Yi Chen <dchen>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 1.6-SNAPSHOT
CC: ankit, dchen, sflaniga, zanata-bugs
Target Milestone: ---   
Target Release: 1.7   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Translation memory scoring based on whole words.
Reason: Strings with partial word matches appear higher than expected in TM results because they are based on sequences of similar characters (trigrams), rather than entire words as is natural for human beings.
Result (if any): Whole word matches will generally be shown before partial word matches. Note that partial word matches may still appear in translation memory results, pending RFE bug 845898.
Last Closed: 2012-09-11 05:11:19 UTC Type: Bug

Description Manoj Kumar Giri 2012-05-25 11:04:20 UTC
Description of problem:
While translating a file in Zanata, I got a TM suggestion that matched only a portion of a word.

Version-Release number of selected component (if applicable):

1.6

How reproducible:
Every time

Steps to Reproduce: Check translation memory suggestions for single-word strings.
1. As an example, open the file: Katello (1.0) https://translate.zanata.org/zanata/project/view/katello
   msgID: "End"

2. Check the TM suggestions.
   It will show: "Trend"    ----- similarity: 40%.
  
Actual results:

"Trend"    ----- similarity: 40%.
The match counts only the last three letters of "Trend", which makes no sense.

Expected results:
It shouldn't give a suggestion like this. Matching should be done on the words in a string, not on the letters within a word.
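
For illustration, here is a minimal Python sketch of why a character-trigram index can pair "End" with "Trend". The function names and the Jaccard overlap measure are assumptions made for this note, not Zanata's code, so the number it produces is not expected to match the 40% shown above.

```python
def char_trigrams(text):
    """Return the set of overlapping 3-character substrings, lowercased."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_overlap(a, b):
    """Jaccard overlap of the two trigram sets (an illustrative metric only)."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# "End" has a single trigram, and it is also one of the trigrams of "Trend",
# so a trigram-based lookup treats the two strings as related.
print(sorted(char_trigrams("End")))
print(sorted(char_trigrams("Trend")))
print(trigram_overlap("End", "Trend"))
```

The only trigram of "End" is fully contained in "Trend", which is enough for a trigram lookup to return it even though no whole word matches.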

Additional info:

Comment 1 Sean Flanigan 2012-06-25 07:39:42 UTC
We do need to improve the similarity scoring algorithm, and making it word-based should help (as long as we still have a way of fuzzy-matching similar words).

Comment 2 Sean Flanigan 2012-07-17 07:59:29 UTC
https://github.com/zanata/zanata/commit/4d1acb815875c8ca0bcf0d3b56dfd9d9a295bc77 implements word-based similarity scoring (excluding stop words).  It doesn't try to deal with stemming or fuzzy word matching, but I don't think it would improve results enough to be worth the complexity.
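
As a rough idea of what word-based scoring that skips stop words looks like, here is a minimal Python sketch. It is not the code from the commit above: the stop-word list, the function names and the Jaccard formula are all assumptions chosen to keep the example short.

```python
import re

# Hypothetical stop-word list for illustration; the real list may differ.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "and", "or"}


def content_words(text):
    """Extract lowercase alphanumeric words and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def word_similarity(query, candidate):
    """Share of whole words in common (Jaccard), from 0.0 to 1.0."""
    q, c = content_words(query), content_words(candidate)
    if not q or not c:
        return 0.0
    return len(q & c) / len(q | c)


print(word_similarity("End", "Trend"))                 # no whole word in common
print(word_similarity("End of the file", "End file"))  # same content words
```

No stemming or fuzzy word matching is attempted here either, so "ends" and "end" would count as different words.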

Comment 3 Ding-Yi Chen 2012-07-31 02:19:21 UTC
Tested with Zanata versions 1.8-SNAPSHOT (20120731-0025)
and 1.7.1-SNAPSHOT (20120731-0013).

On 1.8, the search results are irrelevant to the search terms.
On 1.7, with any of fuzzy, phrase, or lucene: typing "fault" also returns "default", with similarity 0.

Should we remove the results with similarity 0?

Comment 4 Sean Flanigan 2012-07-31 02:50:05 UTC
> On 1.8, the search results are irrelevant to the search terms.

I don't understand.

> On 1.7, with any of fuzzy, phrase, or lucene: typing "fault" also
> returns "default", with similarity 0.

> Should we remove the results with similarity 0?

I don't think so, it's just an artefact of different scoring.  Lucene is using fuzzy trigram similarity to return results, whereas the scoring is based on words (excluding stop words).  0 doesn't really mean there is no similarity, it just means the word score can't measure it.
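
To make the mismatch concrete, here is a minimal Python sketch of the two stages: retrieval by shared character trigrams, then scoring by whole words. The helper names, the tiny in-memory list and the Jaccard word score are assumptions for illustration (stop-word handling is left out for brevity), not the actual Lucene query or Zanata scorer.

```python
import re


def char_trigrams(text):
    """Set of overlapping 3-character substrings, lowercased."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def words(text):
    """Set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, memory):
    """Stage 1: keep every entry sharing at least one trigram with the query."""
    qt = char_trigrams(query)
    return [entry for entry in memory if qt & char_trigrams(entry)]


def word_score(query, entry):
    """Stage 2: Jaccard overlap of whole words."""
    q, e = words(query), words(entry)
    return len(q & e) / len(q | e) if q and e else 0.0


memory = ["default", "fault", "page"]
hits = retrieve("fault", memory)
print({entry: word_score("fault", entry) for entry in hits})
```

"default" is retrieved because it shares the trigrams "fau", "aul" and "ult" with the query, but it scores 0 because no whole word matches.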

Comment 5 Ding-Yi Chen 2012-08-01 04:23:40 UTC
(In reply to comment #4)
> > On 1.8, the search results are irrelevant to the search terms.
> 
> I don't understand.

For example, when typing 'default' in the translation memory search, 
the 2nd and 3rd results do not contain "default".

Comment 6 Ding-Yi Chen 2012-08-02 04:24:28 UTC
Tested with Zanata version 1.8-SNAPSHOT (20120802-0024):

Reindexing does not help.
REASSIGNED

Comment 8 Sean Flanigan 2012-08-02 05:28:52 UTC
Note that this RFE is only about changing the scoring algorithm.  It doesn't change which TM results are returned, only their sort order and the similarity score shown against them.

Ding, could you please put in a separate bug for the problem you've found with TM results?

Comment 12 Ding-Yi Chen 2012-08-06 02:50:33 UTC
(In reply to comment #8)
> Note that this RFE is only about changing the scoring algorithm.  It doesn't
> change which TM results are returned, only their sort order and the
> similarity score shown against them.

I think what the reporter actually means is a mode that searches only for the exact word.

> Ding, could you please put in a separate bug for the problem you've found
> with TM results?

I think what I found in the TM results is still within the scope of this bug, as the behavior does not actually change even after you delete the index file, restart the server and rebuild the index.

BTW, I have found the same symptom with MySQL-based servers as well.

However, I do think we should have a bug that says we should reduce the need for reindexing.

Comment 13 Sean Flanigan 2012-08-06 03:30:33 UTC
(In reply to comment #12)
> (In reply to comment #8)
> > Note that this RFE is only about changing the scoring algorithm.  It doesn't
> > change which TM results are returned, only their sort order and the
> > similarity score shown against them.
> 
> I think what the reporter actually means is a mode that searches only for the
> exact word. 

I think what's implemented now will cover that, since the word-based matches will sort to the top.  Any matches which are trigram matches only will have low similarities.  However, if we want to change our indexing to be word-based rather than trigram-based, I think that should be another RFE.
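
A minimal Python sketch of that sort order, assuming a plain Jaccard word score and hypothetical helper names rather than the real implementation:

```python
import re


def words(text):
    """Set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def word_score(query, entry):
    """Jaccard overlap of whole words, from 0.0 to 1.0."""
    q, e = words(query), words(entry)
    return len(q & e) / len(q | e) if q and e else 0.0


def rank(query, candidates):
    """Order candidates by descending word score; ties keep their input order."""
    return sorted(candidates, key=lambda entry: word_score(query, entry), reverse=True)


# "Trend" is still in the list, but it drops below the whole-word matches.
print(rank("End", ["Trend", "The end", "End"]))
```

The trigram-only match is not removed, it simply ends up at the bottom with a score of 0.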
 
> > Ding, could you please put in a separate bug for the problem you've found
> > with TM results?
> 
> I think what I found in the TM results is still within the scope of this bug, as
> the behavior does not actually change even after you delete the index file,
> restart the server and rebuild the index.

I was referring to the TM returning matches which had no trigrams in common, which appears to have been resolved by rebuilding the index.  So no need for a bug to be filed.
 
> BTW, I have found the same symptom with MySQL-based servers as well.
> However, I do think we should have a bug that says we should reduce the need
> for reindexing.

Agreed.  I'm working on bug 845896.

Comment 14 Ding-Yi Chen 2012-08-06 03:57:31 UTC
Well, in that case, I have filed the 'true' word-based index request as bug 945898.

This bug can then be VERIFIED with:
1.8.0-SNAPSHOT (20120806-0025)
1.7.2-SNAPSHOT (20120806-0012)

Comment 15 Sean Flanigan 2012-08-06 04:30:28 UTC
Thanks.  Going by the See Also list, I think you mean bug 845898, not 945898.

Comment 16 Ding-Yi Chen 2012-11-09 01:26:57 UTC
*** Bug 845898 has been marked as a duplicate of this bug. ***