Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 730189

Summary: Overhaul translation memory similarity algorithm
Product: [Retired] Zanata
Reporter: Sean Flanigan <sflaniga>
Component: Component-Logic
Assignee: zanata-dev-internal <zanata-dev-internal>
Status: CLOSED CURRENTRELEASE
QA Contact: Ding-Yi Chen <dchen>
Severity: medium
Docs Contact:
Priority: high
Version: 1.3
CC: damason, hpeters, mgiri, runab, zanata-bugs
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-04-23 04:33:18 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:

Description	Flags
Zanata TM accuracy in short strings	none

Description Sean Flanigan 2011-08-12 04:55:29 UTC

Description of problem:

The current similarity algorithm counts matching trigrams so that it can find fuzzy matches, but the scores it produces are sometimes unintuitive.  We need to check that it is actually calculating correctly, and see whether we can modify the algorithm to give better results.

Things to check:

a. Is the calculation definitely comparing source strings to source strings?
b. If trying to match a short string, will a much larger string which contains the target string receive a suitably high score?  And if so, should we artificially reduce it from 100%? [1]
c. If two strings both contain exact substring matches for a target string, how can we ensure that the shorter string receives a higher similarity score?
d. Is it feasible to highlight the matching trigrams?


[1] translators may assume that a 100% match is safe to re-use as is, but this is not true if only a substring matched.
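
A minimal sketch of the concern in (b), assuming the score is the fraction of the search string's trigrams that also occur in the candidate. This is an illustration only, not Zanata's actual code; the function names `trigrams` and `containment_score` and the sample strings are made up for the example.

```python
def trigrams(text):
    """Return the set of character trigrams in text (lowercased)."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def containment_score(query, candidate):
    """Fraction of the query's trigrams that also occur in the candidate.

    Hypothetical scoring rule, assumed here only to illustrate the problem.
    """
    q = trigrams(query)
    if not q:
        return 0.0
    return len(q & trigrams(candidate)) / len(q)


query = "Save file"
print(containment_score(query, "Save file"))
print(containment_score(query, "Save file as a template and close"))
```

Under this assumed rule both candidates score 1.0, because every trigram of the short search string also appears in the longer string. That is the situation footnote [1] warns about.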

Comment 1 David Mason 2011-08-15 07:25:47 UTC

Could we run the comparison the other way around for any matched strings to get a second score? Matches that have only a substring would have a high match score but a low reverse-match score. We could then show both scores or an average score, or just have it contribute to sorting.
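
A rough sketch of the reverse-score idea, assuming a directed trigram score (the fraction of one string's trigrams found in the other). The names `directed_score` and `rank`, and the sample strings, are hypothetical and not taken from Zanata.

```python
def trigrams(text):
    """Return the set of character trigrams in text (lowercased)."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def directed_score(a, b):
    """Fraction of a's trigrams that also occur in b (assumed scoring rule)."""
    ta = trigrams(a)
    if not ta:
        return 0.0
    return len(ta & trigrams(b)) / len(ta)


def rank(query, candidates):
    """Sort candidates by forward score, then by reverse score, best first."""
    scored = [
        (directed_score(query, c), directed_score(c, query), c)
        for c in candidates
    ]
    return sorted(scored, key=lambda t: (t[0], t[1]), reverse=True)


candidates = [
    "Save file as a template and close",
    "Save file now",
    "Save file",
]
for forward, reverse, text in rank("Save file", candidates):
    print(f"forward={forward:.2f} reverse={reverse:.2f} {text}")
```

With these inputs the forward score is 1.0 for all three candidates, so only the reverse score separates them: the exact match sorts first and the longest string sorts last.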

Comment 2 Sean Flanigan 2011-08-15 07:36:54 UTC

Hmm, I'll have to think about the reverse-matching thing, but that makes me realise something: if it's really similarity, it should have the same value in both directions, i.e. it should be symmetric (commutative).  But perhaps we don't want "similarity" so much as "suitability".

Using the reverse score as a secondary sort could help.  It should handle the ordering problem, but by itself wouldn't solve the "100% must be safe" problem.

Comment 3 David Mason 2011-08-15 07:54:38 UTC

Highlighting the parts of the TM source string that match the search string would probably give translators a clearer indication of how good a match the source strings are, and alert them to the fact that the match contains more than just the search string.

Alternatively, the non-matched parts of the string could be shown in red or similar as a clear warning that something isn't right with the matched string.

Comment 4 Hedda Peters 2011-08-16 01:22:33 UTC

Created attachment 518378
Zanata TM accuracy in short strings

Comment 5 Hedda Peters 2011-08-16 01:23:10 UTC

(In reply to comment #0)
 
> b. If trying to match a short string, will a much larger string which contains
> the target string receive a suitably high score?  And if so, should we
> artificially reduce it from 100%? [1]
> c. If two strings both contain exact substring matches for a target string, how
> can we ensure that the shorter string receives a higher similarity score?

Screenshot attached to demonstrate the current behaviour: two rather short strings, but only one of them should be a 100% match.


> d. Is it feasible to highlight the matching trigrams?

I'd like to make this a feature request, albeit not urgent. It is extremely helpful to have the matching parts highlighted. Example use case: A long entry has been changed slightly by the writer since the last version. If the matching parts of the message are highlighted, it is easy to spot the *one* word that has changed, rather than having to compare the whole message carefully. The translation memory in Lokalize implements this exact feature and more. Happy to demonstrate.
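
One possible way to sketch the use case above is a word-level diff with Python's standard `difflib`. This is not how Lokalize or Zanata implements highlighting; the function name `mark_changes`, the `[-...-]` / `{+...+}` markers, and the sample strings are assumptions for illustration.

```python
import difflib


def mark_changes(old, new):
    """Return new with word-level changes relative to old marked inline.

    Removed words are wrapped as [-word-], added words as {+word+}.
    """
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(new_words[j1:j2])
            continue
        if i1 < i2:  # words present only in the old string
            out.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j1 < j2:  # words present only in the new string
            out.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(out)


old = "Click the Save button to store your changes"
new = "Click the Apply button to store your changes"
print(mark_changes(old, new))
# -> Click the [-Save-] {+Apply+} button to store your changes
```

A real UI would render the marked spans in colour instead of using text markers, but the one changed word stands out either way.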

Comment 6 Sean Flanigan 2011-09-07 04:33:40 UTC

Assigning to Scrum product owner for prioritisation.

Comment 7 Sean Flanigan 2011-11-23 04:17:58 UTC

https://github.com/zanata/zanata/commit/6cdcf30e6c8e0f5b4ffa4c7926aed41ba59b413e (in 1.4 branch) switches to standard Levenshtein distance instead of substring distance, which means that only exact matches should get 100% similarity now.
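
A minimal sketch of a Levenshtein-based similarity, to show why only exact matches reach 100%. This is not the code from the commit; in particular, normalising by the length of the longer string is an assumption, and the names `levenshtein` and `similarity` are made up for the example.

```python
def levenshtein(a, b):
    """Standard Levenshtein edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep it
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]; assumed normalisation by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(similarity("Save file", "Save file"))
print(similarity("Save file", "Save file as a template and close"))
```

Because the distance is zero only for identical strings, a longer string that merely contains the search string now scores below 1.0. The measure is also symmetric, so it gives the same value in both directions.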

The diff highlight feature will be tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=756264

Comment 8 David Mason 2012-01-06 04:34:17 UTC

verified in 1.4

Comment 9 David Mason 2012-01-31 06:08:38 UTC

verified in 1.5