Red Hat Bugzilla – Bug 730189
Overhaul translation memory similarity algorithm
Last modified: 2012-04-23 00:33:18 EDT
Description of problem:
The current similarity algorithm is based on counting the number of matching trigrams, so that it can find fuzzy matches, but it is sometimes pretty unintuitive. We need to check that it is actually calculating correctly, and see if we can modify the algorithm to give better results.
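For reference, a minimal sketch of the kind of trigram counting described above. This is not the Zanata code; the function names and the choice to divide by the number of trigrams in the search string are assumptions made for illustration. It shows how a longer string that merely contains the search string can still reach 100%.

```python
def trigrams(text):
    """Return the set of 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def trigram_score(query, candidate):
    """Fraction of the query's trigrams that also occur in the candidate.

    Assumed scoring rule for illustration only: the score is normalised
    by the query, so extra text in the candidate is never penalised.
    """
    query_grams = trigrams(query)
    if not query_grams:
        return 0.0
    return len(query_grams & trigrams(candidate)) / len(query_grams)


# Both of these score 1.0, although only the first is an exact match.
exact = trigram_score("Open file", "Open file")
superstring = trigram_score("Open file", "Open file in a new window")
```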
Things to check:
a. Is the calculation definitely comparing source strings to source strings?
b. If trying to match a short string, will a much larger string which contains the target string receive a suitably high score? And if so, should we artificially reduce it from 100%? 
c. If two strings both contain exact substring matches for a target string, how can we ensure that the shorter string receives a higher similarity score?
d. Is it feasible to highlight the matching trigrams?
Translators may assume that a 100% match is safe to reuse as is, but this is not true if only a substring matched.
Could we run the comparison the other way around for any matched strings to get a second score? Matches that have only a substring would have a high match score but a low reverse-match score. We could then show both scores or an average score, or just have it contribute to sorting.
Hmm, I'll have to think about the reverse-matching thing, but that makes me realise something: if it's really similarity, it should have the same value in both directions, i.e. it should be commutative. But perhaps we don't want "similarity" so much as "suitability".
Using the reverse score as a secondary sort could help. It should handle the ordering problem, but by itself wouldn't solve the "100% must be safe" problem.
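A rough sketch of the forward and reverse scores used as a two-level sort key. The trigram scoring rule and the names coverage and rank_matches are hypothetical, chosen only to illustrate the idea; they are not taken from the Zanata code.

```python
def trigrams(text):
    """Return the set of 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def coverage(a, b):
    """Fraction of a's trigrams found in b. Not symmetric by design."""
    grams = trigrams(a)
    return len(grams & trigrams(b)) / len(grams) if grams else 0.0


def rank_matches(query, tm_sources):
    """Sort TM source strings by forward score, then by reverse score."""
    scored = [(coverage(query, s), coverage(s, query), s) for s in tm_sources]
    # Highest forward score first; the reverse score breaks ties.
    return sorted(scored, key=lambda t: (t[0], t[1]), reverse=True)


ranked = rank_matches("Open file", ["Open file in a new window", "Open file"])
```

With this input both candidates have a forward score of 1.0, and the exact match sorts first because its reverse score is also 1.0. As noted above, the longer string still displays as 100% unless the two scores are combined or shown together.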
Highlighting the parts of the TM source string that match the search string would probably give translators a clearer indication of how good a match the source strings are, and alert them to the fact that the match contains more than just the search string.
Alternatively, the non-matched parts of the string could be shown in red or similar as a clear warning that something isn't right with the matched string.
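One way the highlighting could be prototyped is a word-level comparison with Python's difflib. This is only a sketch under stated assumptions: the function name mark_unmatched and the [[ ]] markers (standing in for red text in the UI) are invented for illustration, and splitting on whitespace is a simplification.

```python
from difflib import SequenceMatcher


def mark_unmatched(query, tm_source):
    """Wrap words of tm_source that have no counterpart in query in [[ ]]."""
    a, b = query.split(), tm_source.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    out, pos = [], 0
    # The final matching block is a zero-length sentinel at the end of b,
    # so trailing unmatched words are flushed by the same loop.
    for block in matcher.get_matching_blocks():
        if block.b > pos:
            out.append("[[" + " ".join(b[pos:block.b]) + "]]")
        out.extend(b[block.b:block.b + block.size])
        pos = block.b + block.size
    return " ".join(out)


marked = mark_unmatched("Open file", "Open file in a new window")
```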
Created attachment 518378
Zanata TM accuracy in short strings
(In reply to comment #0)
> b. If trying to match a short string, will a much larger string which contains
> the target string receive a suitably high score? And if so, should we
> artificially reduce it from 100%? 
> c. If two strings both contain exact substring matches for a target string, how
> can we ensure that the shorter string receives a higher similarity score?
Screenshot attached to demonstrate the current behaviour: two rather short strings, but only one of them should be a 100% match.
> d. Is it feasible to highlight the matching trigrams?
I'd like to make this a feature request, albeit not urgent. It is extremely helpful to have the matching parts highlighted. Example use case: A long entry has been changed slightly by the writer since the last version. If the matching parts of the message are highlighted, it is easy to spot the *one* word that has changed, rather than having to compare the whole message carefully. The translation memory in Lokalize implements this exact feature and more. Happy to demonstrate.
Assigning to Scrum product owner for prioritisation.
https://github.com/zanata/zanata/commit/6cdcf30e6c8e0f5b4ffa4c7926aed41ba59b413e (in 1.4 branch) switches to standard Levenshtein distance instead of substring distance, which means that only exact matches should get 100% similarity now.
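To illustrate why a Levenshtein-based score gives 100% only for exact matches, here is a minimal sketch. The edit distance is the standard algorithm, but the normalisation (dividing by the length of the longer string) is an assumption made for this example and may differ from what the commit actually does.

```python
def levenshtein(a, b):
    """Standard Levenshtein edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


exact = similarity("Open file", "Open file")
superstring = similarity("Open file", "Open file in a new window")
```

Because the distance is zero only when the strings are identical, the score reaches 1.0 only for exact matches, and it has the same value in both directions.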
The diff highlight feature will be tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=756264
verified in 1.4
verified in 1.5