730189 – Overhaul translation memory similarity algorithm

Summary: Overhaul translation memory similarity algorithm

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Zanata
Classification:	Retired
Component:	Component-Logic
Sub Component:
Version:	1.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	zanata-dev-internal
QA Contact:	Ding-Yi Chen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-08-12 04:55 UTC by Sean Flanigan
Modified:	2012-04-23 04:33 UTC
CC List:	5 users
Fixed In Version:
Story Points:	---
Clone Of:
Environment:
Last Closed:	2012-04-23 04:33:18 UTC
Embargoed:

Attachments
Zanata TM accuracy in short strings (322.39 KB, image/png) 2011-08-16 01:22 UTC, Hedda Peters

Description Sean Flanigan 2011-08-12 04:55:29 UTC

Description of problem:

The current similarity algorithm is based on counting the number of matching trigrams, so that it can find fuzzy matches, but it is sometimes pretty unintuitive.  We need to check that it is actually calculating correctly, and see if we can modify the algorithm to give better results.

Things to check:

a. Is the calculation definitely comparing source strings to source strings?
b. If trying to match a short string, will a much larger string which contains the target string receive a suitably high score?  And if so, should we artificially reduce it from 100%? [1]
c. If two strings both contain exact substring matches for a target string, how can we ensure that the shorter string receives a higher similarity score?
d. Is it feasible to highlight the matching trigrams?


[1] translators may assume that a 100% match is safe to re-use as is, but this is not true if only a substring matched.
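
To make points (b) and (c) concrete, here is a minimal sketch of a trigram-counting score. It assumes the score is the fraction of the search string's trigrams that also occur in the candidate; the names `trigrams` and `containment_score` and the sample strings are hypothetical and not taken from the Zanata code. Under that assumption, a much longer string that merely contains the search string still scores 100%, and the score differs depending on the direction of the comparison.

```python
def trigrams(text):
    """Return the set of 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def containment_score(query, candidate):
    """Fraction of the query's trigrams that also occur in the candidate.

    Assumed scoring rule, for illustration only; not Zanata's actual code.
    """
    query_grams = trigrams(query)
    if not query_grams:
        return 0.0
    return len(query_grams & trigrams(candidate)) / len(query_grams)


query = "Save file"
exact = "Save file"
longer = "Save file before closing the window"

print(containment_score(query, exact))   # exact match
print(containment_score(query, longer))  # substring match scores just as high
print(containment_score(longer, query))  # reversed direction scores lower
```

Both of the first two calls return 1.0, which is the "100% is not necessarily safe" situation described in footnote [1].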

Comment 1 David Mason 2011-08-15 07:25:47 UTC

Could we run the comparison the other way around for any matched strings to get a second score? Matches that have only a substring would have a high match score but a low reverse-match score. We could then show both scores or an average score, or just have it contribute to sorting.

Comment 2 Sean Flanigan 2011-08-15 07:36:54 UTC

Hmm, I'll have to think about the reverse-matching thing, but that makes me realise something: if it's really similarity, it should have the same value in both directions, i.e. it should be commutative.  But perhaps we don't want "similarity" so much as "suitability".

Using the reverse score as a secondary sort could help.  It should handle the ordering problem, but by itself wouldn't solve the "100% must be safe" problem.
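
The forward/reverse idea from comments 1 and 2 can be sketched as follows. This assumes a directed trigram score (the fraction of one string's trigrams found in the other); `directed_score`, `rank_matches` and the sample strings are hypothetical names and data chosen for illustration, not the Zanata implementation.

```python
def trigrams(text):
    """Return the set of 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def directed_score(source, target):
    """Fraction of source's trigrams found in target (asymmetric by design)."""
    grams = trigrams(source)
    if not grams:
        return 0.0
    return len(grams & trigrams(target)) / len(grams)


def rank_matches(query, candidates):
    """Sort by forward score, using the reverse score as a secondary key."""
    scored = [
        (directed_score(query, c), directed_score(c, query), c)
        for c in candidates
    ]
    scored.sort(key=lambda row: (row[0], row[1]), reverse=True)
    return scored


query = "Save file"
candidates = [
    "Save file before closing the window",
    "Save file",
    "Save file now",
]
for forward, reverse, text in rank_matches(query, candidates):
    print(f"forward={forward:.2f} reverse={reverse:.2f} {text}")
```

In this sketch the exact match sorts first and the longest candidate last, but all three still report a forward score of 100%, so the secondary sort fixes the ordering without fixing the displayed percentage.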

Comment 3 David Mason 2011-08-15 07:54:38 UTC

Highlighting the parts of the TM source string that match the search string would probably give translators a clearer indication of how good a match the source strings are, and alert them to the fact that the match contains more than just the search string.

Alternatively, the non-matched parts of the string could be shown in red or similar as a clear warning that something isn't right with the matched string.

Comment 4 Hedda Peters 2011-08-16 01:22:33 UTC

Created attachment 518378
Zanata TM accuracy in short strings

Comment 5 Hedda Peters 2011-08-16 01:23:10 UTC

(In reply to comment #0)
 
> b. If trying to match a short string, will a much larger string which contains
> the target string receive a suitably high score?  And if so, should we
> artificially reduce it from 100%? [1]
> c. If two strings both contain exact substring matches for a target string, how
> can we ensure that the shorter string receives a higher similarity score?

Screenshot attached to demonstrate the current behaviour: two rather short strings, but only one of them should be a 100% match.


> d. Is it feasible to highlight the matching trigrams?

I'd like to make this a feature request, albeit not urgent. It is extremely helpful to have the matching parts highlighted. Example use case: A long entry has been changed slightly by the writer since the last version. If the matching parts of the message are highlighted, it is easy to spot the *one* word that has changed, rather than having to compare the whole message carefully. The translation memory in Lokalize implements this exact feature and more. Happy to demonstrate.
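
As a rough illustration of the use case above, the following sketch marks the words that differ between two versions of a message. It uses a word-level diff from Python's standard `difflib` module rather than trigram matching, and it makes no claim about how Lokalize or Zanata do this; `highlight_changes` and the sample sentences are hypothetical.

```python
import difflib


def highlight_changes(old, new):
    """Return `new` with every word that differs from `old` wrapped in asterisks."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words)
    out = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        segment = new_words[j1:j2]
        if tag == "equal":
            out.extend(segment)  # unchanged words pass through as-is
        else:
            out.extend(f"*{word}*" for word in segment)  # replaced or inserted
    return " ".join(out)


old = "Click the Save button to store your changes"
new = "Click the Apply button to store your changes"
print(highlight_changes(old, new))
# prints: Click the *Apply* button to store your changes
```

Words that exist only in the old version are dropped silently in this sketch; a real display would probably want to show them as deletions.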

Comment 6 Sean Flanigan 2011-09-07 04:33:40 UTC

Assigning to Scrum product owner for prioritisation.

Comment 7 Sean Flanigan 2011-11-23 04:17:58 UTC

https://github.com/zanata/zanata/commit/6cdcf30e6c8e0f5b4ffa4c7926aed41ba59b413e (in 1.4 branch) switches to standard Levenshtein distance instead of substring distance, which means that only exact matches should get 100% similarity now.

The diff highlight feature will be tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=756264
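
For readers unfamiliar with it, here is a minimal sketch of a Levenshtein-based similarity. The edit-distance routine is the standard dynamic-programming algorithm; normalising by the length of the longer string is an assumption made for illustration and may differ from what the linked commit does. The function names and sample strings are hypothetical.

```python
def levenshtein(a, b):
    """Standard edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """1.0 for identical strings, falling towards 0.0 as edits accumulate.

    Normalising by the longer length is an assumption for this sketch.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


query = "Save file"
longer = "Save file before closing the window"
print(similarity(query, query))   # identical strings
print(similarity(query, longer))  # substring match no longer scores 100%
print(similarity(longer, query))  # same value in both directions
```

With this kind of measure, only identical strings reach 1.0, and the score is the same in both directions, which addresses the concerns raised in comments 0 and 2.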

Comment 8 David Mason 2012-01-06 04:34:17 UTC

verified in 1.4

Comment 9 David Mason 2012-01-31 06:08:38 UTC

verified in 1.5
