Bug 730189
Summary: | Overhaul translation memory similarity algorithm | ||||||
---|---|---|---|---|---|---|---|
Product: | [Retired] Zanata | Reporter: | Sean Flanigan <sflaniga> | ||||
Component: | Component-Logic | Assignee: | zanata-dev-internal <zanata-dev-internal> | ||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Ding-Yi Chen <dchen> | ||||
Severity: | medium | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 1.3 | CC: | damason, hpeters, mgiri, runab, zanata-bugs | ||||
Target Milestone: | --- | ||||||
Target Release: | --- | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2012-04-23 04:33:18 UTC | Type: | --- | ||||
Attachments: |
|
Description
Sean Flanigan
2011-08-12 04:55:29 UTC
Could we run the comparison the other way around for any matched strings to get a second score? Matches that contain the search string only as a substring would have a high match score but a low reverse-match score. We could then show both scores or an average score, or just have the reverse score contribute to sorting.

Hmm, I'll have to think about the reverse-matching thing, but that makes me realise something: if it's really similarity, it should have the same value in both directions, i.e. it should be symmetric (commutative). But perhaps we don't want "similarity" so much as "suitability".

Using the reverse score as a secondary sort could help. It should handle the ordering problem, but by itself it wouldn't solve the "100% must be safe" problem.

Highlighting the parts of the TM source string that match the search string would probably give translators a clearer indication of how good a match the source strings are, and alert them to the fact that the match contains more than just the search string. Alternatively, the non-matched parts of the string could be shown in red or similar, as a clear warning that something isn't right with the matched string.

Created attachment 518378: Zanata TM accuracy in short strings
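To make the forward/reverse score idea above concrete, here is a minimal sketch. It assumes a trigram containment score (the fraction of one string's character trigrams found in the other); this is an illustrative stand-in, not Zanata's actual substring-distance code, and the names `trigrams`, `containment` and `rank` are hypothetical.

```python
def trigrams(text):
    """Return the set of lower-cased character trigrams in text."""
    s = text.lower()
    if len(s) < 3:
        # Very short strings are treated as a single token (an assumption).
        return {s} if s else set()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def containment(query, candidate):
    """Fraction of the query's trigrams that also occur in candidate.

    This score is asymmetric: swapping the arguments gives the reverse score.
    """
    q = trigrams(query)
    if not q:
        return 0.0
    return len(q & trigrams(candidate)) / len(q)


def rank(query, candidates):
    """Sort by forward score, using the reverse score as a secondary key."""
    return sorted(
        candidates,
        key=lambda c: (containment(query, c), containment(c, query)),
        reverse=True,
    )


query = "Open file"
longer = "Open file in new window"
forward = containment(query, longer)   # every query trigram is in the longer string
reverse = containment(longer, query)   # most of the longer string is unmatched
ranked = rank(query, [longer, "Open file"])
```

With this scoring, the longer string gets a perfect forward score but a low reverse score, so the reverse score only fixes the ordering: the exact match sorts first, while the longer string would still be displayed as 100% unless the displayed score also changes.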
(In reply to comment #0)
> b. If trying to match a short string, will a much larger string which contains
> the target string receive a suitably high score? And if so, should we
> artificially reduce it from 100%? [1]
> c. If two strings both contain exact substring matches for a target string, how
> can we ensure that the shorter string receives a higher similarity score?

Screenshot attached to demonstrate the current behaviour: two rather short strings, but only one of them should be a 100% match.

> d. Is it feasible to highlight the matching trigrams?

I'd like to make this a feature request, albeit not urgent. It is extremely helpful to have the matching parts highlighted.

Example use case: A long entry has been changed slightly by the writer since the last version. If the matching parts of the message are highlighted, it is easy to spot the *one* word that has changed, rather than having to compare the whole message carefully.

The translation memory in Lokalize implements this exact feature and more. Happy to demonstrate.

Assigning to Scrum product owner for prioritisation.

https://github.com/zanata/zanata/commit/6cdcf30e6c8e0f5b4ffa4c7926aed41ba59b413e (in 1.4 branch) switches to standard Levenshtein distance instead of substring distance, which means that only exact matches should get 100% similarity now.

The diff highlight feature will be tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=756264

verified in 1.4

verified in 1.5 |
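
For illustration, a Levenshtein-based similarity of the kind mentioned in the commit note might look roughly like the sketch below. The normalisation (dividing the edit distance by the length of the longer string) is an assumption made here for the example, not something taken from the Zanata commit, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Standard edit distance: insertions, deletions and substitutions cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Map edit distance to a 0..1 score (assumed normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(a, b) / longest


exact = similarity("Open file", "Open file")
partial = similarity("Open file", "Open file in new window")
```

Because edit distance is symmetric and is zero only for identical strings, this score has the same value in both directions and reaches 100% only on an exact match, which is the behaviour the commit note describes.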