Bug 805737

Summary: Incorrect / unexpected relative percentages of TM matches
Product: [Retired] Zanata
Reporter: Hedda Peters <hpeters>
Component: Component-UI
Assignee: Alex Eng <aeng>
Status: CLOSED UPSTREAM
QA Contact: Zanata-QA Mailing List <zanata-qa>
Severity: high
Docs Contact:
Priority: high
Version: 1.5
CC: damason, djansen, lbrooker, mkim, pahuang, sflaniga, zanata-bugs
Target Milestone: ---
Keywords: screened
Target Release: ---
Hardware: Unspecified
OS: Linux
URL: https://translate.engineering.redhat.com/webtrans/Application.html?project=sam&iteration=1.0&localeId=de&locale=de#view:doc;doc:topics/Concepts/Cloud/SAM_Dashboard
Doc Type: Bug Fix
Last Closed: 2015-07-31 01:44:42 UTC
Attachments (description; flags):
- See msgid and TM match 1 & 3; none
- See msgid and TM match 1 & 2; none
- Bizarre example for unexpected TM match percentages; none

Description Hedda Peters 2012-03-21 23:44:22 UTC
User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1

I found unexpected relative percentages of TM matches in the following example:

msgid:
"<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol."

TM match 1 with 54%:
"<guilabel>Invalid Subscriptions</guilabel>"

TM match 3 with 37%:
"<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol. These are systems that have products installed, but have not consumed a subscription. These systems need attention immediately."

I would have expected TM match 3 to be higher up the list of matches, since it contains the entire msgid. TM match 1 only contains part of the msgid.

Reproducible: Didn't try

Actual Results:
The (in my mind) best TM match is in position 3 with 37%.

Expected Results:
It should be in position 1.

Comment 1 Hedda Peters 2012-03-21 23:45:59 UTC
Created attachment 571889 [details]
See msgid and TM match 1 & 3

Comment 2 Hedda Peters 2012-03-22 00:32:27 UTC
Another good example: 

msgid:
"These systems need attention immediately."

TM match 1 with 39%:
"Start up pdb immediately."

TM match 2 with 32%:
"These are systems that have products installed, but have not consumed a subscription. These systems need attention immediately."


Again, TM match 2 is a very good match that deserves a higher position and a higher percentage.

Comment 3 Hedda Peters 2012-03-22 00:33:23 UTC
Created attachment 571890 [details]
See msgid and TM match 1 & 2

Comment 4 Hedda Peters 2012-05-09 00:07:48 UTC
Please see attached screenshot for an example of really bizarre TM percentages.

TM match #6 contains the entire English string. Even so, other TM matches that contain only piecemeal matches here and there get a higher percentage.

TM matches #1-#5 are effectively useless. TM match #6 is a perfectly usable match, but will go unnoticed due to its low position.

May I ask that this be treated with a higher priority than the currently assigned "low"? I'm afraid people will learn to disregard TM matches if they seem useless.

Comment 5 Hedda Peters 2012-05-09 00:08:48 UTC
Created attachment 583083 [details]
Bizarre example for unexpected TM match percentages

Comment 6 Sean Flanigan 2012-06-25 07:38:29 UTC
We need to do a better job of identifying long substrings that exactly match the query, and weighting them appropriately.
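One way to make "long substrings that exactly match the query" measurable is to compute how much of the query is covered by its longest contiguous run shared with a TM entry. The following is only a sketch of that idea using Python's standard difflib module; the function names are hypothetical and this is not Zanata's implementation.

```python
from difflib import SequenceMatcher


def longest_shared_run(query: str, candidate: str) -> str:
    """Return the longest contiguous substring shared by query and candidate."""
    # autojunk=False disables difflib's heuristic for very frequent characters.
    matcher = SequenceMatcher(None, query, candidate, autojunk=False)
    match = matcher.find_longest_match(0, len(query), 0, len(candidate))
    return query[match.a:match.a + match.size]


def query_coverage(query: str, candidate: str) -> float:
    """Fraction of the query covered by the longest shared run (0.0 to 1.0)."""
    if not query:
        return 0.0
    return len(longest_shared_run(query, candidate)) / len(query)


# Strings taken from comment 2 of this bug.
QUERY = "These systems need attention immediately."
TM_SHORT = "Start up pdb immediately."
TM_LONG = ("These are systems that have products installed, but have not "
           "consumed a subscription. These systems need attention immediately.")

if __name__ == "__main__":
    print(query_coverage(QUERY, TM_SHORT))
    print(query_coverage(QUERY, TM_LONG))
```

A coverage of 1.0 means the TM entry contains the whole query verbatim, which is the case a ranking could weight more heavily.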

Comment 7 Sean Flanigan 2012-07-31 01:39:05 UTC
Note that Zanata 1.7 uses a new algorithm for calculating similarity, based on words.  See bug 825202

Comment 8 Hedda Peters 2012-07-31 01:40:45 UTC
(In reply to comment #7)
> Note that Zanata 1.7 uses a new algorithm for calculating similarity, based
> on words.  See bug 825202


Oh great!!  I'll keep an eye on that.

Thanks

Comment 9 Michelle Kim 2012-11-13 06:38:53 UTC
Hi Hedda,

Another bug triage effort from me :-) Would you mind letting us know whether this bug is still an issue in 2.0?

Comment 10 Hedda Peters 2012-11-14 00:17:59 UTC
Michelle, Zanata 2.0 didn't display any TM results lower than 100% at first. (Not sure if that's fixed yet - in some projects I still see only 100% or no results, other projects also show the occasional lower % result)

I therefore can't tell you one way or the other at the moment; I will have to keep an eye on it once the TM shows several results again with different percentages.

Comment 11 Michelle Kim 2012-11-14 23:52:23 UTC
Thanks, Hedda. Please keep an eye on it and see whether this happens again in 2.0.

Comment 13 Damian Jansen 2014-02-28 05:45:07 UTC
Retested at 54d204020b600be1e8e3c1a9a357a0e02e832861

I'm going to mark this as good-to-go.

Comment 14 Hedda Peters 2014-09-23 23:17:03 UTC
May I reopen this BZ?

Not impressed by a TM result I found in the current 3.4.2 version.
What clearly appears to be the best TM match only gets a 33% similarity value, while a weaker match gets 40%.

Screenshot to follow.

Comment 16 Michelle Kim 2015-02-24 01:33:06 UTC
Hi Hedda,

Have you seen this happen again recently? We will look into it.

Comment 17 Hedda Peters 2015-02-24 02:03:45 UTC
Michelle,

I am not taking notice of the TM percentages anymore - that is what I have learned from this bug being open for almost three years now.

The TM has overall taken a backseat for me because of these and other issues. It is not as usable as it should be, so consequently I don't use it as much anymore.

Comment 18 Michelle Kim 2015-02-25 03:56:33 UTC
Hi Hedda,

I will take the chance to encourage the Team to revisit and research the TM algorithm again to see whether we can improve the accuracy of the TM percentages.

Comment 19 Damian Jansen 2015-03-30 04:57:35 UTC
Upgrading to high - quite damaging to the concept of TM matching.

Comment 20 Patrick Huang 2015-06-01 23:31:45 UTC
Just found a bug that's related to the TM similarity percentage algorithm.

Let's say we have two strings:
- "you.what?yo"
- "you. what? yo..."

The current algorithm gives a 0% match because we tokenize the string on punctuation PLUS a space. When users leave out the space (intentionally or not), this produces a completely incorrect result.

If I change the split regex from "[,.]?[\s]+" to "[,\.\?!\s]+[\s]?", the above example will show 100%. There are more punctuation characters we need to cover (e.g. ";", ")", "(", "-", etc.).
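The effect of the two split regexes can be checked with a small script. This is a sketch only: the two patterns are the ones quoted above, but the overlap measure (shared tokens divided by all distinct tokens) is a stand-in chosen for illustration and is an assumption, not Zanata's actual similarity formula.

```python
import re

OLD_SPLIT = re.compile(r"[,.]?[\s]+")        # regex quoted above as the current one
NEW_SPLIT = re.compile(r"[,\.\?!\s]+[\s]?")  # regex quoted above as the proposed one


def tokenize(text, pattern):
    """Split text on pattern, dropping empty strings that re.split can return."""
    return [token for token in pattern.split(text) if token]


def overlap_percent(a, b, pattern):
    """Stand-in similarity: shared distinct tokens over all distinct tokens."""
    tokens_a = set(tokenize(a, pattern))
    tokens_b = set(tokenize(b, pattern))
    if not tokens_a or not tokens_b:
        return 0
    return round(100 * len(tokens_a & tokens_b) / len(tokens_a | tokens_b))


if __name__ == "__main__":
    first, second = "you.what?yo", "you. what? yo..."
    for name, pattern in (("old", OLD_SPLIT), ("new", NEW_SPLIT)):
        print(name, tokenize(first, pattern), tokenize(second, pattern),
              overlap_percent(first, second, pattern))
```

With the old pattern the first string stays a single token, so the two strings share nothing; with the new pattern both reduce to the same three tokens.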

Hedda's original problem in the bug description is a different issue. We may tackle it differently. Given the example:
msgid:
"<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol."

TM match 1 with 54%:
"<guilabel>Invalid Subscriptions</guilabel>"

TM match 3 with 37%:
"<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol. These are systems that have products installed, but have not consumed a subscription. These systems need attention immediately."

If we tokenize the incoming message by words, and a TM entry can provide a translation for all the words in that message (as in TM match 3), we should give it a higher percentage: if not 100%, then at least somewhere close to it when it is not an exact match. This requires a bit of post-processing. Maybe we should only do it when we can find an exact substring match.
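The post-processing idea could look roughly like the sketch below. Everything here is hypothetical: the word-splitting pattern, the function names, and the boosted value of 95 are placeholders picked for illustration, and the boost is restricted to the exact-substring case as the more conservative option.

```python
import re

# Hypothetical word splitter for this sketch only.
WORD_SPLIT = re.compile(r"[,.?!:;\s]+")


def words(text):
    """Lower-cased word tokens of text, without empty strings."""
    return [w.lower() for w in WORD_SPLIT.split(text) if w]


def covers_all_words(query, tm_source):
    """True if every word of the query also occurs in the TM source."""
    return set(words(query)) <= set(words(tm_source))


def adjusted_score(query, tm_source, base_score, boosted=95.0):
    """Raise base_score when the TM source contains the query verbatim.

    `boosted` is an arbitrary placeholder for "somewhere close to 100%".
    """
    if query == tm_source:
        return 100.0
    if query and query in tm_source:
        return max(base_score, boosted)
    return base_score


# Strings and percentages taken from the bug description.
MSGID = ("<guilabel>Invalid Subscriptions</guilabel>: "
         "indicated by a red square symbol.")
TM1 = "<guilabel>Invalid Subscriptions</guilabel>"
TM3 = (MSGID + " These are systems that have products installed, but have "
       "not consumed a subscription. These systems need attention immediately.")

if __name__ == "__main__":
    print(adjusted_score(MSGID, TM1, 54.0), covers_all_words(MSGID, TM1))
    print(adjusted_score(MSGID, TM3, 37.0), covers_all_words(MSGID, TM3))
```

Under these assumptions TM match 3 would move above TM match 1, which is the ordering requested in the original report.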

Comment 21 David Mason 2015-06-11 03:00:34 UTC
> If I change the split regex from "[,.]?[\s]+" to "[,\.\?!\s]+[\s]?", above example will show 100%. There are more punctuation we need to cover (e.g. ";", ")", "(", "-" etc).

I think we need more work on the regex to make sure it behaves how we expect. An important step is to define a thorough set of example strings that cover all our commonly supported forms of strings, and put them in unit tests to make sure that any regex change fits with what we expect.

A few notes:

 - The original problem reported in this bug would be solved by matching '<' and '>' as a break point, or by matching an entire tag as a break point (but then we need to be very careful to make sure it works sensibly for a string like "Home > About").

 - A URL should ideally be a single token (the above would break it on the '.'s in the domain name).
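The notes above can be turned into a small set of executable examples. The tokenizer below is a hypothetical sketch, not the regex under discussion: it extracts tokens instead of splitting, treats a whole tag or URL as one token, and ignores a bare '>' as in "Home > About". The example URL is made up.

```python
import re

TOKEN = re.compile(
    r"https?://[^\s<>]*[^\s<>.,;:!?)]"  # URL as one token, minus trailing punctuation
    r"|</?[A-Za-z][^<>]*>"              # a whole opening or closing tag
    r"|\w+(?:['-]\w+)*"                 # a word, allowing inner ' or -
)


def tokenize(text):
    """Return tags, URLs, and words as tokens; other punctuation is dropped."""
    return TOKEN.findall(text)


# Example strings with the tokens we would expect, usable as unit-test cases.
CASES = {
    "<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol.":
        ["<guilabel>", "Invalid", "Subscriptions", "</guilabel>",
         "indicated", "by", "a", "red", "square", "symbol"],
    "Home > About": ["Home", "About"],
    "Visit http://example.com/help.": ["Visit", "http://example.com/help"],
    "you.what?yo": ["you", "what", "yo"],
}

if __name__ == "__main__":
    for text, expected in CASES.items():
        result = tokenize(text)
        print("ok  " if result == expected else "FAIL", repr(text), result)
```

Keeping the expected tokens next to each example string makes it easy to see which forms a later regex change would break.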

Comment 22 Luke Brooker 2015-06-11 03:22:57 UTC
This bug is also relevant:

Bug 1111021

https://bugzilla.redhat.com/show_bug.cgi?id=1111021

Comment 23 Zanata Migrator 2015-07-31 01:44:42 UTC
Migrated; check JIRA for bug status: http://zanata.atlassian.net/browse/ZNTA-523