Bug 805737 - Incorrect / unexpected relative percentages of TM matches
Status: CLOSED UPSTREAM
Product: Zanata
Classification: Community
Component: Component-UI
Version: 1.5
Hardware: Unspecified Linux
Priority: high, Severity: high
Assigned To: Alex Eng
Zanata-QA Mailling List
https://translate.engineering.redhat....
: screened
Depends On:
Blocks:
Reported: 2012-03-21 19:44 EDT by Hedda Peters
Modified: 2015-07-30 21:44 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-07-30 21:44:42 EDT

Attachments
See msgid and TM match 1 & 3 (226.81 KB, image/png)
2012-03-21 19:45 EDT, Hedda Peters
See msgid and TM match 1 & 2 (209.62 KB, image/png)
2012-03-21 20:33 EDT, Hedda Peters
Bizarre example for unexpected TM match percentages (313.75 KB, image/png)
2012-05-08 20:08 EDT, Hedda Peters

Description Hedda Peters 2012-03-21 19:44:22 EDT
User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1

I found unexpected relative percentages of TM matches in the following example:

msgid:
"<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol."

TM match 1 with 54%:
"<guilabel>Invalid Subscriptions</guilabel>"

TM match 3 with 37%:
"<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol. These are systems that have products installed, but have not consumed a subscription. These systems need attention immediately."

I would have expected TM match 3 to be higher up the list of matches, since it contains the entire msgid. TM match 1 only contains part of the msgid.

Reproducible: Didn't try

Actual Results:  
The (in my mind) best TM match is in position 3 with 37%.

Expected Results:  
It should be in position 1.
Comment 1 Hedda Peters 2012-03-21 19:45:59 EDT
Created attachment 571889
See msgid and TM match 1 & 3
Comment 2 Hedda Peters 2012-03-21 20:32:27 EDT
Another good example: 

msgid:
"These systems need attention immediately."

TM match 1 with 39%:
"Start up pdb immediately."

TM match 2 with 32%:
"These are systems that have products installed, but have not consumed a subscription. These systems need attention immediately."


Again, TM match 2 is a very good match that deserves a higher position / higher percentage.
Comment 3 Hedda Peters 2012-03-21 20:33:23 EDT
Created attachment 571890
See msgid and TM match 1 & 2
Comment 4 Hedda Peters 2012-05-08 20:07:48 EDT
Please see attached screenshot for an example of really bizarre TM percentages.

TM match #6 contains the entire English string. Even so, other TM matches that contain only piecemeal matches here and there get a higher percentage.

TM matches #1-#5 are effectively useless. TM match #6 is a perfectly usable match, but will go unnoticed due to its low position.

May I ask that this be treated with a somewhat higher priority than the currently assigned "low"? I'm afraid people will learn to disregard TM matches if they seem useless.
Comment 5 Hedda Peters 2012-05-08 20:08:48 EDT
Created attachment 583083
Bizarre example for unexpected TM match percentages
Comment 6 Sean Flanigan 2012-06-25 03:38:29 EDT
We need to do a better job of identifying long substrings that exactly match the query, and weighting them appropriately.
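
One way to picture the weighting suggested here is to measure the longest run of consecutive query words that also appears in a TM entry. The sketch below is only an illustration, not Zanata's implementation: `longest_run_ratio` is a hypothetical helper, words are split on whitespace only, and the example strings are taken from comment 2.

```python
from difflib import SequenceMatcher


def longest_run_ratio(query, candidate):
    """Fraction of the query's words covered by the longest contiguous
    run of words that also appears, in the same order, in the candidate.
    Hypothetical helper; whitespace tokenization is an assumption."""
    q, c = query.split(), candidate.split()
    if not q:
        return 0.0
    matcher = SequenceMatcher(None, q, c, autojunk=False)
    match = matcher.find_longest_match(0, len(q), 0, len(c))
    return match.size / len(q)


query = "These systems need attention immediately."
match_1 = "Start up pdb immediately."
match_2 = ("These are systems that have products installed, but have not "
           "consumed a subscription. These systems need attention immediately.")

print(longest_run_ratio(query, match_1))  # only "immediately." is shared
print(longest_run_ratio(query, match_2))  # the whole query appears as one run
```

With this measure, match 2 scores 1.0 and match 1 scores 0.2, which is the ordering the reporter expected. How such a value would be combined with the existing similarity score is left open here.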
Comment 7 Sean Flanigan 2012-07-30 21:39:05 EDT
Note that Zanata 1.7 uses a new algorithm for calculating similarity, based on words.  See bug 825202
Comment 8 Hedda Peters 2012-07-30 21:40:45 EDT
(In reply to comment #7)
> Note that Zanata 1.7 uses a new algorithm for calculating similarity, based
> on words.  See bug 825202


Oh great!!  I'll keep an eye on that.

Thanks
Comment 9 Michelle Kim 2012-11-13 01:38:53 EST
Hi Hedda,

Another bug triage effort from me :-) Would you mind letting us know whether this bug is still an issue in 2.0?
Comment 10 Hedda Peters 2012-11-13 19:17:59 EST
Michelle, Zanata 2.0 didn't display any TM results lower than 100% at first. (Not sure if that's fixed yet - in some projects I still see only 100% or no results, while other projects also show the occasional lower % result.)

I therefore can't tell you one way or the other at the moment. I will have to keep an eye on it once the TM shows several results with different percentages again.
Comment 11 Michelle Kim 2012-11-14 18:52:23 EST
Thanks Hedda. Please keep an eye on it and see whether this happens again in 2.0.
Comment 13 Damian Jansen 2014-02-28 00:45:07 EST
Retested at 54d204020b600be1e8e3c1a9a357a0e02e832861

I'm going to mark this as good-to-go.
Comment 14 Hedda Peters 2014-09-23 19:17:03 EDT
May I reopen this BZ?

Not impressed by a TM result I found in the current 3.4.2 version.
What clearly appears to be the best TM match only gets a 33% similarity value, while a match that is not as good gets 40%.

Screenshot to follow.
Comment 16 Michelle Kim 2015-02-23 20:33:06 EST
Hi Hedda,

Have you seen this happen again recently? We will look into it.
Comment 17 Hedda Peters 2015-02-23 21:03:45 EST
Michelle,

I am not taking notice of the TM percentages anymore. That is what I have learned from this bug being open for almost three years now.

The TM has overall taken a backseat for me because of these and other issues. It is not as usable as it should be, so I don't use it as much anymore.
Comment 18 Michelle Kim 2015-02-24 22:56:33 EST
Hi Hedda,

I will take the chance to encourage the team to revisit and research the TM algorithm again, to see whether we can improve the accuracy of the TM percentages.
Comment 19 Damian Jansen 2015-03-30 00:57:35 EDT
Upgrading to high - quite damaging to the concept of TM matching.
Comment 20 Patrick Huang 2015-06-01 19:31:45 EDT
Just found a bug that's related to the TM similarity percentage algorithm.

Let's say we have two strings:
- "you.what?yo"
- "you. what? yo..."

The current algorithm gives a 0% match because we tokenize the string on punctuation PLUS a space. When users leave out the space (intentionally or not), this produces a completely incorrect result.

If I change the split regex from "[,.]?[\s]+" to "[,\.\?!\s]+[\s]?", the above example shows 100%. There is more punctuation we need to cover (e.g. ";", ")", "(", "-" etc.).
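
The effect of the two split regexes on the example strings can be checked with a short sketch. The regexes are the ones quoted in this comment. The `overlap` function is only a stand-in (shared distinct tokens over all distinct tokens) and is an assumption, since the actual similarity formula is not given in this bug.

```python
import re

OLD_SPLIT = re.compile(r"[,.]?[\s]+")        # split regex described as current
NEW_SPLIT = re.compile(r"[,\.\?!\s]+[\s]?")  # split regex proposed above


def tokens(pattern, text):
    """Split text on the pattern and drop empty pieces."""
    return [t for t in pattern.split(text) if t]


def overlap(a, b):
    """Stand-in similarity (an assumption, not the real algorithm):
    shared distinct tokens divided by all distinct tokens."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0


s1 = "you.what?yo"
s2 = "you. what? yo..."
for name, pattern in (("old", OLD_SPLIT), ("new", NEW_SPLIT)):
    t1, t2 = tokens(pattern, s1), tokens(pattern, s2)
    print(name, t1, t2, overlap(t1, t2))
```

With the old regex the first string stays a single token ("you.what?yo") and the second becomes "you", "what?", "yo...", so nothing is shared. With the new regex both strings become "you", "what", "yo".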

Hedda's original problem in bug description is a different issue. We may tackle it differently. Given the example:
msgid:
"<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol."

TM match 1 with 54%:
"<guilabel>Invalid Subscriptions</guilabel>"

TM match 3 with 37%:
"<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol. These are systems that have products installed, but have not consumed a subscription. These systems need attention immediately."

If we tokenize the incoming message by words, and a TM entry can provide a translation for every word in that message (as TM match 3 does), we should give it a higher percentage (if not 100%, then at least somewhere close to it when it is not an exact substring match). This requires a bit of post-processing. Maybe we only do it when we can find an exact substring match.
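
A rough sketch of that post-processing step, under stated assumptions: `coverage`, `adjusted_score` and the `cap` value of 0.99 are hypothetical names and numbers chosen for illustration, tokens are split on whitespace only, and the base scores are the percentages reported in this bug.

```python
def coverage(query, candidate):
    """Fraction of the query's whitespace-separated tokens found in the candidate."""
    q = query.split()
    c = set(candidate.split())
    return sum(1 for tok in q if tok in c) / len(q) if q else 0.0


def adjusted_score(query, candidate, base_score, cap=0.99):
    """Hypothetical post-processing: when the candidate contains the query
    verbatim, raise the score to the token coverage, capped below 100%
    because the candidate is not identical to the query."""
    if query and query in candidate:
        return max(base_score, min(cap, coverage(query, candidate)))
    return base_score


msgid = "<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol."
match_1 = "<guilabel>Invalid Subscriptions</guilabel>"
match_3 = ("<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square "
           "symbol. These are systems that have products installed, but have not "
           "consumed a subscription. These systems need attention immediately.")

print(adjusted_score(msgid, match_1, 0.54))  # no verbatim containment, unchanged
print(adjusted_score(msgid, match_3, 0.37))  # contains the msgid, raised to the cap
```

Restricting the boost to verbatim containment keeps the change conservative: match 1 keeps its 54%, while match 3 moves to the top of the list.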
Comment 21 David Mason 2015-06-10 23:00:34 EDT
> If I change the split regex from "[,.]?[\s]+" to "[,\.\?!\s]+[\s]?", the above example shows 100%. There is more punctuation we need to cover (e.g. ";", ")", "(", "-" etc.).

I think the regex needs more work to make sure it behaves as we expect. An important step is to define a thorough set of example strings that cover all our commonly supported forms of strings, and put them in unit tests so that any regex change matches what we expect.

A few notes:

 - the original problem reported in this bug would be solved by matching '<' and '>' as break points, or by matching an entire tag as a break point (but then we need to be very careful to make sure it works sensibly for a string like "Home > About").

 - a URL should ideally be a single token (the regex above would break it on the '.'s in the domain name).
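
The notes above could be captured as a small table of example strings with their expected tokens. The tokenizer below is only a sketch to make such a table concrete: the `TOKEN` pattern and `tokenize` are hypothetical, not the regex used in Zanata, and known gaps remain (for instance, a URL followed directly by a full stop would swallow the full stop, and text like "a <b and c> d" would be read as a tag).

```python
import re

# Alternatives are tried in order at each position.
TOKEN = re.compile(
    r"https?://[^\s<>]+"      # a URL stays one token
    r"|</?[A-Za-z][^<>]*>"    # a whole tag such as <guilabel> or </guilabel>
    r"|\w+"                   # a run of word characters
)


def tokenize(text):
    """Return URLs, whole tags and words; other punctuation is dropped."""
    return TOKEN.findall(text)


# Example strings and the tokens we would expect for them.
CASES = {
    "you.what?yo": ["you", "what", "yo"],
    "you. what? yo...": ["you", "what", "yo"],
    "<guilabel>Invalid Subscriptions</guilabel>: indicated":
        ["<guilabel>", "Invalid", "Subscriptions", "</guilabel>", "indicated"],
    "Home > About": ["Home", "About"],
    "See https://bugzilla.redhat.com/show_bug.cgi?id=1111021 now":
        ["See", "https://bugzilla.redhat.com/show_bug.cgi?id=1111021", "now"],
}

for text, expected in CASES.items():
    assert tokenize(text) == expected, (text, tokenize(text))
```

Keeping the cases in one table makes it cheap to add a new string form whenever the regex changes.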
Comment 22 Luke Brooker 2015-06-10 23:22:57 EDT
This bug is also relevant

Bug 1111021

https://bugzilla.redhat.com/show_bug.cgi?id=1111021
Comment 23 Zanata Migrator 2015-07-30 21:44:42 EDT
Migrated; check JIRA for bug status: http://zanata.atlassian.net/browse/ZNTA-523
