Bug 805737 - Incorrect / unexpected relative percentages of TM matches
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Zanata
Classification: Retired
Component: Component-UI
Version: 1.5
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Alex Eng
QA Contact: Zanata-QA Mailing List
URL: https://translate.engineering.redhat....
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-03-21 23:44 UTC by Hedda Peters
Modified: 2015-07-31 01:44 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-07-31 01:44:42 UTC
Embargoed:


Attachments
See msgid and TM match 1 & 3 (226.81 KB, image/png)
2012-03-21 23:45 UTC, Hedda Peters
See msgid and TM match 1 & 2 (209.62 KB, image/png)
2012-03-22 00:33 UTC, Hedda Peters
Bizarre example for unexpected TM match percentages (313.75 KB, image/png)
2012-05-09 00:08 UTC, Hedda Peters


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 825202 0 unspecified CLOSED RFE: Translation Memory suggestions should not consider a portion of word. 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 831056 0 medium CLOSED RFE: [Translation Memory] Option for highlight only the search terms 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1111021 0 medium CLOSED Translation Memory search option does not display all the occurrences of a term 2021-02-22 00:41:40 UTC

Internal Links: 825202 831056 1111021

Description Hedda Peters 2012-03-21 23:44:22 UTC
User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1

I found unexpected relative percentages of TM matches in the following example:

msgid:
"<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol."

TM match 1 with 54%:
"<guilabel>Invalid Subscriptions</guilabel>"

TM match 3 with 37%:
"<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol. These are systems that have products installed, but have not consumed a subscription. These systems need attention immediately."

I would have expected TM match 3 to be higher up the list of matches, since it contains the entire msgid. TM match 1 only contains part of the msgid.

Reproducible: Didn't try

Actual Results:  
The (in my mind) best TM match is in position 3 with 37%.

Expected Results:  
Should be on position 1

Comment 1 Hedda Peters 2012-03-21 23:45:59 UTC
Created attachment 571889 [details]
See msgid and TM match 1 & 3

Comment 2 Hedda Peters 2012-03-22 00:32:27 UTC
Another good example: 

msgid:
"These systems need attention immediately."

TM match 1 with 39%:
"Start up pdb immediately."

TM match 2 with 32%:
"These are systems that have products installed, but have not consumed a subscription. These systems need attention immediately."


Again, TM match 2 is a very good match that deserves a higher position / higher percentage.

Comment 3 Hedda Peters 2012-03-22 00:33:23 UTC
Created attachment 571890 [details]
See msgid and TM match 1 & 2

Comment 4 Hedda Peters 2012-05-09 00:07:48 UTC
Please see attached screenshot for an example of really bizarre TM percentages.

TM match #6 contains the entire English string. Even still, other TM matches that contain only piecemeal matches here and there get a higher percentage.

TM match #1-#5 are effectively useless. TM match #6 is a perfectly usable match, but will go unnoticed due to its low position.

May I ask that this be treated with a somewhat higher priority than the currently assigned "low"? I'm afraid people will learn to disregard TM matches if they seem useless.

Comment 5 Hedda Peters 2012-05-09 00:08:48 UTC
Created attachment 583083 [details]
Bizarre example for unexpected TM match percentages

Comment 6 Sean Flanigan 2012-06-25 07:38:29 UTC
We need to do a better job of identifying long substrings that exactly match the query, and weighting them appropriately.
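
One way to picture this, as a minimal sketch and not Zanata's actual implementation: measure how much of the query is covered by the longest run of characters it shares exactly with a TM entry. The function name `longest_match_coverage` is hypothetical; `difflib.SequenceMatcher` comes from Python's standard library. The strings are the ones from comment 2.

```python
from difflib import SequenceMatcher


def longest_match_coverage(query: str, candidate: str) -> float:
    """Fraction of the query covered by its longest exact common substring.

    Hypothetical helper for illustration only.
    """
    if not query:
        return 0.0
    # autojunk=False keeps difflib from ignoring frequent characters.
    matcher = SequenceMatcher(None, query, candidate, autojunk=False)
    match = matcher.find_longest_match(0, len(query), 0, len(candidate))
    return match.size / len(query)


query = "These systems need attention immediately."
short_entry = "Start up pdb immediately."
long_entry = ("These are systems that have products installed, but have not "
              "consumed a subscription. These systems need attention immediately.")

# The short entry only shares " immediately." with the query.
print(longest_match_coverage(query, short_entry))
# The long entry contains the whole query, so coverage is 1.0.
print(longest_match_coverage(query, long_entry))
```

A weighting built on a value like this would put the longer entry ahead of the shorter one, which is the ordering requested in comment 2.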

Comment 7 Sean Flanigan 2012-07-31 01:39:05 UTC
Note that Zanata 1.7 uses a new algorithm for calculating similarity, based on words.  See bug 825202

Comment 8 Hedda Peters 2012-07-31 01:40:45 UTC
(In reply to comment #7)
> Note that Zanata 1.7 uses a new algorithm for calculating similarity, based
> on words.  See bug 825202


Oh great!!  I'll keep an eye on that.

Thanks

Comment 9 Michelle Kim 2012-11-13 06:38:53 UTC
Hi Hedda,

Another bug triage effort from me :-) Would you mind letting us know whether this bug is still an issue in 2.0?

Comment 10 Hedda Peters 2012-11-14 00:17:59 UTC
Michelle, Zanata 2.0 didn't display any TM results lower than 100% at first. (Not sure if that's fixed yet - in some projects I still see only 100% or no results, other projects also show the occasional lower % result)

I therefore can't tell you one way or the other at the moment, I will have to keep an eye on it once the TM shows several results again with different percentages.

Comment 11 Michelle Kim 2012-11-14 23:52:23 UTC
Thanks Hedda. Please keep an eye on it and see whether this happens again in 2.0.

Comment 13 Damian Jansen 2014-02-28 05:45:07 UTC
Retested at 54d204020b600be1e8e3c1a9a357a0e02e832861

I'm going to mark this as good-to-go.

Comment 14 Hedda Peters 2014-09-23 23:17:03 UTC
May I reopen this BZ?

Not impressed by a TM result I found in the current 3.4.2 version.
What clearly appears to be the best TM match only gets a 33% similarity value, while a match that is not as good gets 40%.

Screenshot to follow.

Comment 16 Michelle Kim 2015-02-24 01:33:06 UTC
Hi Hedda,

Have you seen this happen recently again? We will look into it.

Comment 17 Hedda Peters 2015-02-24 02:03:45 UTC
Michelle,

I am not taking notice of the TM percentages anymore - that is what I have learned from this bug being open for almost three years now.

The TM has overall taken a backseat for me because of these and other issues. It is not as usable as it should be, so consequently I don't use it as much anymore.

Comment 18 Michelle Kim 2015-02-25 03:56:33 UTC
Hi Hedda,

I will take the chance to encourage the team to revisit and research the TM algorithm again to see if we can improve the accuracy of the TM percentages.

Comment 19 Damian Jansen 2015-03-30 04:57:35 UTC
Upgrading to high - quite damaging to the concept of TM matching.

Comment 20 Patrick Huang 2015-06-01 23:31:45 UTC
Just found a bug that's related to TM similarity percentage algorithm.

Let's say we have two strings:
- "you.what?yo"
- "you. what? yo..."

The current algorithm gives a 0% match because we tokenize the string on punctuation PLUS a space. In cases where users leave out the space (regardless of their intention), this produces a completely incorrect result.

If I change the split regex from "[,.]?[\s]+" to "[,\.\?!\s]+[\s]?", the above example shows 100%. There is more punctuation we need to cover (e.g. ";", ")", "(", "-" etc).
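
The effect of the two split regexes can be reproduced with a small sketch. The regexes are the ones quoted above; the scoring function `token_overlap` is a hypothetical stand-in (shared tokens divided by the longer token count) and is not Zanata's actual similarity formula.

```python
import re
from collections import Counter

OLD_SPLIT = re.compile(r"[,.]?[\s]+")        # punctuation only counts before whitespace
NEW_SPLIT = re.compile(r"[,\.\?!\s]+[\s]?")  # punctuation alone also splits


def tokenize(text, splitter):
    """Split text and drop the empty strings left at the edges."""
    return [token for token in splitter.split(text) if token]


def token_overlap(a, b, splitter):
    """Shared tokens divided by the longer token count (assumed metric)."""
    tokens_a, tokens_b = tokenize(a, splitter), tokenize(b, splitter)
    if not tokens_a or not tokens_b:
        return 0.0
    shared = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    return shared / max(len(tokens_a), len(tokens_b))


a = "you.what?yo"
b = "you. what? yo..."

print(tokenize(a, OLD_SPLIT), tokenize(b, OLD_SPLIT))
print(token_overlap(a, b, OLD_SPLIT))  # 0.0: no token in common
print(tokenize(a, NEW_SPLIT), tokenize(b, NEW_SPLIT))
print(token_overlap(a, b, NEW_SPLIT))  # 1.0: identical token lists
```

With the old regex the first string stays a single token, `you.what?yo`, so it shares nothing with the tokens of the second string.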

Hedda's original problem in bug description is a different issue. We may tackle it differently. Given the example:
msgid:
"<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol."

TM match 1 with 54%:
"<guilabel>Invalid Subscriptions</guilabel>"

TM match 3 with 37%:
"<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol. These are systems that have products installed, but have not consumed a subscription. These systems need attention immediately."

If we tokenize the incoming message by words, and a TM entry can provide a translation for all the words in that message (as in TM match 3), we should give it a higher percentage (if not 100%, then at least somewhere close to it when it is not an exact substring match). This requires a bit of post-processing. Maybe we only do it when we can find an exact substring match.
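
A rough sketch of that post-processing step, limited to the exact-substring case. Everything here is an assumption for illustration: the whitespace tokenizer, the base score, the name `adjusted_similarity`, and the floor value of 0.9 are hypothetical and do not come from Zanata.

```python
from collections import Counter


def base_similarity(query, candidate):
    """Assumed word-based score: shared words over the longer word count."""
    q, c = query.split(), candidate.split()
    if not q or not c:
        return 0.0
    shared = sum((Counter(q) & Counter(c)).values())
    return shared / max(len(q), len(c))


def adjusted_similarity(query, candidate, containment_floor=0.9):
    """Raise the score when the TM entry contains the whole query verbatim."""
    score = base_similarity(query, candidate)
    if query and query != candidate and query in candidate:
        score = max(score, containment_floor)
    return score


msgid = ("<guilabel>Invalid Subscriptions</guilabel>: "
         "indicated by a red square symbol.")
tm_match_1 = "<guilabel>Invalid Subscriptions</guilabel>"
tm_match_3 = ("<guilabel>Invalid Subscriptions</guilabel>: indicated by a red "
              "square symbol. These are systems that have products installed, "
              "but have not consumed a subscription. These systems need "
              "attention immediately.")

for entry in (tm_match_1, tm_match_3):
    print(base_similarity(msgid, entry), adjusted_similarity(msgid, entry))
```

Under these assumptions TM match 3 is lifted to the floor value because it contains the msgid verbatim, while TM match 1 keeps its base score, so match 3 sorts first.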

Comment 21 David Mason 2015-06-11 03:00:34 UTC
> If I change the split regex from "[,.]?[\s]+" to "[,\.\?!\s]+[\s]?", the above example shows 100%. There is more punctuation we need to cover (e.g. ";", ")", "(", "-" etc).

I think we need more work on the regex to make sure it behaves how we expect. An important step is to define a thorough set of example strings that cover all our commonly supported forms of strings, and put them in unit tests to make sure that any regex change fits with what we expect.

A few notes:

 - the original problem reported in this bug would be solved by matching '<' and '>' as a break-point, or by matching an entire tag as a break point (but then we need to be very careful to make sure it works sensibly for a string like "Home > About")

 - A URL should ideally be a single token (the above would break it on the '.'s in the domain name).
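
As a sketch of what such a table of example strings could look like, the snippet below pairs a candidate tokenizer with expected token lists. The tokenizer is hypothetical and only illustrates the notes above: whole tags act as break points, a bare '>' is ordinary punctuation, and a URL stays a single token. It is not the regex used in Zanata.

```python
import re

# Alternatives are tried in order: URL first, then a whole tag, then a word.
TOKEN = re.compile(
    r"(?P<url>https?://[^\s<>]+)"
    r"|(?P<tag></?[A-Za-z][^<>]*>)"
    r"|(?P<word>\w+)"
)


def tokenize(text):
    """Return URLs and words; tags and punctuation only separate tokens."""
    return [m.group() for m in TOKEN.finditer(text) if m.lastgroup != "tag"]


# Example strings with the tokens we expect; extend as new forms come up.
CASES = [
    ("<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol.",
     ["Invalid", "Subscriptions", "indicated", "by", "a", "red", "square", "symbol"]),
    ("<guilabel>Invalid Subscriptions</guilabel>", ["Invalid", "Subscriptions"]),
    ("Home > About", ["Home", "About"]),
    ("you.what?yo", ["you", "what", "yo"]),
    ("you. what? yo...", ["you", "what", "yo"]),
    ("See http://zanata.atlassian.net/browse/ZNTA-523 for status",
     ["See", "http://zanata.atlassian.net/browse/ZNTA-523", "for", "status"]),
]

for text, expected in CASES:
    assert tokenize(text) == expected, (text, tokenize(text))
```

Any change to the tokenizing regex can then be checked against the same table before it ships.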

Comment 22 Luke Brooker 2015-06-11 03:22:57 UTC
This bug is also relevant

Bug 1111021

https://bugzilla.redhat.com/show_bug.cgi?id=1111021

Comment 23 Zanata Migrator 2015-07-31 01:44:42 UTC
Migrated; check JIRA for bug status: http://zanata.atlassian.net/browse/ZNTA-523

