I found unexpected relative percentages of TM matches in the following example:

msgid: "<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol."

TM match 1 with 54%: "<guilabel>Invalid Subscriptions</guilabel>"

TM match 3 with 37%: "<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol. These are systems that have products installed, but have not consumed a subscription. These systems need attention immediately."

I would have expected TM match 3 to be higher up the list of matches, since it contains the entire msgid. TM match 1 only contains part of the msgid.

Reproducible: Didn't try

Actual Results: The (in my mind) best TM match is on position 3 with 37%

Expected Results: Should be on position 1
Created attachment 571889 [details] See msgid and TM match 1 & 3
Another good example:

msgid: "These systems need attention immediately."

TM match 1 with 39%: "Start up pdb immediately."

TM match 2 with 32%: "These are systems that have products installed, but have not consumed a subscription. These systems need attention immediately."

Again, TM match 2 is a very good match that deserves a higher position / higher percentage.
Created attachment 571890 [details] See msgid and TM match 1 & 2
Please see the attached screenshot for an example of really bizarre TM percentages. TM match #6 contains the entire English string. Even so, other TM matches that contain only piecemeal matches here and there get a higher percentage. TM matches #1-#5 are effectively useless. TM match #6 is a perfectly usable match, but will go unnoticed due to its low position.

May I ask to treat this with a little higher priority than the currently assigned "low"? I'm afraid people will learn to disregard TM matches if they seem useless.
Created attachment 583083 [details] Bizarre example for unexpected TM match percentages
We need to do a better job of identifying long substrings that exactly match the query, and weighting them appropriately.
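One way to sketch that weighting is a containment-aware scoring pass (a minimal sketch, not Zanata's actual code: `score`, the `SequenceMatcher` baseline, and the 0.9 floor are all illustrative assumptions):

```python
from difflib import SequenceMatcher

def score(query: str, candidate: str) -> float:
    """Baseline character similarity, with a hypothetical boost
    when the candidate contains the entire query verbatim."""
    base = SequenceMatcher(None, query, candidate).ratio()
    if query and query in candidate:
        # Assumed floor: a full-containment match should outrank
        # candidates that only overlap the query partially.
        return max(base, 0.9)
    return base

msgid = ("<guilabel>Invalid Subscriptions</guilabel>: "
         "indicated by a red square symbol.")
match1 = "<guilabel>Invalid Subscriptions</guilabel>"
match3 = (msgid + " These are systems that have products installed, "
          "but have not consumed a subscription. "
          "These systems need attention immediately.")

# With the boost, the candidate containing the whole msgid ranks first,
# matching the reporter's expectation for TM match 3.
assert score(msgid, match3) > score(msgid, match1)
```

A longer TM unit that fully contains the query would then rank above a shorter unit that only covers part of it, which is exactly the reordering asked for in the original report.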
Note that Zanata 1.7 uses a new algorithm for calculating similarity, based on words. See bug 825202
(In reply to comment #7)
> Note that Zanata 1.7 uses a new algorithm for calculating similarity, based
> on words. See bug 825202

Oh great! I'll keep an eye on that. Thanks
Hi Hedda, Another bug triage effort from me :-) If you don't mind, could you let us know whether this bug is still an issue in 2.0?
Michelle, Zanata 2.0 didn't display any TM results lower than 100% at first. (Not sure if that's fixed yet - in some projects I still see only 100% or no results, while other projects also show the occasional lower % result.) I therefore can't tell you one way or the other at the moment; I will have to keep an eye on it once the TM shows several results again with different percentages.
Thanks Hedda, Please keep an eye on it and see if this happens again in 2.0.
This paper may be relevant for researching the problem: http://web.archive.org/web/20070824111435/http://compbio.cs.sfu.ca/publications/icalp.cameraready.pdf
Retested at 54d204020b600be1e8e3c1a9a357a0e02e832861 I'm going to mark this as good-to-go.
May I reopen this BZ? I'm not impressed by a TM result I found in the current 3.4.2 version. What clearly appears to be the best TM match only gets a 33% similarity value, while a match that is not as good gets 40%. Screenshot to follow.
Hi Hedda, Have you seen this happen recently again? We will look into it.
Michelle, I am not taking notice of the TM percentages anymore - that is my takeaway from this bug being open for almost three years now. The TM has overall taken a backseat for me because of these and other issues. It is not as usable as it should be, so consequently I don't use it as much anymore.
Hi Hedda, I will take the chance to encourage the Team to revisit and research the TM algorithm again to see if we can improve any accuracy in TM percentage.
Upgrading to high - quite damaging to the concept of TM matching.
Just found a bug that's related to the TM similarity percentage algorithm. Let's say we have two strings:

- "you.what?yo"
- "you. what? yo..."

The current algorithm will give a 0% match because we tokenize the string by punctuation PLUS a space. In cases where users left out the space (regardless of intention here), this causes a completely incorrect result. If I change the split regex from "[,.]?[\s]+" to "[,\.\?!\s]+[\s]?", the above example will show 100%. There are more punctuation marks we need to cover (e.g. ";", ")", "(", "-" etc.).

Hedda's original problem in the bug description is a different issue. We may tackle it differently. Given the example:

msgid: "<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol."
TM match 1 with 54%: "<guilabel>Invalid Subscriptions</guilabel>"
TM match 3 with 37%: "<guilabel>Invalid Subscriptions</guilabel>: indicated by a red square symbol. These are systems that have products installed, but have not consumed a subscription. These systems need attention immediately."

If we tokenize the incoming message by words, and a TM unit can provide translation for all the words in that message (as in TM match 3), we should give it a higher percentage (if not 100%, at least somewhere close to it when it's not an exact substring match). This requires a bit of post-processing. Maybe only do it when we can find an exact substring match.
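The tokenization difference described above can be demonstrated with a short sketch (Python stand-ins for the two regexes quoted in the comment; the `tokenize` helper and empty-token filtering are illustrative assumptions, not the actual Zanata code):

```python
import re

OLD_SPLIT = r"[,.]?[\s]+"        # current: punctuation only breaks when followed by whitespace
NEW_SPLIT = r"[,\.\?!\s]+[\s]?"  # proposed: punctuation alone also breaks tokens

def tokenize(text: str, pattern: str) -> list[str]:
    # Drop empty tokens left over from leading/trailing separators.
    return [t for t in re.split(pattern, text) if t]

a = "you.what?yo"
b = "you. what? yo..."

# Old regex: "you.what?yo" has no whitespace, so nothing splits at all,
# and "what?" / "yo..." keep their trailing punctuation -> 0% token overlap.
print(tokenize(a, OLD_SPLIT))  # ['you.what?yo']
print(tokenize(b, OLD_SPLIT))  # ['you', 'what?', 'yo...']

# New regex: both strings tokenize identically -> 100% match.
print(tokenize(a, NEW_SPLIT))  # ['you', 'what', 'yo']
print(tokenize(b, NEW_SPLIT))  # ['you', 'what', 'yo']
```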
> If I change the split regex from "[,.]?[\s]+" to "[,\.\?!\s]+[\s]?", the above example will show 100%. There are more punctuation marks we need to cover (e.g. ";", ")", "(", "-" etc.).

I think we need more work on the regex to make sure it behaves how we expect. An important step is to define a thorough set of example strings that cover all our commonly supported forms of strings, and put them in unit tests to make sure that any regex change fits with what we expect.

A few notes:
- The original problem reported in this bug would be solved by matching '<' and '>' as a break-point, or by matching an entire tag as a break-point (but then we need to be very careful to make sure it works sensibly for a string like "Home > About").
- A URL should ideally be a single token (the above would break it on the '.'s in the domain name).
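One unit-testable way to sketch the tag and URL handling suggested above (Python; the `URL`/`TAG` patterns and the `tokenize` helper are illustrative assumptions, not the Zanata implementation):

```python
import re

URL = r"https?://\S+"   # keep a URL as one token instead of breaking on its '.'s
TAG = r"</?\w+>"        # keep a whole tag as one token
# Break-points, including '<' and '>' as suggested in the note above.
BREAKS = re.compile(r"[,\.\?!;:()\-\s<>]+")
PROTECTED = re.compile(f"({URL})|({TAG})")

def tokenize(text: str) -> list[str]:
    """Split on break-point characters, but emit URLs and tags whole."""
    tokens, pos = [], 0
    for m in PROTECTED.finditer(text):
        tokens += [t for t in BREAKS.split(text[pos:m.start()]) if t]
        tokens.append(m.group(0))
        pos = m.end()
    tokens += [t for t in BREAKS.split(text[pos:]) if t]
    return tokens

print(tokenize("<guilabel>Invalid Subscriptions</guilabel>: red square"))
# ['<guilabel>', 'Invalid', 'Subscriptions', '</guilabel>', 'red', 'square']

print(tokenize("See http://compbio.cs.sfu.ca/publications/icalp.cameraready.pdf here"))
# ['See', 'http://compbio.cs.sfu.ca/publications/icalp.cameraready.pdf', 'here']

# Caveat from the note above: treating '>' as a break-point still splits this.
print(tokenize("Home > About"))  # ['Home', 'About']
```

Cases like "Home > About" show why a suite of example strings in unit tests is needed before committing to any particular break-point set.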
This bug is also relevant Bug 1111021 https://bugzilla.redhat.com/show_bug.cgi?id=1111021
Migrated; check JIRA for bug status: http://zanata.atlassian.net/browse/ZNTA-523