Bug 1194543 - Manual document re-upload makes previous translations fuzzy
Summary: Manual document re-upload makes previous translations fuzzy
Alias: None
Product: Zanata
Classification: Retired
Component: Component-CopyTrans
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: 3.6
Assignee: David Mason
QA Contact: Damian Jansen
Depends On:
Blocks: Zanata-3.6.1
Reported: 2015-02-20 06:36 UTC by Yuko Katabami
Modified: 2015-04-20 00:29 UTC (History)

Fixed In Version: 3.7.0-SNAPSHOT (git-jenkins-zanata-server-github-pull-requests-3023)
Doc Type: Bug Fix
Doc Text:
Story Points: 3
Clone Of:
Last Closed: 2015-04-20 00:29:46 UTC


Links:
System: Red Hat Bugzilla
ID: 873489
Private: 0
Priority: medium
Status: CLOSED
Summary: XLIFF/Properties/PO upload should check that translations correspond to the current text flow contents
Last Updated: 2021-02-22 00:41:40 UTC

Internal Links: 873489

Description Yuko Katabami 2015-02-20 06:36:33 UTC
Description of problem:
My team has been working on a number of Drupal articles, many of which are written in Markdown. These have to be saved as .txt files and pushed to Zanata manually through the UI.

When a Markdown document is edited (e.g. one string removed) and re-uploaded to Zanata, all translations below the point where the source text changed are marked as fuzzy. It seems that Zanata cannot recognize a string once its position has shifted, even by such a minor change.

Version-Release number of selected component (if applicable): 3.5.1

How reproducible: 100%

Steps to Reproduce:
1. Create a file in Markdown, save it as a .txt file, and push it to Zanata.
2. Complete the translation.
3. Edit the source file (e.g. remove one word), and re-upload it to Zanata using the UI.

Actual results:
All strings below the point where the source text was changed are marked as fuzzy.

Expected results:
It should recognize the previous translations

Additional info:

Comment 2 David Mason 2015-02-24 02:11:52 UTC
This is the expected behaviour at the moment. Since there is no canonical identifier for each paragraph in a text document, the position is used instead. Adding or removing a paragraph will change the position number of every paragraph below it. Right now, Zanata interprets this as changing the source text, so it makes the translation fuzzy.

For example, if there is a document on Zanata with source and translations like so:

 A -> A'
 B -> B'
 C -> C'

uploading a new version of the document with the last 2 paragraphs swapped would lead to the following arrangement in Zanata:

 A -> A'
 C -> B' (fuzzy since source text is different)
 B -> C' (fuzzy since source text is different)
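
As a rough illustration of the behaviour described above, the following sketch keys each paragraph by its position and marks a translation fuzzy whenever the source at that position differs. It is not Zanata's actual code; the function names and the state labels are assumptions made for the example.

```python
def positional_ids(paragraphs):
    """Give each paragraph an id equal to its position (illustrative scheme)."""
    return {index: text for index, text in enumerate(paragraphs)}


def reupload(old_source, translations, new_source):
    """Return (source, translation, state) rows after re-uploading new_source.

    translations maps a position to the translated text stored for it.
    A translation stays "translated" only when the source text at the
    same position is unchanged; otherwise it becomes "fuzzy".
    """
    old_by_id = positional_ids(old_source)
    rows = []
    for index, text in positional_ids(new_source).items():
        translation = translations.get(index)
        if translation is None:
            state = "untranslated"
        elif old_by_id.get(index) == text:
            state = "translated"
        else:
            state = "fuzzy"
        rows.append((text, translation, state))
    return rows


# Swapping the last two paragraphs leaves B' attached to C and C' to B.
for row in reupload(["A", "B", "C"], {0: "A'", 1: "B'", 2: "C'"}, ["A", "C", "B"]):
    print(row)
```

Removing a paragraph has the same effect on everything below it, which matches the behaviour in the original report.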

We need a more intelligent merge algorithm to handle this in a sensible way. I designed such an algorithm some time ago, and I thought I had a bugzilla entry for it, but I cannot find it at the moment.

The basic approach is to do an initial pass over the document and find the highest similarity match in the current source strings for each string in the newly uploaded source. On the second pass it would:

 - Move strings (and their translations) to the position of their highest match (must be above the match threshold).
 - For strings that have a match that is below 100% but above the match threshold, update the source text. This will correctly make their translations fuzzy.
 - For new strings that have a high match to an existing source string, but are not the highest match to that source string, copy the translations to the new string as fuzzy translations. e.g. if someone duplicated a paragraph and changed 1 word in it, the changed version would get a fuzzy copy of the translation.
 - Remove any existing strings that have no match above the threshold in the new document.
 - Add any new strings that do not have a match above the threshold with any existing string. These will be untranslated in all languages.

In most cases that would mean that adding, removing or rearranging paragraphs without changing their contents would keep translations in their "translated" or "approved" state, and it should also deal reasonably well with the case where paragraphs have been modified in addition to being rearranged.

The threshold for whether an existing string is considered a close match is the main tricky part - if it is set to a low percentage (e.g. 50%), paragraphs might end up with some fuzzy translations that do not fit well, whereas if it is set to a high percentage (e.g. 99%), paragraphs with moderate modifications might lose their translations, rather than keeping them as fuzzy translations. I suspect the appropriate similarity to be between 70% and 90%.
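
The two-pass merge outlined in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: similarity is measured with Python's difflib.SequenceMatcher (the comment does not name a metric), the 0.8 threshold is an arbitrary value inside the suggested 70% to 90% range, and the function names and tuple layout are hypothetical.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # assumed value; the comment only suggests 70% to 90%


def similarity(a, b):
    """Similarity in [0, 1]; difflib is a stand-in for the unspecified metric."""
    if a == b:
        return 1.0
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def merge(old, new_source, threshold=THRESHOLD):
    """old: list of (source, translation, state); new_source: list of str.

    Returns (source, translation, state) rows in the order of new_source.
    Old strings that nothing matches are dropped simply by not being copied.
    """
    # Pass 1: best existing match for every newly uploaded string.
    best = []
    for text in new_source:
        scored = [(similarity(text, src), i) for i, (src, _, _) in enumerate(old)]
        score, old_idx = max(scored, default=(0.0, None))
        best.append((score, old_idx) if score >= threshold else (0.0, None))

    # For each old string, remember which new string matches it most closely.
    winner = {}
    for new_idx, (score, old_idx) in enumerate(best):
        if old_idx is not None and score > winner.get(old_idx, (-1.0, None))[0]:
            winner[old_idx] = (score, new_idx)

    # Pass 2: build the merged document.
    merged = []
    for new_idx, text in enumerate(new_source):
        score, old_idx = best[new_idx]
        if old_idx is None:
            merged.append((text, None, "untranslated"))  # genuinely new string
            continue
        _, translation, state = old[old_idx]
        is_winner = winner[old_idx][1] == new_idx
        if translation is None:
            merged.append((text, None, "untranslated"))
        elif is_winner and score == 1.0:
            merged.append((text, translation, state))  # moved, state kept
        else:
            # Modified source, or a near-duplicate of an existing string.
            merged.append((text, translation, "fuzzy"))
    return merged
```

With this sketch, swapping or removing paragraphs keeps exact matches in their existing state, while an edited or duplicated paragraph receives the old translation as fuzzy.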

Comment 3 David Mason 2015-02-24 02:36:28 UTC
The quicker solution is to just use content hash as the id for every string. The advantage is that it is very little work. The disadvantage is that changing a single character in a paragraph would make it look like a totally new string to Zanata, so it would not show the history of the string and the translations would not be attached to it, so they would have to be added again using translation memory. I consider this an incomplete solution since the solution in the previous comment would provide a much better user experience, but it would be a little better than what we have right now.
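
A minimal sketch of the content-hash idea, for illustration only: the hash function (SHA-256 here), the function names and the state labels are assumptions, since this comment does not say which hash or data layout would be used.

```python
import hashlib


def content_hash_id(text):
    """Derive a string id from the paragraph content (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def reupload_by_hash(translations_by_id, new_source):
    """Look up each uploaded paragraph by its content hash.

    Position no longer matters, so reordered paragraphs keep their
    translations, but any edit produces a new id with no translation.
    """
    rows = []
    for text in new_source:
        translation = translations_by_id.get(content_hash_id(text))
        state = "translated" if translation is not None else "untranslated"
        rows.append((text, translation, state))
    return rows


stored = {content_hash_id(s): s + "'" for s in ["A", "B", "C"]}
# "C" moved up and keeps C'; "B!" differs by one character and starts over.
for row in reupload_by_hash(stored, ["A", "C", "B!"]):
    print(row)
```

This shows both sides of the trade-off: rearranged paragraphs keep their translations, while a one-character change loses the link to the old string entirely.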

Comment 4 Michelle Kim 2015-02-24 03:14:51 UTC
As this bug seems to affect many translators for Customer Portal translation, I would mark this bug as high priority and urgent.

I agree with the quicker solution David suggested for the time being, as the Zanata Drupal integration will resolve this issue in the near future.

Comment 5 Isaac Rooskov 2015-02-24 03:23:31 UTC
+1 to quick solution, as the plugin will solve this in the best way possible - once integration is completed

Comment 6 David Mason 2015-02-24 04:16:39 UTC
(In reply to Michelle Kim from comment #4)
> as the Zanata Drupal integration will resolve this issue in the near future.

Note that this issue applies to many of the formats in "File" project type, so the Drupal plugin will only provide a workaround for this very specific case. Many users will still have to deal with poor usability when merging new versions of their documents.

Comment 7 David Mason 2015-02-25 02:17:06 UTC
From development team meeting:

Planned fix as follows:

 - Use a content hash as the id for document types whose formats use positional identifiers
 - Migrate data: calculate hash for every text flow in every document that is of the types in the previous point.
 - Test that download of migrated document translations still works.
 - Do a manual release that includes only this change.

Comment 8 David Mason 2015-03-10 08:52:26 UTC
Missed in planning: id conflicts

As a side-effect of using hashed content as the id, some strings that previously had different ids based on their position will now end up with the same id, which violates a uniqueness constraint for the resource id.

There is no completely foolproof way to resolve the conflicting resource id, specifically in the case where the conflicting strings have both been translated to a different string (perhaps due to being in a different context) - there is no automated way to ensure that the best translation is chosen.

For now I will aim to resolve these conflicts during migration by just using the string nearest the beginning of the document.
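
The conflict rule above (keep the string nearest the beginning of the document) might be sketched as follows. This is not the actual migration code from the pull request; the tuple layout, the function name and the choice of SHA-256 are assumptions for the example.

```python
import hashlib


def migrate_to_hash_ids(text_flows):
    """Re-key text flows from positional ids to content-hash ids.

    text_flows: list of (position, source, translation) tuples.
    When two flows have identical source text they hash to the same id,
    so only the one nearest the beginning of the document is kept.
    """
    migrated = {}
    for position, source, translation in sorted(text_flows, key=lambda flow: flow[0]):
        new_id = hashlib.sha256(source.encode("utf-8")).hexdigest()
        # setdefault keeps the first (lowest-position) entry for each id.
        migrated.setdefault(new_id, (source, translation))
    return migrated


flows = [(2, "OK", "Oui"), (0, "Title", "Titre"), (1, "OK", "D'accord")]
for source, translation in migrate_to_hash_ids(flows).values():
    print(source, "->", translation)
```

Here the two "OK" strings collide, and the translation from position 1 is kept while the one from position 2 is discarded, which is the lossy case the comment warns about.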

Comment 9 David Mason 2015-03-11 11:33:28 UTC
Modified in: https://github.com/zanata/zanata-server/pull/724

Comment 10 Damian Jansen 2015-03-17 05:53:50 UTC
Verified (master) at 4ea3f29b862bbe4bf08a12bec629f7aecf43b7f2
