Description of problem:
My team has been working on a number of Drupal articles. Many of them are written in Markdown, which has to be saved in .txt format and manually pushed to Zanata using the UI.
When a Markdown document is edited (e.g. one string removed) and re-uploaded to Zanata, all translations below the point where the source text changed are marked as fuzzy. It seems that Zanata cannot recognize a string when even a minor change in its position occurs.
Version-Release number of selected component (if applicable): 3.5.1
How reproducible: 100%
Steps to Reproduce:
1. Create a file in Markdown markup language, save it as a .txt file, and push it to Zanata
2. Complete the translation
3. Edit the source file (e.g. remove one word), and re-upload it to Zanata using the UI

Actual results:
All strings below the section where the change in the source text was made are marked as fuzzy

Expected results:
Zanata should recognize the previous translations and keep them
This is the expected behaviour at the moment. Since there is no canonical identifier for each paragraph in a text document, the position is used instead. Adding or removing a paragraph will change the position number of every paragraph below it. Right now, Zanata interprets this as changing the source text, so it makes the translation fuzzy.
For example, if there is a document on Zanata with source and translations like so:
A -> A'
B -> B'
C -> C'
uploading a new version of the document with the last 2 paragraphs swapped would lead to the following arrangement in Zanata:
A -> A'
C -> B' (fuzzy since source text is different)
B -> C' (fuzzy since source text is different)
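The positional matching described above can be sketched in a few lines (a hypothetical illustration, not Zanata's actual code): translations stay attached to their position index, so swapping two source paragraphs misaligns every translation below the swap and flags it fuzzy.

```python
# Illustrative sketch of position-based matching. Translations are matched
# purely by index, so reordered paragraphs pick up the wrong translation.
old_sources = ["A", "B", "C"]
translations = ["A'", "B'", "C'"]
new_sources = ["A", "C", "B"]  # last two paragraphs swapped

merged = []
for pos, src in enumerate(new_sources):
    trans = translations[pos]          # matched purely by position
    fuzzy = src != old_sources[pos]    # source text at this position changed
    merged.append((src, trans, fuzzy))

# Reproduces the arrangement shown above:
# A -> A', C -> B' (fuzzy), B -> C' (fuzzy)
```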
We need a more intelligent merge algorithm to handle this in a sensible way. I designed such an algorithm some time ago, and I thought I had a bugzilla entry for it, but I cannot find it at the moment.
The basic approach is to do an initial pass over the document, finding the highest-similarity match among the current source strings for each string in the newly uploaded source. On the second pass it would:
- Move strings (and their translations) to the position of their highest match (which must be above the match threshold).
- For strings whose best match is below 100% but above the match threshold, update the source text. This will correctly make their translations fuzzy.
- For new strings that have a high match to an existing source string, but are not the highest match to that source string, copy the translations to the new string as fuzzy translations. E.g. if someone duplicated a paragraph and changed one word in it, the changed version would get a fuzzy copy of the translation.
- Remove any strings that do not have a high-level match.
- Add any new strings that do not have a high-level match with anything. These will be untranslated in all languages.
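The passes above could be sketched roughly as follows, using Python's difflib as a stand-in similarity measure. All names (merge_documents, THRESHOLD) are illustrative, not Zanata's implementation, and for brevity this sketch collapses matching and merging into one loop rather than handling the move and duplicate cases separately:

```python
# Minimal sketch of a similarity-based merge, assuming difflib's ratio()
# as the similarity metric and a hypothetical 0.8 match threshold.
from difflib import SequenceMatcher

THRESHOLD = 0.8  # hypothetical match threshold

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def merge_documents(old, new):
    """old: list of (source, translation) pairs; new: list of source strings.
    Returns a list of (source, translation, state) tuples."""
    merged = []
    for new_src in new:
        # Find the best-matching existing source string.
        best, best_score = None, 0.0
        for old_src, old_trans in old:
            score = similarity(new_src, old_src)
            if score > best_score:
                best, best_score = (old_src, old_trans), score
        if best_score == 1.0:
            # Identical source: keep the translation and its state.
            merged.append((new_src, best[1], "translated"))
        elif best_score >= THRESHOLD:
            # Moderately changed source: keep translation, mark fuzzy.
            merged.append((new_src, best[1], "fuzzy"))
        else:
            # No close match: treat as a new, untranslated string.
            merged.append((new_src, None, "untranslated"))
    return merged
```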
In most cases that would mean that adding, removing or rearranging paragraphs without changing their contents would keep translations in their "translated" or "approved" state, and it should also deal reasonably well with the case where paragraphs have been modified in addition to being rearranged.
The threshold for whether an existing string counts as a close match is the main tricky part: if it is set to a low percentage (e.g. 50%), paragraphs might end up with fuzzy translations that do not fit well, whereas if it is set to a high percentage (e.g. 99%), paragraphs with moderate modifications might lose their translations rather than keeping them as fuzzy translations. I suspect the appropriate similarity is somewhere between 70% and 90%.
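To make the trade-off concrete, here is a small illustration of similarity scores, again using Python's difflib as a stand-in metric (an assumption; the metric Zanata would actually use may differ):

```python
# Compares a one-word edit and a near-total rewrite against the same
# original paragraph, to show where a match threshold would land.
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

original   = "The quick brown fox jumps over the lazy dog."
light_edit = "The quick brown fox leaps over the lazy dog."    # one word changed
rewrite    = "A slow green turtle crawls under the sleepy cat."  # mostly new text

s_light = similarity(original, light_edit)
s_heavy = similarity(original, rewrite)

# A 99% threshold would reject even the one-word edit (its translation
# would be lost), while a 50% threshold risks attaching ill-fitting fuzzy
# translations to heavily rewritten paragraphs.
assert s_light > 0.9
assert s_heavy < 0.7
```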
The quicker solution is to just use a content hash as the id for every string. The advantage is that it is very little work. The disadvantage is that changing a single character in a paragraph would make it look like a totally new string to Zanata: the string's history would not be shown, and the translations would not be attached to it, so they would have to be re-added from translation memory. I consider this an incomplete solution, since the approach in the previous comment would provide a much better user experience, but it would still be a little better than what we have right now.
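The content-hash approach can be sketched as follows. MD5 is used here only as an example digest; whether it matches Zanata's actual choice is an assumption:

```python
# Derive the resource id from the paragraph text itself, so position no
# longer matters. Any stable digest would work; MD5 is just an example.
import hashlib

def content_id(paragraph):
    return hashlib.md5(paragraph.encode("utf-8")).hexdigest()

# Reordering paragraphs leaves every id unchanged...
assert content_id("Some paragraph") == content_id("Some paragraph")
# ...but a one-character edit yields a completely new id, detaching the
# existing translations (the disadvantage described above).
assert content_id("Some paragraph") != content_id("Some paragraph.")
```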
As this bug seems to affect many translators working on Customer Portal translation, I am marking it as high priority and urgent.
I agree with the quicker solution David suggested for the time being, as the Zanata Drupal integration will resolve this issue in the near future.
+1 to the quick solution, as the plugin will solve this in the best way possible once the integration is completed.
(In reply to Michelle Kim from comment #4)
> as the Zanata Drupal integration will resolve this issue in nearest future.
Note that this issue applies to many of the formats in "File" project type, so the Drupal plugin will only provide a workaround for this very specific case. Many users will still have to deal with poor usability when merging new versions of their documents.
From development team meeting:
Planned fix as follows:
- Use a content hash as the id for document types whose formats rely on positional identifiers.
- Migrate data: calculate the hash for every text flow in every document of those types.
- Test that downloading translations for migrated documents still works.
- Do a manual release that includes only this change.
Missed in planning: id conflicts
As a side-effect of using hashed content as the id, some strings that previously had different ids (based on their position) will now end up with the same id, which violates the uniqueness constraint on resource ids.
There is no completely foolproof way to resolve a conflicting resource id, specifically when the conflicting strings have been translated differently (perhaps because they appear in different contexts): there is no automated way to ensure that the better translation is chosen.
For now I will aim to resolve these conflicts during migration by just using the string nearest the beginning of the document.
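That conflict-resolution rule could be sketched like this (hypothetical names, not the actual migration code):

```python
# During migration, when two text flows hash to the same id, keep only the
# occurrence nearest the beginning of the document and drop later duplicates.
import hashlib

def migrate_ids(paragraphs):
    """paragraphs: source strings in document order.
    Returns (id, text) pairs with later duplicate ids dropped."""
    seen = set()
    result = []
    for text in paragraphs:
        new_id = hashlib.md5(text.encode("utf-8")).hexdigest()
        if new_id in seen:
            continue  # conflict: the earliest occurrence wins
        seen.add(new_id)
        result.append((new_id, text))
    return result

doc = ["Intro", "Repeated note", "Body", "Repeated note"]
migrated = migrate_ids(doc)
assert [text for _, text in migrated] == ["Intro", "Repeated note", "Body"]
```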
Modified in: https://github.com/zanata/zanata-server/pull/724
Verified (master) at 4ea3f29b862bbe4bf08a12bec629f7aecf43b7f2