Bug 1204526 - RFE: Reuse Translation Memory when only HTML/XML tags are changed
Summary: RFE: Reuse Translation Memory when only HTML/XML tags are changed
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Zanata
Classification: Retired
Component: Component-Logic
Version: development
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Patrick Huang
QA Contact: Zanata-QA Mailing List
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-03-22 22:50 UTC by Isaac Rooskov
Modified: 2015-08-06 05:55 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-07-29 03:32:12 UTC
Embargoed:



Description Isaac Rooskov 2015-03-22 22:50:02 UTC
Description of problem:

The Documentation team is moving away from using XML as the source for their content. Translation will be working on the final HTML versions of docs.

Currently most strings in the TM have XML tags in them, so reuse will be heavily reduced when moving to HTML. This means that unless something is done, where we would get approximately 90% reuse on docs, we would have to start from 0%; basically, we would have to start building a TM from scratch.

We need to find a way to ensure current TM entries can be reused when we move to HTML content, which has HTML tags instead of XML tags.

Extra info:
Docs is migrating NOW, so it would be great to have this ASAP.

Comment 1 Michelle Kim 2015-03-22 23:18:42 UTC
Hi Isaac,

Thanks for creating this RFE. I will treat this as an urgent bug, discuss it with the team, and let you know once we are done with triage.

Comment 2 David Mason 2015-03-25 01:33:25 UTC
> Documentation is moving away from using XML as the source for their content.
> Translation will be working on the final HTML versions of docs.

Is this a move from DocBook XML to the compiled HTML?


> Currently most strings in the TM have XML tags in them, thus reuse will be
> heavily reduced when moving to HTML.

As long as there is still matching content, there should still be a fairly high TM match percentage. Short strings that are made up mainly of XML tags would be the worst affected.

By translation reuse, are you talking about TM and translation copy matches (e.g. TM merge and copytrans), or uploading the new documents and having the existing translations carried over automatically?

Can you give a couple of examples of strings with XML, and the equivalent string in HTML?

Comment 3 Isaac Rooskov 2015-03-30 01:23:11 UTC
Hey David, 

Yes, this concerns moving from translating strings where the source is DocBook XML to translating the same strings where the source is HTML (as available to customers on access.redhat.com).

My current understanding is that the entries in our TM contain a lot of XML tags, and so when we start translating HTML, previous translations wouldn't show as a match and translators would have to spend a lot of time modifying existing translations for the new HTML format. 

I'm not sure I can answer your translation reuse question beyond providing the content above; however, you can speak with Yuko, as I know she is already working between DocBook XML and the new HTML, so she will be able to better articulate the issues we think we will face.

Comment 4 David Mason 2015-03-30 03:06:52 UTC
Note that the obvious workaround of outputting the translated version of the html and uploading it is not possible because of the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1194543

To allow upload of translations for any format that does not have a natural unique ID for strings, we need to implement the more sophisticated fix mentioned in that bug, and change back to using position for the ID of the strings.

Comment 5 David Mason 2015-03-31 03:12:49 UTC
Notes from discussion with translators:

This is a general problem with switching between formats that mark up text in a different way, and also one that limits copytrans and translation memory matches.


## The Challenge

By way of a short example, suppose there are three formats that add emphasis to text in different ways (perhaps DocBook, HTML and Markdown):

 - Who <emphasis>are</emphasis> you?
 - Who <strong>are</strong> you?
 - Who **are** you?

A human can tell that these are all the same string, but Zanata in its current form sees them as different (with similarity of ~50%). None of these would give a high match on translation memory for each other. All of Zanata's copy translation features would treat these strings as too different to reuse.


## Solution

The best way to solve this is to make Zanata aware of which parts of each string are translatable text, and which are metadata that should not be modified. The tags and variables would be treated as consistent placeholders whenever checking for similarity, and when uploading a new version of a source string.

Suppose that a project ("Project 1") has this:

source: Who <emphasis>are</emphasis> you?
translation: Wer <emphasis>bist</emphasis> du?

A different project ("Project 2") has this:

source: Who <strong>are</strong> you?
translation: (empty)


A translator in Project 2 selects this string and Zanata does a translation memory (TM) lookup. Zanata should compare the strings using generic placeholders for the tags. I am using [#] here, but the implementation would use code points from the Private Use Area Unicode block:

Search string in Project 2:

  Who [1]are[2] you?

    [1]: <strong>
    [2]: </strong>

From Project 1:

  Who [1]are[2] you?

    [1]: <emphasis>
    [2]: </emphasis>

This gives a 100% "different-tags" match. The only match that should sort higher than this in the results would be a match that has the same string and the same tags.
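
To make the comparison concrete, here is a minimal sketch of this normalization in Java (class and method names are hypothetical; as noted above, a real implementation would substitute private-use code points rather than visible "[n]" markers):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class PlaceholderNormalizer {

      // Matches XML/HTML-style tags; a real implementation would need
      // format-specific rules (e.g. also Markdown markers like "**").
      private static final Pattern TAG = Pattern.compile("</?[a-zA-Z][^>]*>");

      /** Skeleton text with each tag replaced by [1], [2], ... plus the extracted tags. */
      public static Normalized normalize(String text) {
          List<String> tags = new ArrayList<>();
          StringBuffer skeleton = new StringBuffer();
          Matcher m = TAG.matcher(text);
          while (m.find()) {
              tags.add(m.group());
              m.appendReplacement(skeleton, "[" + tags.size() + "]");
          }
          m.appendTail(skeleton);
          return new Normalized(skeleton.toString(), tags);
      }

      public record Normalized(String skeleton, List<String> tags) {}

      public static void main(String[] args) {
          Normalized p1 = normalize("Who <emphasis>are</emphasis> you?");
          Normalized p2 = normalize("Who <strong>are</strong> you?");
          // Both skeletons are "Who [1]are[2] you?", so the strings are a
          // 100% "different-tags" match even though the raw text differs.
          System.out.println(p1.skeleton().equals(p2.skeleton())); // prints: true
      }
  }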

Next, the user copies the TM match from Project 1. Zanata should copy the translation string, but use the tags from the source in Project 2:

Translation from Project 1:

  Wer [1]bist[2] du?

    [1]: <emphasis>
    [2]: </emphasis>

Copied to Project 2 as:

  Wer [1]bist[2] du?

    [1]: <strong>
    [2]: </strong>

  Appearance in editor: Wer <strong>bist</strong> du?
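
Continuing the hypothetical sketch above, the copy step could keep the translation's skeleton and fill the placeholders with the target project's tags (this assumes the source skeletons already matched, so the tag counts line up; a real implementation would also handle mismatched or reordered tags):

  /** Could live in the PlaceholderNormalizer sketch above. */
  public static String copyWithTargetTags(Normalized tmTranslation, List<String> targetTags) {
      String result = tmTranslation.skeleton();
      for (int i = 0; i < targetTags.size(); i++) {
          result = result.replace("[" + (i + 1) + "]", targetTags.get(i));
      }
      return result;
  }

  // copyWithTargetTags(normalize("Wer <emphasis>bist</emphasis> du?"),
  //                    List.of("<strong>", "</strong>"))
  // returns "Wer <strong>bist</strong> du?"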


This should be displayed in the editor with the tags visible, but they should generally act like a single character, so the whole tag is selected or deleted at once. Tags could be inserted by autocomplete, drag-and-drop, copy-from-source, or copy-from-TM.


## Implementation Challenge

The main challenge is identifying which parts of the string should be placeholders. Necessary information:

 - What format does the document use?
 - Which elements are translatable for that format?

If we have the string:

  <a href="http://www.example.com" title="My favourite website.">Example.com</a>

we must determine that it is an HTML string, and know that the title attribute "My favourite website." is translatable, and the content of the <a> tag "Example.com" is translatable, but that the href attribute is not.
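
As a sketch of what such format knowledge could look like (the rule table below is a hypothetical illustration, not a proposal for the actual rule set):

  import java.util.Map;
  import java.util.Set;

  /**
   * Sketch: per-format rules for which parts of markup carry translatable
   * text. The element/attribute lists are illustrative only.
   */
  public class TranslatableRules {

      // HTML attributes whose values should be offered for translation.
      private static final Map<String, Set<String>> HTML_TRANSLATABLE_ATTRS = Map.of(
              "a", Set.of("title"),
              "img", Set.of("alt", "title"),
              "input", Set.of("placeholder", "title"));

      public static boolean isTranslatableAttr(String element, String attribute) {
          return HTML_TRANSLATABLE_ATTRS
                  .getOrDefault(element.toLowerCase(), Set.of())
                  .contains(attribute.toLowerCase());
      }

      public static void main(String[] args) {
          System.out.println(isTranslatableAttr("a", "title")); // true  -> translate it
          System.out.println(isTranslatableAttr("a", "href"));  // false -> part of the placeholder
      }
  }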

Other than these two challenges, everything else is just boilerplate database and UI work to store and display the new representations of strings.


## Other thoughts

There are numerous benefits in addition to better TM matches and reuse:

 - Non-translatable tags/variables could easily be shown with different appearance.
 - Easier for translators to avoid accidentally changing tags and variables.
 - Translators could collapse all tags to a single character so they do not distract from the content.
 - Translation Copy, version merge, etc. could use the same approach to achieve higher 100% match reuse.

Comment 6 Yuko Katabami 2015-04-23 04:52:06 UTC
The newly proposed publishing flows for the English docs will follow these patterns:


Existing books: DocBook XML => Drupal

Existing books converted: DocBook XML => AsciiDoc => DocBook XML => Drupal
Newly written books: AsciiDoc => DocBook XML => Drupal

Existing books converted: DocBook XML => Markdown => DocBook XML => Drupal
Newly written books: Markdown => DocBook XML => Drupal

Existing articles converted: Markdown => AsciiDoc => DocBook XML => Drupal
(this is a use case for the RHELOSP docs, which plan to combine articles previously written in Markdown into a book)


N.B. Markdown support to be introduced at a later stage, after AsciiDoc.

So it seems that all books and articles will be published from XML.

The push to Zanata can then be made as XML; however, if the content is converted back to XML from the other formats, the tags may be different from the ones used originally and will not match the TM.

Could generic placeholders for the tags be the solution for this case as well?

Comment 7 David Mason 2015-04-23 23:14:06 UTC
(In reply to Yuko Katabami from comment #6)
> So it seems that all books and articles will be published from xml.
> 
> The push to zanata can then be made as xml, however if it is reverted to xml
> from other formats, the tags may be different from the ones used originally
> and they won't match the TM. 
> 
> Generic placeholders for the tags might be the solution for this case as
> well?

Yes, I think generic placeholders would work well in this case, since it is another case of having tags in the same place, but possibly different tags.

I would expect some proportion of the strings to have "exact same-tags" match (which is the same as 100% match right now), and some proportion that would have "exact different-tags" match (which would not have 100% match right now).

Comment 8 Yuko Katabami 2015-04-24 22:34:30 UTC
This has an added benefit in cases where a writer has replaced a tag with a different one: the previous translation can be reused as a 100% match, provided the generic placeholders are the same.

How difficult would this be to implement?
Would it take a long time?

Comment 9 Michelle Kim 2015-04-26 23:47:20 UTC
Hi Yuko

Thanks for all your input. I would like to discuss implementation details with the team this Wednesday. Would you be able to join in? If not, we can discuss first and update this bug for your feedback.

Comment 11 David Mason 2015-05-20 01:39:10 UTC
Prototype tag-replacement matching.

Assume we have:

 - source with tags
 - translation of that source with the same tags in any order
 - a source that is identical except for tags (same structure and text, but different tag names).

We need this data from translators before we start prototyping.

Then prototype generating a translation for the different-tags source, based on the existing translation and the tags of the new source.

After the translation generation, try to make a Lucene analyzer to make sure it can behave sensibly.
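
One possible shape for that analyzer (a hypothetical sketch against the Lucene 5+ API; the real analyzer would need format-aware tag detection rather than a single regex):

  import java.io.Reader;
  import java.util.regex.Pattern;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.core.WhitespaceTokenizer;
  import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;

  /**
   * Sketch: collapses markup tags to a single private-use code point before
   * tokenizing, so same-text/different-tags strings index and match identically.
   */
  public class TagPlaceholderAnalyzer extends Analyzer {

      private static final Pattern TAG = Pattern.compile("</?[a-zA-Z][^>]*>");

      @Override
      protected Reader initReader(String fieldName, Reader reader) {
          // Replace every tag with U+E000 (Private Use Area) before tokenizing.
          return new PatternReplaceCharFilter(TAG, "\uE000", reader);
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
          return new TokenStreamComponents(new WhitespaceTokenizer());
      }
  }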


## Technical notes:

 - We could make a prototype as a set of unit tests that run through different combinations of strings (see the sketch after this list).
 - Zanata's HTML adapter does some manipulation of generic tags that are provided by okapi.
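
A sketch of a starting point for that unit-test prototype, built on the hypothetical PlaceholderNormalizer above (JUnit 4 assumed):

  import static org.junit.Assert.assertEquals;

  import org.junit.Test;

  public class TagReplacementMatchTest {

      @Test
      public void sameTextWithDifferentTagsIsAnExactDifferentTagsMatch() {
          // The normalized skeletons should be identical, giving a
          // 100% "different-tags" match.
          assertEquals(
              PlaceholderNormalizer.normalize("Who <emphasis>are</emphasis> you?").skeleton(),
              PlaceholderNormalizer.normalize("Who <strong>are</strong> you?").skeleton());
      }
  }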

Comment 15 Zanata Migrator 2015-07-29 03:32:12 UTC
Migrated; check JIRA for bug status: http://zanata.atlassian.net/browse/ZNTA-312

