Bug 1057420

Summary: REST APIs generate illegal XML when files contain invalid characters like 0x1b, 0x08
Product: [Retired] Zanata
Reporter: Patrick Huang <pahuang>
Component: Component-Maven, Component-Logic, Component-PythonClient, Component-zanata-client
Assignee: Patrick Huang <pahuang>
Status: CLOSED UPSTREAM
QA Contact: Zanata-QA Mailling List <zanata-qa>
Severity: high
Priority: medium
Version: 3.1
CC: ccheng, djansen, sflaniga, sshedmak, zanata-bugs
Hardware: Unspecified
OS: Linux
Doc Type: Bug Fix
Last Closed: 2015-07-31 01:47:52 UTC
Attachments:
  Test file (flags: none)

Description Patrick Huang 2014-01-24 04:42:36 UTC
User-Agent:       Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.63 Safari/537.31
Build Identifier: 

When pushing or pulling with the Zanata Maven client, RESTEasy marshalling/unmarshalling fails if any text flow contains the Unicode character 0x1b (ESCAPE). Uploading the same file through the server UI does not trigger the problem.

Reproducible: Always

Steps to Reproduce:
1. Create a gettext project/version.
2. Run mvn zanata:push on a source file that contains the character 0x1b (see the attached test file).

Actual Results:  
org.jboss.resteasy.plugins.providers.jaxb.JAXBUnmarshalException: javax.xml.bind.UnmarshalException
 - with linked exception:
[org.xml.sax.SAXParseException; lineNumber: 325; columnNumber: 7; An invalid XML character (Unicode: 0x1b) was found in the element content of the document.]


Expected Results:  
The push succeeds.

Note: the server's RESTEasy version differs from the client's.

Comment 1 Patrick Huang 2014-01-24 04:43:40 UTC
Created attachment 854745 [details]
Test file

This is a cut-down version of a production file.

Comment 2 Sean Flanigan 2014-01-28 01:48:29 UTC
Patrick, regarding the ESCAPE char handling, see https://java.net/jira/browse/JAXB-614. Note that the workaround in the first comment there is no good to us, because it irreversibly converts every illegal character into the same character.

Apparently there is no good way to represent control characters in plain XML, even with CDATA. We need either an alternative/extended XML schema for our REST service that performs, e.g., base64 encoding/decoding of control characters, or we could push POT/PO files directly to Zanata for processing on the server, bypassing the XML stage entirely.

In the meantime, we should make sure we detect these control characters before JAXB goes and generates illegal XML.
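
A minimal sketch of such a pre-marshalling check, assuming the XML 1.0 "Char" production is the rule to enforce. The class and method names (XmlCharCheck, firstIllegalIndex, requireXmlSafe) are hypothetical and are not part of Zanata, RESTEasy or JAXB; where such a check would hook into the client is left open.

```java
// Hypothetical helper, not Zanata code: rejects text that XML 1.0 cannot carry.
final class XmlCharCheck {
    private XmlCharCheck() {}

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first illegal character, or -1 if there is none. */
    static int firstIllegalIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isLegalXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    /** Fails fast with a readable message instead of letting illegal XML be generated. */
    static void requireXmlSafe(String text) {
        int i = firstIllegalIndex(text);
        if (i >= 0) {
            throw new IllegalArgumentException(String.format(
                    "Illegal XML character U+%04X at index %d", text.codePointAt(i), i));
        }
    }

    public static void main(String[] args) {
        System.out.println(firstIllegalIndex("plain text"));    // -1
        System.out.println(firstIllegalIndex("\u001b[31mred")); // 0
    }
}
```

Calling something like requireXmlSafe on each text flow before marshalling would turn the SAXParseException seen on the receiving side into an earlier, clearer error on the sending side. It does not make the content transferable.
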

As a workaround, I would recommend separating the non-translatable text[1] from the translatable text (e.g. names of colours).  Escape characters and ANSI sequences are likely to be difficult for translators to deal with anyway, because the editor may not be able to display the escape character well.

[1] including ANSI codes, any other control codes and command-line keywords like "status-line" and "save-confirmation"

Comment 3 Patrick Huang 2014-04-27 23:11:40 UTC
If we use JSON instead, will it help?

Comment 4 Sean Flanigan 2014-04-28 01:31:31 UTC
Good idea.  Yes, it's worth a try.

JSON can probably escape any problematic characters.  There may be portability issues with some characters, but we should be able to choose implementations which are compatible:

http://stackoverflow.com/a/8676021/14379
https://en.wikipedia.org/wiki/JSON#Data_portability_issues
http://www.bennadel.com/blog/2576-testing-which-ascii-characters-break-json-javascript-object-notation-parsing.htm
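
To illustrate why JSON can carry these characters, here is a small hand-written sketch of JSON string escaping. JSON requires the quote, the backslash and the control characters U+0000 to U+001F to be escaped, so ESCAPE and BACKSPACE survive as \u001b and \u0008 and can be decoded back without loss. The JsonEscape class is hypothetical and only for illustration; a real client would presumably rely on whichever JSON library it already uses.

```java
// Hypothetical illustration, not Zanata code: minimal JSON string quoting.
final class JsonEscape {
    private JsonEscape() {}

    /** Wraps text in double quotes, escaping what JSON requires to be escaped. */
    static String quote(String text) {
        StringBuilder out = new StringBuilder(text.length() + 2).append('"');
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become a six-character escape.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        // The ESC character of an ANSI colour sequence is written as a visible escape.
        System.out.println(quote("\u001b[31mred"));
    }
}
```

The sketch leaves characters above U+001F untouched, which is where the portability issues mentioned in the links above may still arise.
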

Comment 5 Chester Cheng 2014-09-18 01:56:50 UTC
I got a similar error, because of a hidden character in the translation.

==========
$ mvn org.zanata:zanata-maven-plugin:pull -Dzanata.encodeTabs=false
(...)
[ERROR] Operation failed: javax.xml.bind.UnmarshalException
 - with linked exception:
[org.xml.sax.SAXParseException; lineNumber: 2; columnNumber: 35273; An invalid XML character (Unicode: 0x8) was found in the element content of the document.]

    To retry from the last document, please set the following option(s):

        -Dzanata.fromDoc="Memory"

.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 19.387 s
[INFO] Finished at: 2014-09-18T11:44:11+10:00
[INFO] Final Memory: 19M/170M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.zanata:zanata-maven-plugin:3.3.2:pull (default-cli) on project standalone-pom: Zanata mojo exception: javax.xml.bind.UnmarshalException
[ERROR] - with linked exception:
[ERROR] [org.xml.sax.SAXParseException; lineNumber: 2; columnNumber: 35273; An invalid XML character (Unicode: 0x8) was found in the element content of the document.]
[ERROR] -> [Help 1]
[ERROR]

Comment 6 Sean Flanigan 2014-09-18 02:00:06 UTC
Just for reference, the workaround was to download the affected document from the web interface (fortunately, it was a PO file, so it could be downloaded that way) and search for the offending character:

    grep --color='auto' -P -n '\x08' *.po
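
The grep above only finds 0x08. As a rough generalisation, the following sketch reports every character that XML 1.0 does not allow, with line and column, for each file given on the command line. It assumes the files are UTF-8 encoded; the IllegalCharScanner class is hypothetical and not part of any Zanata tool.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Zanata code: lists XML-illegal characters in text files.
final class IllegalCharScanner {
    private IllegalCharScanner() {}

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns one "line:column: U+XXXX" entry per illegal character (1-based positions). */
    static List<String> scan(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (int lineNo = 0; lineNo < lines.size(); lineNo++) {
            String line = lines.get(lineNo);
            for (int i = 0; i < line.length(); ) {
                int cp = line.codePointAt(i);
                if (!isLegalXmlChar(cp)) {
                    hits.add(String.format("%d:%d: U+%04X", lineNo + 1, i + 1, cp));
                }
                i += Character.charCount(cp);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        for (String file : args) {
            // Assumption: the file is UTF-8; other encodings may fail to decode.
            List<String> lines = Files.readAllLines(Paths.get(file), StandardCharsets.UTF_8);
            for (String hit : scan(lines)) {
                System.out.println(file + ":" + hit);
            }
        }
    }
}
```

On Java 11 or later this could be run directly as `java IllegalCharScanner.java *.po`.
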

Comment 7 Zanata Migrator 2015-07-31 01:47:52 UTC
Migrated; check JIRA for bug status: http://zanata.atlassian.net/browse/ZNTA-543