Bug 1057420
| Summary: | REST APIs generate illegal XML when files contain invalid characters like 0x1b, 0x08 | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [Retired] Zanata | Reporter: | Patrick Huang <pahuang> | ||||
| Component: | Component-Maven, Component-Logic, Component-PythonClient, Component-zanata-client | Assignee: | Patrick Huang <pahuang> | ||||
| Status: | CLOSED UPSTREAM | QA Contact: | Zanata-QA Mailing List <zanata-qa> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 3.1 | CC: | ccheng, djansen, sflaniga, sshedmak, zanata-bugs | ||||
| Target Milestone: | --- | ||||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2015-07-31 01:47:52 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
Description
Patrick Huang
2014-01-24 04:42:36 UTC
Created attachment 854745
Test file
This is a cut-down version of a production file.
Patrick, regarding the ESCAPE char handling, see https://java.net/jira/browse/JAXB-614 . Note that the workaround in the first comment is no good to us, because it irreversibly converts all illegal chars into the same char.

It seems that there is no good way of representing control characters in plain XML, even with CDATA (apparently). We either need an alternative/extended XML schema for our REST service which performs (e.g.) base64 encoding/decoding for control chars, or we could just push POT/PO files directly to Zanata for processing on the server, thus bypassing the XML stage entirely. In the meantime, we should make sure we detect these control characters before JAXB goes and generates illegal XML.

As a workaround, I would recommend separating the non-translatable text[1] from the translatable text (e.g. names of colours). Escape characters and ANSI sequences are very likely to be difficult for translators to deal with anyway, because the editor may not be able to show the escape character very well.

[1] including ANSI codes, any other control codes and command-line keywords like "status-line" and "save-confirmation"

If we use JSON instead, will it help?

Good idea. Yes, it's worth a try. JSON can probably escape any problematic characters. There may be portability issues with some characters, but we should be able to choose implementations which are compatible:
http://stackoverflow.com/a/8676021/14379
https://en.wikipedia.org/wiki/JSON#Data_portability_issues
http://www.bennadel.com/blog/2576-testing-which-ascii-characters-break-json-javascript-object-notation-parsing.htm

I got a similar error, because of a hidden character in the translation.
==========
$ mvn org.zanata:zanata-maven-plugin:pull -Dzanata.encodeTabs=false
(...)
[ERROR] Operation failed: javax.xml.bind.UnmarshalException
- with linked exception:
[org.xml.sax.SAXParseException; lineNumber: 2; columnNumber: 35273; An invalid XML character (Unicode: 0x8) was found in the element content of the document.]
To retry from the last document, please set the following option(s):
-Dzanata.fromDoc="Memory"
.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 19.387 s
[INFO] Finished at: 2014-09-18T11:44:11+10:00
[INFO] Final Memory: 19M/170M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.zanata:zanata-maven-plugin:3.3.2:pull (default-cli) on project standalone-pom: Zanata mojo exception: javax.xml.bind.UnmarshalException
[ERROR] - with linked exception:
[ERROR] [org.xml.sax.SAXParseException; lineNumber: 2; columnNumber: 35273; An invalid XML character (Unicode: 0x8) was found in the element content of the document.]
[ERROR] -> [Help 1]
[ERROR]
Just for reference, the workaround was to download the affected document from the web interface (fortunately, it was a PO file, so it could be downloaded that way) and search for the offending character:
grep --color='auto' -P -n '\x08' *.po
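For anyone who wants to check more than one specific character, here is a minimal Python sketch of the detection step discussed earlier in this bug. It is not Zanata code: the function name `find_illegal_xml_chars` and the sample PO text are hypothetical, and the character set is taken from the `Char` production of XML 1.0 (C0 controls other than tab, LF and CR, plus surrogates, U+FFFE and U+FFFF). The last lines illustrate the JSON idea from the comments above: Python's `json` module escapes these characters and restores them on load, which is only evidence for that one implementation.

```python
import json
import re

# Characters XML 1.0 cannot carry: C0 controls except tab (0x09), LF (0x0A)
# and CR (0x0D), plus surrogates and the non-characters U+FFFE and U+FFFF.
ILLEGAL_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def find_illegal_xml_chars(text):
    """Return a (line, column, code point) tuple for each illegal character.

    Hypothetical helper for illustration; lines and columns are 1-based.
    """
    hits = []
    # Split on "\n" only: str.splitlines() would also split on some of the
    # control characters we are trying to report (e.g. 0x0B, 0x0C, 0x1C).
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in ILLEGAL_XML_CHARS.finditer(line):
            hits.append((line_no, match.start() + 1, ord(match.group())))
    return hits


# Made-up PO fragment containing ESC (0x1b) and BACKSPACE (0x08).
sample = 'msgid "\x1b[31mred\x1b[0m"\nmsgstr "ro\x08t"\n'

for line_no, column, code_point in find_illegal_xml_chars(sample):
    print("line %d, column %d: 0x%02x" % (line_no, column, code_point))

# JSON escapes the same characters and restores them when decoded.
encoded = json.dumps(sample)
print(encoded)
print(json.loads(encoded) == sample)
```

Running a check like this on the source files before a push would report every offending character with its position, rather than failing later in the XML parser on the first one.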
Migrated; check JIRA for bug status: http://zanata.atlassian.net/browse/ZNTA-543