| Summary: | Stored topic XML should be UTF-8 | | |
|---|---|---|---|
| Product: | [Community] PressGang CCMS | Reporter: | Dana Mison <dmison> |
| Component: | Web-UI | Assignee: | Matthew Casperson <mcaspers> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Dana Mison <dmison> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 1.x | CC: | cbredesen, dryan, jwulf, sgordon, topic-tool-list |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-10-05 05:35:14 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Dana Mison
2011-09-06 06:25:42 UTC
*** Bug 741505 has been marked as a duplicate of this bug. ***

In fact, Publican cannot build UTF-16 encoded topics:

    encoding specified in XML declaration is incorrect at line 1, column 30, byte 30:
    <?xml version="1.0" encoding="UTF-16"?>
    =============================^

After investigating an issue Josh was having exporting topics from Skynet over HTTP using the topic tool, it seems to me that the content itself is actually UTF-8 encoded anyway (I appear to get an error on the first character outside of the header because it's the UTF-8 encoding for a carriage return rather than UTF-16). The header needs to be updated to match the content, so UTF-8.

Further investigation has shown I still get errors on these topics even after trying a few different common encodings in the header (utf-8, us-ascii). I'm not entirely sure why, but Xerces chokes on them with this SAXParseException:

    [Fatal Error] 32.xml:1:40: Content is not allowed in prolog.

Normally this would mean there are errant characters in front of the header, but this doesn't appear to be the case. I have noticed that DOS-style newlines are also in use, but after running the file through dos2unix I still encounter the same issue.

The "Content is not allowed in prolog" error is because of a mismatch of encodings. A UTF-8 document that says it is encoded in UTF-16 will actually be interpreted as having a space before each character. So:

    <?xml version="1.0"?>

will be read as:

    < ? x m l v e r s i o n = " 1 . 0 " ? >

The leading space is then reported as errant characters in front of the header.

Unfortunately the Java XML classes will always serialize an in-memory XML document to a string that says it is encoded with UTF-16 (see http://www.w3.org/TR/DOM-Level-3-LS/load-save.html#LS-LSSerializer-writeToString). The workaround is to do a simple find and replace to force the encoding to UTF-8 after the XML DOM object has been serialized to a string.
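For illustration, the find-and-replace workaround described above could look roughly like the following. This is only a sketch and not the actual fix that went into Skynet: the class name `XmlEncodingFix` and both method names are hypothetical, and it assumes the JDK's default DOM implementation supports the "LS" 3.0 feature.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical helper class; names are illustrative only.
public class XmlEncodingFix {

    // Serialize a DOM document to a string with the DOM Level 3 Load and Save API.
    // writeToString() declares UTF-16 in the XML declaration of the result.
    public static String serialize(Document doc) {
        DOMImplementationLS ls =
            (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
        LSSerializer serializer = ls.createLSSerializer();
        return serializer.writeToString(doc);
    }

    // Rewrite only the encoding named in a leading XML declaration.
    // Text after the declaration is left untouched.
    public static String forceUtf8Declaration(String xml) {
        return xml.replaceFirst(
            "(?i)^(<\\?xml[^>]*?encoding=[\"'])UTF-16([\"'])",
            "$1UTF-8$2");
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(
                "<section><title>Test</title></section>")));

        String raw = serialize(doc);
        String fixed = forceUtf8Declaration(raw);

        System.out.println(raw);   // declaration as produced by the serializer
        System.out.println(fixed); // declaration rewritten to UTF-8
    }
}
```

The string then has to be written out as UTF-8 bytes (for example with `getBytes(StandardCharsets.UTF_8)`) so that the declaration and the stored content agree.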
Fixed in 20111004-0900

I thought regardless of whether the content is UTF-8 or UTF-16 the header was supposed to be ASCII?

Confirming UTF-8 on header output. Skynet internal tools DocBook export now building correctly with Publican.
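A check along these lines could be used to confirm the header on exported topics. It is a sketch under assumptions, not part of the Skynet tooling: the class `Utf8HeaderCheck` and its method names are hypothetical. It tests that the bytes decode strictly as UTF-8 and that the XML declaration names UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical verification helper; names are illustrative only.
public class Utf8HeaderCheck {

    // True if the bytes are well-formed UTF-8 (malformed input is reported, not replaced).
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // True if the document starts with an XML declaration that names UTF-8.
    // Only the first 100 bytes are inspected, read as single-byte characters.
    public static boolean declaresUtf8(byte[] bytes) {
        String head = new String(bytes, 0, Math.min(bytes.length, 100),
                                 StandardCharsets.ISO_8859_1);
        return head.matches("(?is)^<\\?xml[^>]*encoding=[\"']UTF-8[\"'][^>]*\\?>.*");
    }
}
```

A topic passes when both methods return true for the exported bytes; a UTF-16 declaration or a byte sequence that is not valid UTF-8 fails one of the two checks.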