Bug 735904 - Stored topic XML should be UTF-8
Summary: Stored topic XML should be UTF-8
Alias: None
Product: PressGang CCMS
Classification: Community
Component: Web-UI
Version: 1.x
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Matthew Casperson
QA Contact: Dana Mison
Duplicates: 741505
Depends On:
Reported: 2011-09-06 06:25 UTC by Dana Mison
Modified: 2014-08-04 22:26 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2011-10-05 05:35:14 UTC

Description Dana Mison 2011-09-06 06:25:42 UTC
20110901-0957 /CustomSearchTopics.xhtml

Currently the generated XML header on topicXML declares UTF-16 encoding. This could be problematic, as most tools generate text files as UTF-8, which creates a conflict when the files are processed as XML.

UTF-8 is more widely used and less likely to cause problems in the future.

Comment 1 Joshua Wulf 2011-09-27 05:30:23 UTC
*** Bug 741505 has been marked as a duplicate of this bug. ***

Comment 2 Joshua Wulf 2011-09-27 05:31:44 UTC
In fact, Publican cannot build UTF-16 encoded topics:

encoding specified in XML declaration is incorrect at line 1, column 30, byte
<?xml version="1.0" encoding="UTF-16"?>

Comment 3 Stephen Gordon 2011-09-27 23:49:22 UTC
After investigating an issue Josh was having exporting topics from Skynet over HTTP using the topic tool, it seems to me that the content itself is actually UTF-8 encoded anyway. (I get an error on the first character outside of the header, apparently because it is the UTF-8 encoding of a carriage return rather than the UTF-16 one.) The header needs to be updated to match the content, so UTF-8.

Comment 4 Stephen Gordon 2011-09-28 01:05:15 UTC
Further investigation has shown that I still get errors on these topics even after trying a few different common encodings in the header (utf-8, us-ascii). I'm not entirely sure why, but Xerces chokes on them with this SAXParseException:

[Fatal Error] 32.xml:1:40: Content is not allowed in prolog.

Normally this would mean there are errant characters in front of the header, but this doesn't appear to be the case. I have noticed that DOS-style newlines are also in use, but even after running the file through dos2unix I still encounter the same issue.

Comment 5 Matthew Casperson 2011-10-03 22:54:58 UTC
The "Content is not allowed in prolog" error is because of a mismatch of encodings. A UTF-8 document that says it is encoded in UTF-16 will actually be interpreted as having a space before each character. So:

<?xml version="1.0"?>

will be read as:

 < ? x m l   v e r s i o n = " 1 . 0 " ? >

The leading space is then reported as an errant character in front of the header.
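
To illustrate the mismatch, here is a minimal Java sketch that decodes the same UTF-8 bytes once as UTF-8 and once as UTF-16. The class and method names are made up for this example. It only shows that the text decoded with the wrong charset no longer starts with `<?xml`; exactly how a given parser reads and reports the garbled input may differ from this.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Skynet or PressGang.
class EncodingMismatchDemo {

    // Decode raw bytes using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String header = "<?xml version=\"1.0\"?>";
        byte[] utf8Bytes = header.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in round-trips.
        String asUtf8 = decode(utf8Bytes, StandardCharsets.UTF_8);
        // Decoding the same bytes as UTF-16 yields different characters.
        String asUtf16 = decode(utf8Bytes, StandardCharsets.UTF_16);

        System.out.println("UTF-8 round trip intact: " + asUtf8.equals(header));
        System.out.println("UTF-16 decode starts with <?xml: " + asUtf16.startsWith("<?xml"));
    }
}
```

Either way, once the declared encoding and the real encoding disagree, the parser is no longer looking at the characters that were written.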

Unfortunately the Java XML classes will always serialize an in-memory XML document to a string that says it is encoded with UTF-16 - http://www.w3.org/TR/DOM-Level-3-LS/load-save.html#LS-LSSerializer-writeToString.

The workaround is to do a simple find and replace to force the encoding to UTF-8 after the XML DOM object has been serialized to a string.
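
A rough sketch of that workaround follows. It is not the actual Skynet code: the class name `Utf8HeaderFix`, the helper names and the regular expression are assumptions chosen for illustration. It serializes a DOM document with `LSSerializer.writeToString` and then rewrites only the encoding named in the XML declaration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical helper, not the real Skynet implementation.
class Utf8HeaderFix {

    // Serialize the document to a string; the declaration produced by
    // writeToString names UTF-16 as the encoding.
    static String serialize(Document doc) {
        DOMImplementationLS ls =
                (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
        LSSerializer serializer = ls.createLSSerializer();
        return serializer.writeToString(doc);
    }

    // Replace UTF-16 with UTF-8 in the XML declaration only, leaving any
    // occurrence of "UTF-16" in the document body untouched.
    static String forceUtf8(String xml) {
        return xml.replaceFirst(
                "^(<\\?xml[^>]*?encoding=[\"'])UTF-16([\"'])", "$1UTF-8$2");
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElement("section"));

        String raw = serialize(doc);
        System.out.println(raw);
        System.out.println(forceUtf8(raw));
    }
}
```

The string then still has to be written out as UTF-8 bytes (for example with `getBytes(StandardCharsets.UTF_8)`) so that the stored content matches the corrected header.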

Comment 6 Matthew Casperson 2011-10-03 23:03:48 UTC
Fixed in 20111004-0900

Comment 7 Stephen Gordon 2011-10-04 01:15:02 UTC
I thought regardless of whether the content is UTF-8 or UTF-16 the header was supposed to be ASCII?

Comment 8 David Ryan 2011-10-04 06:01:26 UTC
Confirming UTF-8 on header output. Skynet internal tools DocBook export now building correctly with Publican.
