735904 – Stored topic XML should be UTF-8

Summary: Stored topic XML should be UTF-8

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	PressGang CCMS
Classification:	Community
Component:	Web-UI
Sub Component:
Version:	1.x
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Matthew Casperson
QA Contact:	Dana Mison
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	741505
Depends On:
Blocks:

Reported:	2011-09-06 06:25 UTC by Dana Mison
Modified:	2014-08-04 22:26 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2011-10-05 05:35:14 UTC
Embargoed:

Description Dana Mison 2011-09-06 06:25:42 UTC

20110901-0957 /CustomSearchTopics.xhtml

Currently the generated XML header on topicXML declares UTF-16 encoding. This could be problematic, as most tools generate text files as UTF-8, which creates a conflict when the files are processed as XML.

UTF-8 is more widely used and is less likely to cause problems in the future.

Comment 1 Joshua Wulf 2011-09-27 05:30:23 UTC

*** Bug 741505 has been marked as a duplicate of this bug. ***

Comment 2 Joshua Wulf 2011-09-27 05:31:44 UTC

In fact, Publican cannot build UTF-16 encoded topics:


encoding specified in XML declaration is incorrect at line 1, column 30, byte 30:
<?xml version="1.0" encoding="UTF-16"?>
=============================^

Comment 3 Stephen Gordon 2011-09-27 23:49:22 UTC

After investigating an issue Josh was having exporting topics from Skynet over HTTP using the topic tool, it seems to me that the content itself is actually UTF-8 encoded anyway. (I appear to get an error on the first character outside of the header because it is the UTF-8 encoding of a carriage return rather than the UTF-16 one.) The header needs to be updated to match the content, so UTF-8.

Comment 4 Stephen Gordon 2011-09-28 01:05:15 UTC

Further investigation has shown that I still get errors on these topics even after trying a few different common encodings in the header (utf-8, us-ascii). I'm not entirely sure why, but Xerces chokes on them with this SAXParseException:

[Fatal Error] 32.xml:1:40: Content is not allowed in prolog.

Normally this would mean there are errant characters in front of the header, but this doesn't appear to be the case. I have noticed that DOS-style newlines are also in use, but after running the file through dos2unix I still encounter the same issue.

Comment 5 Matthew Casperson 2011-10-03 22:54:58 UTC

The "Content is not allowed in prolog" error is because of a mismatch of encodings. A UTF-8 document that says it is encoded in UTF-16 will actually be interpreted as having a space before each character. So:

<?xml version="1.0"?>

will be read as:

 < ? x m l   v e r s i o n = " 1 . 0 " ? >

The leading space is then reported as an errant character in front of the header.
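
As a rough illustration of the mismatch, the sketch below decodes the same declaration with the wrong charset in both directions. The class and method names are made up for this example, and what a given parser actually reports will depend on the parser and byte order. The spaced-out rendering shown above resembles UTF-16 bytes read one byte at a time (each character gains a NUL byte, which many viewers show as a blank); UTF-8 bytes read as UTF-16 are instead paired up into unrelated characters. Either way the parser no longer sees a clean declaration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PressGang or any library.
public class EncodingMismatchDemo {

    // UTF-8 bytes decoded as UTF-16 (big-endian): bytes are consumed in pairs,
    // so '<' (0x3C) and '?' (0x3F) merge into the single character U+3C3F.
    static String utf8ReadAsUtf16(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_16BE);
    }

    // UTF-16 (big-endian) bytes decoded one byte at a time: every ASCII
    // character is preceded by a NUL (0x00) byte.
    static String utf16ReadBytewise(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        return new String(utf16, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String declaration = "<?xml version=\"1.0\"?>";
        // Replace NULs with '.' so they are visible on a terminal.
        System.out.println(utf16ReadBytewise(declaration).replace('\u0000', '.'));
        System.out.println(utf8ReadAsUtf16(declaration));
    }
}
```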

Unfortunately the Java XML classes will always serialize an in-memory XML document to a string that says it is encoded with UTF-16 - http://www.w3.org/TR/DOM-Level-3-LS/load-save.html#LS-LSSerializer-writeToString.

The workaround is to do a simple find and replace to force the encoding to UTF-8 after the XML DOM object has been serialized to a string.
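
A minimal sketch of that kind of workaround, assuming the document is serialized with the standard DOM Level 3 LSSerializer. The class name, method names and regular expression here are illustrative assumptions, not the code that went into the fix. The replacement is anchored to the XML declaration so that any mention of UTF-16 in the topic body is left alone.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
public class XmlEncodingFix {

    // Serialize an in-memory DOM document to a string with LSSerializer.
    static String serialize(Document doc) {
        DOMImplementationLS ls =
                (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
        LSSerializer serializer = ls.createLSSerializer();
        return serializer.writeToString(doc);
    }

    // Rewrite encoding="UTF-16" to encoding="UTF-8", but only inside an
    // XML declaration at the very start of the string.
    static String forceUtf8Declaration(String xml) {
        return xml.replaceFirst(
                "(?i)^(<\\?xml[^>]*?encoding=[\"'])UTF-16([\"'])", "$1UTF-8$2");
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(
                        "<section><title>Test</title></section>")));
        String raw = serialize(doc);
        System.out.println(raw);
        System.out.println(forceUtf8Declaration(raw));
    }
}
```

Whoever writes the resulting string out still has to encode it as UTF-8 bytes for the declaration and the content to agree.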

Comment 6 Matthew Casperson 2011-10-03 23:03:48 UTC

Fixed in 20111004-0900

Comment 7 Stephen Gordon 2011-10-04 01:15:02 UTC

I thought that, regardless of whether the content is UTF-8 or UTF-16, the header was supposed to be ASCII?

Comment 8 David Ryan 2011-10-04 06:01:26 UTC

Confirming UTF-8 on header output. DocBook exports from the Skynet internal tools now build correctly with Publican.
