Currently the generated XML header on topicXML is set to UTF-16 encoding. This could be problematic, as most tools generate text files as UTF-8, which creates a conflict when the files are then processed as XML.
UTF-8 is more widely used and less likely to cause problems in the future.
*** Bug 741505 has been marked as a duplicate of this bug. ***
In fact, Publican cannot build UTF-16 encoded topics:
encoding specified in XML declaration is incorrect at line 1, column 30, byte
<?xml version="1.0" encoding="UTF-16"?>
After investigating an issue Josh was having exporting topics from Skynet over HTTP using the topic tool, it seems to me that the content itself is actually UTF-8 encoded anyway (I appear to get an error on the first character after the header because it is the UTF-8 encoding of a carriage return rather than the UTF-16 one). The header needs to be updated to match the content, i.e. UTF-8.
Further investigation has shown that I still get errors on these topics even after trying a few different common encodings in the header (utf-8, us-ascii). I'm not entirely sure why, but Xerces chokes on them with this SAXParseException:
[Fatal Error] 32.xml:1:40: Content is not allowed in prolog.
Normally this would mean there are errant characters in front of the header, but that doesn't appear to be the case here. I have noticed that DOS-style newlines are also in use, but even after running the file through dos2unix I still encounter the same issue.
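For reference, here is a minimal snippet (hypothetical, not the Skynet tooling) that reproduces this class of failure with the default JAXP/Xerces parser: any bytes that appear before the XML declaration cause the same "Content is not allowed in prolog" fatal error:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

public class PrologDemo {
    // Returns the parser's error message, or null if the document parsed cleanly.
    public static String parseError(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        } catch (Exception e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Stray bytes before the declaration trigger the prolog error.
        System.out.println(parseError("x<?xml version=\"1.0\"?><topic/>"));
        // A clean document parses without error.
        System.out.println(parseError("<?xml version=\"1.0\"?><topic/>"));
    }
}
```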
The "Content is not allowed in prolog" error is caused by a mismatch of encodings. A UTF-8 document whose declaration says it is encoded in UTF-16 will actually be interpreted as having a space before each character. So:
<?xml version="1.0"?>
will be read as:
< ? x m l   v e r s i o n = " 1 . 0 " ? >
The leading space is then reported as an errant character in front of the header.
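As a side note, the spaced-out rendering above can be reproduced in Java: UTF-16BE encodes each ASCII character as a NUL byte followed by the ASCII byte, so viewing those bytes through a single-byte charset puts a NUL (which many tools display as a space) in front of every character. This is just an illustration of where the extra "spaces" come from:

```java
import java.nio.charset.StandardCharsets;

public class NulDemo {
    public static void main(String[] args) {
        String decl = "<?xml version=\"1.0\"?>";
        // UTF-16BE: every ASCII character becomes 0x00 followed by the ASCII byte.
        byte[] utf16 = decl.getBytes(StandardCharsets.UTF_16BE);
        // Decode those bytes with a single-byte charset: a NUL precedes each character.
        String mangled = new String(utf16, StandardCharsets.ISO_8859_1);
        // Each character now appears with a space in front of it.
        System.out.println(mangled.replace('\u0000', ' '));
    }
}
```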
Unfortunately the Java XML classes will always serialize an in-memory XML document to a string that says it is encoded with UTF-16 - http://www.w3.org/TR/DOM-Level-3-LS/load-save.html#LS-LSSerializer-writeToString.
The workaround is a simple find-and-replace to force the encoding to UTF-8 after the XML DOM object has been serialized to a string.
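A sketch of that workaround (a hypothetical helper, not the exact code used here): serialize the document with LSSerializer, then rewrite the UTF-16 declaration that writeToString always emits so it matches the UTF-8 bytes the string will eventually be written out as:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

public class Utf8Serializer {
    // LSSerializer.writeToString is specified to produce a string whose XML
    // declaration claims UTF-16, so patch the declaration after serializing.
    public static String serializeAsUtf8(Document doc) {
        DOMImplementationLS ls =
                (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
        LSSerializer serializer = ls.createLSSerializer();
        String xml = serializer.writeToString(doc);
        return xml.replaceFirst("encoding=\"UTF-16\"", "encoding=\"UTF-8\"");
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElement("topic"));
        System.out.println(serializeAsUtf8(doc));
    }
}
```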
Fixed in 20111004-0900
I thought that, regardless of whether the content is UTF-8 or UTF-16, the header was supposed to be ASCII?
Confirming UTF-8 on header output. Skynet internal tools DocBook export now building correctly with Publican.