Description of problem:
An em dash in the middle of a <para> causes this to appear:
topic.xml:15: parser error : PCDATA invalid Char value 20

Version-Release number of selected component (if applicable):

How reproducible:
Add an em dash to the middle of a <para>. To do this, set the compose key, then hold it down while pressing the hyphen key three times.

Steps to Reproduce:
1.
2.
3.

Actual results:
topic.xml:15: parser error : PCDATA invalid Char value 20

Expected results:
Instead of "topic.xml:15: parser error : PCDATA invalid Char value 20", the message should be "The XML is well-formed."

Additional info:
This works fine if you use the &mdash; entity. I believe this is the preferred approach, and as such em dashes shouldn't be used directly. However, I'd have to check with Matt on that one.
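As a rough illustration of the workaround, the sketch below replaces every non-ASCII character with a numeric character reference, so the file contains only ASCII. The helper name escape_non_ascii is hypothetical and not part of the editor. It also emits numeric references rather than the named entity, because numeric references need no DTD.

```python
def escape_non_ascii(xml_text: str) -> str:
    """Replace each non-ASCII character with a numeric character reference.

    Hypothetical helper for illustration only; relies on Python's built-in
    "xmlcharrefreplace" error handler.
    """
    return xml_text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    # "\u2014" is the em dash (U+2014, decimal 8212).
    print(escape_non_ascii("<para>before\u2014after</para>"))
    # prints: <para>before&#8212;after</para>
```

Text that is already pure ASCII passes through unchanged, so the helper is safe to run over a whole topic.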
This is probably something to do with the way the Emscripten virtual file system deals with UTF-8 files. It looks like any non-ASCII character will cause validation issues.
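One possible explanation for the specific value 20 in the error, offered as an assumption rather than a confirmed diagnosis: if the file is written with each character truncated to its low byte instead of being encoded as UTF-8, the em dash (U+2014) becomes byte 0x14, which is 20 in decimal and is not a legal XML 1.0 character. The function names below are illustrative only.

```python
EM_DASH = "\u2014"  # U+2014


def utf8_bytes(text: str) -> list[int]:
    """Bytes a correct UTF-8 writer would produce."""
    return list(text.encode("utf-8"))


def truncate_to_low_byte(text: str) -> list[int]:
    """ASSUMED failure mode: keep only the low 8 bits of each code point."""
    return [ord(ch) & 0xFF for ch in text]


def is_xml_char(cp: int) -> bool:
    """Char production from the XML 1.0 specification."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


if __name__ == "__main__":
    print(utf8_bytes(EM_DASH))            # [226, 128, 148]
    print(truncate_to_low_byte(EM_DASH))  # [20]
    print(is_xml_char(20))                # False
```

The truncated value matches the number in the parser error, which is consistent with (but does not prove) an encoding problem in the file system layer.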
This pull request describes an issue with UTF-8 files and the Emscripten virtual file system: https://github.com/kripken/emscripten/pull/402. It appears to have been fixed a year ago, but the xml.js library we are using (https://github.com/kripken/xml.js) is two years old. I'll have to recompile xml.js with the latest version of Emscripten.
Fixed in 201401151143 and deployed to the dev server. xmllint has been recompiled from libxml2 2.9.1 with the latest version of Emscripten. The library and instructions on the compilation process can be found at https://github.com/pressgang-ccms/xsltproc.js. Now all UTF-8 characters, like the mdash, will validate properly.
Verified that mdash as well as other UTF-8 characters validate correctly. Note: This also fixed an issue with characters from other languages being marked as invalid.
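A check along these lines can also be scripted. The sketch below uses Python's standard-library XML parser as a stand-in for the recompiled xmllint, so it only illustrates the kind of input that should pass and fail, not the behaviour of the deployed build. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(data: bytes) -> bool:
    """Return True if the bytes parse as well-formed XML (UTF-8 by default)."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # Em dash and characters from another language, correctly UTF-8 encoded.
    good = "<para>one\u2014two \u65e5\u672c\u8a9e</para>".encode("utf-8")
    # The control character 0x14 (decimal 20) from the original error.
    bad = b"<para>one\x14two</para>"
    print(is_well_formed(good))
    print(is_well_formed(bad))
```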