Bug 1043350

Summary: Em dash causes parser error
Product: [Community] PressGang CCMS
Reporter: Zac Dover <zdover>
Component: Web-UI
Assignee: Matthew Casperson <mcaspers>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 1.3
CC: cbredesen, lnewson, mcaspers
Target Milestone: ---
Target Release: 1.4
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-02-23 23:44:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Zac Dover 2013-12-16 04:32:41 UTC
Description of problem:
An em dash in the middle of a <para> causes this to appear:

  topic.xml:15: parser error : PCDATA invalid Char value 20

Version-Release number of selected component (if applicable):


How reproducible:
Add an em dash to the middle of a <para>. To do this, set the compose key, then hold down the compose key while pressing the hyphen key three times.

Steps to Reproduce:
1.
2.
3.

Actual results:

 topic.xml:15: parser error : PCDATA invalid Char value 20

Expected results:

 Instead of "topic.xml:15: parser error : PCDATA invalid Char value 20", the message should be "The XML is well-formed."

Additional info:

Comment 1 Lee Newson 2013-12-16 04:42:44 UTC
This works fine if you use the &mdash; entity. I believe this is the preferred approach, and as such em dashes shouldn't be used directly. However, I'd have to check with Matt on that one.
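
For content that has to stay ASCII-only, one possible workaround is to convert literal non-ASCII characters to character references before validation. The sketch below is illustrative only and is not part of PressGang; the function name to_ascii_xml is hypothetical. It emits numeric references (&#8212;) rather than &mdash;, on the assumption that a named entity would need to be declared in a DTD.

```python
def to_ascii_xml(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    # "xmlcharrefreplace" is a standard Python codec error handler.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# "\u2014" is the em dash; the escape keeps this source file ASCII-only too.
para = "<para>before \u2014 after</para>"
converted = to_ascii_xml(para)
print(converted)  # <para>before &#8212; after</para>
```

ASCII characters pass through unchanged, so running this over a topic that contains no non-ASCII characters is a no-op.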

Comment 2 Matthew Casperson 2014-01-12 20:48:38 UTC
This is probably something to do with the way the Emscripten virtual file system deals with UTF-8 files. It looks like any non-ASCII character will cause validation issues.
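
The value 20 in the error message is consistent with the em dash (U+2014) being cut down to its low byte, 0x14, instead of being written as three UTF-8 bytes. This is an inference from the numbers and is not confirmed anywhere in this report. The sketch below only shows the arithmetic, plus a check against the XML 1.0 Char production; is_xml10_char is a hypothetical helper.

```python
EM_DASH = "\u2014"

# Correct UTF-8 encoding of the em dash: three bytes.
utf8_bytes = EM_DASH.encode("utf-8")  # b'\xe2\x80\x94'

# Assumed faulty write: keep only the low 8 bits of the code point.
truncated = ord(EM_DASH) & 0xFF  # 0x14, which is 20 in decimal


def is_xml10_char(cp: int) -> bool:
    """Return True if the code point is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


print(truncated)                      # 20
print(is_xml10_char(truncated))       # False: a control character
print(is_xml10_char(ord(EM_DASH)))    # True: the em dash itself is legal
```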

Comment 3 Matthew Casperson 2014-01-12 20:53:06 UTC
This pull request describes an issue with UTF-8 files and the Emscripten virtual file system - https://github.com/kripken/emscripten/pull/402. It appears to have been fixed a year ago, but the xml.js library we are using (https://github.com/kripken/xml.js) is two years old.

I'll have to recompile xml.js with the latest version of emscripten.

Comment 4 Matthew Casperson 2014-01-15 01:53:16 UTC
Fixed in 201401151143 and deployed to the dev server.

xmllint has been recompiled from libxml2 2.9.1 with the latest version of Emscripten. The library and instructions on the compilation process can be found at https://github.com/pressgang-ccms/xsltproc.js.

Now all UTF-8 characters, like the em dash, will validate properly.
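
The expected behaviour can be checked outside the web UI with any conforming XML parser. The sketch below uses Python's standard xml.etree.ElementTree rather than the recompiled xmllint, so it shows what a correct result looks like, not the PressGang code path. The "bad" input assumes the em dash was reduced to the single byte 0x14, which is a guess based on the error message.

```python
import xml.etree.ElementTree as ET


def is_well_formed(data: bytes) -> bool:
    """Return True if the bytes parse as a well-formed XML document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# A paragraph with a literal em dash, correctly encoded as UTF-8.
good = "<para>before \u2014 after</para>".encode("utf-8")

# The same paragraph with the em dash replaced by the control byte 0x14.
bad = b"<para>before \x14 after</para>"

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```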

Comment 5 Lee Newson 2014-01-28 00:32:37 UTC
Verified that mdash as well as other UTF-8 characters validate correctly.

Note: This also fixed an issue with characters from other languages being marked as invalid.