Description of problem: The httpd.conf distributed with FC2 contains a number of character set errors. 1. It sets the default character set to UTF-8 (AddDefaultCharset UTF-8). This violates the W3C standard (as indicated in the comment above it). 2. It defines aliases for ISO-8859-1, -2, -3, and so on, as Latin1, 2, 3, and so on. This is valid for some of them but not for all. In particular, Latin9 is equivalent to ISO-8859-15, not ISO-8859-9.
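To illustrate point 2, here is a minimal sketch of how the affected mappings could look once corrected. The exact extension lists in the shipped httpd.conf are not quoted in this report, so the extensions below are assumptions. Only the Latin numbering is certain: ISO-8859-9 is Latin-5 and ISO-8859-15 is Latin-9.

```apache
# Sketch only: the extension names are assumed, not copied from the FC2 file.
# ISO-8859-9 (Turkish) is Latin-5, so it should not be aliased as .latin9.
AddCharset ISO-8859-9  .iso8859-9  .latin5
# ISO-8859-15 is the character set actually known as Latin-9.
AddCharset ISO-8859-15 .iso8859-15 .latin9
```

With mappings like these, a file named for example index.html.latin9 would be served with charset=ISO-8859-15.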
1. Using a default charset of UTF-8 does not violate any standard. The comment just states that if you do *not* specify a charset, browsers are required to presume the charset is ISO-8859-1. 2. Yes, thanks for the report, this was fixed upstream recently. There was discussion about removing these lines completely, too: do you actually use (or want to use) files named .latinX, etc?
1. Right, although the comment "merely stating the obvious" gives me the impression that the intention was to specify ISO-8859-1 instead. 2. No, I wouldn't. I'd prefer to use a suitable <meta http-equiv="content-type" content="text/html; charset=iso-8859-15"> (and have it transferred as a real 'Content-Type' header) instead. Thanks for the good work!
OK, thanks. We're past the point of making changes to the default config for FC3 now, and this is not really critical, so this change will be deferred to FC4. So for FC4 we will: 1. update the comment above AddDefaultCharset to be less confusing w.r.t. the UTF-8 default; 2. remove or comment out all the AddCharset lines, which nobody really needs anyway.
I'm afraid there's more to it. As the W3C states, an HTML document must be assumed to be ISO-8859-1 unless specified otherwise. With the "AddDefaultCharset UTF-8" setting, such a document will be served by Apache with a "Content-Type: text/html; charset=UTF-8" header. In other words, Apache forces it to be interpreted as a UTF-8 document, which I think is not in accordance with the W3C guidelines. Moreover, browsers seem to give priority to the server's Content-Type header over a <meta http-equiv="content-type" ...>. This makes it impossible for a document to specify its own character set. In the current situation, all plain .html documents are treated as UTF-8, and I really think this is not correct. Therefore I think that _IF_ a default charset is used, it can only be ISO-8859-1, _AND_ if a document specifies a charset in a <meta> tag, this tag must override the default charset, which implies it must be 'promoted' to a real Content-Type header (otherwise browsers will ignore the <meta> tag because of the explicit Content-Type header). Another option is to remove the AddDefaultCharset setting completely, in which case the Content-Type header will be "text/html" (without a charset) and everything works as it should.
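As a sketch of the last option, the default can be switched off explicitly rather than deleting the line. This assumes the stock AddDefaultCharset directive from the Apache core, which only affects text/plain and text/html responses that do not already carry a charset.

```apache
# Shipped default discussed in this report; static .html files are sent as
#   Content-Type: text/html; charset=UTF-8
#AddDefaultCharset UTF-8

# Alternative proposed above: add no charset parameter, so the response is
#   Content-Type: text/html
# and a <meta http-equiv="content-type" ...> in the document can take effect.
AddDefaultCharset Off
```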
You seem to be arguing that there is a W3C spec which states that it is incorrect to send "Content-Type: text/html; charset=utf-8" by default. This is not the case; please find a specific spec reference to back this up. Not sending a default charset can allow some cross-site-scripting attacks (see http://www.cert.org/tech_tips/malicious_code_mitigation.html); that is why we must present a default charset. Given that 1) we must present a default charset, and 2) applications by default run in a UTF-8 locale in Fedora Core, and hence newly created content is UTF-8, the only sane AddDefaultCharset setting for httpd.conf in Fedora Core is UTF-8. Yes, browsers are required to honour the charset in the Content-Type header over a META tag; that's in the HTML spec. This means you need to change the charset in the Content-Type header to match your content, either globally in httpd.conf or locally in an .htaccess file. The current default charset will of course be wrong if your content is all encoded as ISO-8859-1. And a default charset of ISO-8859-1 will be wrong if your content is all UTF-8. That's why it's configurable.
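For reference, a minimal sketch of the two ways to match the header to the content. The directory path is hypothetical, and the .htaccess variant assumes that AllowOverride FileInfo (or All) is in effect for that directory.

```apache
# Globally, in httpd.conf: all content on this server is ISO-8859-1.
AddDefaultCharset ISO-8859-1

# Per directory, in httpd.conf: only one tree uses a different encoding.
# /var/www/html/legacy is a hypothetical path.
<Directory "/var/www/html/legacy">
    AddDefaultCharset ISO-8859-2
</Directory>

# Locally, the same single line can instead go in an .htaccess file
# inside that directory:
#   AddDefaultCharset ISO-8859-2
```

The most specific setting applies, so the rest of the server keeps the global default while the one directory is served with its own charset.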
I've just been bitten by this. And I must say I don't like it (CERT tip or not). I have a bunch of ISO-8859-2 pages, all of which use the meta tag to specify the charset. When I moved them to my FC2 machine, all of a sudden things were broken. By sheer luck I found that Apache was configured to force the charset of all pages to UTF-8. Argh!!!!! Needless to say, the line was commented out immediately. IMHO, it is up to the developer of the program that delivers dynamic content to specify the charset, not for the server to enforce it on *everything*.
$SUBJECT is done now for FC4.