(Description copied from US31.) This may no longer be a frequent problem. In the past, a number of translators lost their translations when file headers or translation editors overwrote the character encoding for the file. This was a common case for the initscripts file in Fedora, so this check can be a preventive one. Note: POT/PO/XML/Properties import should check for a suitable encoding (probably UTF-8). Make sure the browser renders Zanata and webtrans in Unicode. Clients should ensure that the encoding is correct. If the wrong encoding is used, an error should be shown and nothing should be pushed.
polib 0.7.0 includes a method called detect_encoding. In zanata-python-client, the pofile is loaded with the default settings, which means polib does auto-detection, so it should be fine on the reading side. On the writing side, polib doesn't generate headers automatically, but polib should check that the encoding in the header matches the actual encoding (in most cases UTF-8) used when saving a PO file. At the same time, we should make sure we write the correct encoding into the header. So I will try to add an extra check to polib 0.7.0 for a match between the header and the actual encoding.
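The proposed header/actual-encoding match could look roughly like this minimal sketch (the function names are illustrative, not polib's actual API):

```python
import re

# Illustrative sketch only: compare the charset declared in the PO
# header with the encoding actually used when saving the file.
def charset_from_header(header_text):
    """Extract the charset value from a Content-Type header line."""
    m = re.search(r'charset=([-\w]+)', header_text)
    return m.group(1) if m else None

def encoding_matches_header(header_text, actual_encoding):
    """True if the declared charset and the actual encoding agree,
    ignoring case and hyphens (so 'UTF-8' matches 'utf8')."""
    declared = charset_from_header(header_text)
    if declared is None:
        return False
    norm = lambda s: s.lower().replace("-", "")
    return norm(declared) == norm(actual_encoding)
```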
*** Bug 758630 has been marked as a duplicate of this bug. ***
A survey of Google Code Search finds real-world POT files with these charset headers: CHARSET, utf-8, iso-8859-1, iso-8859-15, ascii, latin1, gbk, de_DE, "" (empty string), utf8, iso-8859-2, CP1252. (de_DE and "" look like mistakes to me.) A survey of all the POT files in the src directory on my computer (a collection of random open source projects) finds only these: CHARSET, UTF-8, utf-8. I think we could get away with accepting any POT file where the charset has one of the above three values (plus UTF8, ASCII) without a warning, and accepting anything else with a warning but treating it as UTF-8 anyway. Even if the POT file declares a different charset, it probably only contains plain ASCII or UTF-8 characters anyway. (Maybe Latin-1 with some toolchains.) Or, if we want to be safer, we could reject POT files which have other charset values.
Hi, On the writing side, I have committed a pull request to polib and am waiting for a response from the polib author. On the reading side, I added a check for the charset in the header entry of the POT, which gives a warning if the charset value is not one of these: [CHARSET, UTF-8, utf-8, utf8, UTF8, ascii]. For PO files, if the charset value equals 'CHARSET', I will also give a warning, since that value is not acceptable for PO files. One thing I am concerned about is that polib uses Python's codecs library to detect the encoding of a POT file. If the encoding is valid, polib will try to decode the file with that encoding. Please check the list of encodings supported by Python: http://docs.python.org/library/codecs.html#standard-encodings Based on that list, "iso-8859-1, iso-8859-15, ascii, latin1, gbk, iso-8859-2, CP1252" are valid Python encodings, so polib will decode the POT file with those encodings, not UTF-8. If we want to force decoding with UTF-8, we could use polib.pofile('name', encoding="UTF-8")
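A rough sketch of the reading-side check described above (the helper name and warning messages are hypothetical, not the actual zanata-python-client code):

```python
# Illustrative sketch of the charset check described above; names and
# messages are hypothetical, not the actual zanata-python-client code.
ACCEPTED_CHARSETS = {"charset", "utf-8", "utf8", "ascii"}

def check_charset(charset, is_pot):
    """Return a warning string, or None if the charset is acceptable."""
    value = (charset or "").lower()
    if not is_pot and value == "charset":
        # 'CHARSET' is only the template placeholder; it is not a
        # valid value in a translated PO file.
        return "charset 'CHARSET' is not acceptable for PO files"
    if value not in ACCEPTED_CHARSETS:
        return "unrecognised charset %r; treating file as UTF-8" % charset
    return None
```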
Hi James, We should accept ASCII too, so I hope it's not case sensitive. Is polib trying to auto-detect the encoding, or is it using the charset encoding metadata from the Gettext header? The Python standard encodings do seem to include UTF-8, are you just saying that it will try the other encodings first?
James, I assume this is the pull request? https://bitbucket.org/izi/polib/pull-request/2/add-check-for-charset-in-content-type-of Assuming the pull request is accepted, how far apart are the polib releases? Perhaps we should patch the polib Fedora package in the meantime.
Hi Sean,

> We should accept ASCII too, so I hope it's not case sensitive.

OK, I will add ASCII too.

> Is polib trying to auto-detect the encoding, or is it using the charset
> encoding metadata from the Gettext header?

polib will use the charset encoding metadata from the Gettext header; if that is not an encoding supported by Python, it will use UTF-8 instead.

> The Python standard encodings do
> seem to include UTF-8, are you just saying that it will try the other encodings
> first?

Yes, I am saying that polib will try the other encoding first if it is a supported encoding and is set in the charset metadata of the Gettext header, like iso-8859-1, iso-8859-15, ascii, latin1, gbk, etc.

>>> I think we could get away with accepting any POT file where charset has one of
>>> the above three values (plus UTF8, ASCII) without warning, and accepting
>>> anything else with a warning, but treating it as UTF-8 anyway.

For example, if the charset metadata in the Gettext header is set to gbk, polib will decode the PO file with gbk, not UTF-8, since gbk is supported by Python. So I just want to make sure that we really do want to force UTF-8 in this situation.
(In reply to comment #6)
> James, I assume this is the pull request?
>
> https://bitbucket.org/izi/polib/pull-request/2/add-check-for-charset-in-content-type-of
>
> Assuming the pull request is accepted, how far apart are the polib releases?
> Perhaps we should patch the polib Fedora package in the meantime.

OK, I will ask for Ding's help to patch the polib Fedora package.
(In reply to comment #7)
> Hi Sean,
>
> > We should accept ASCII too, so I hope it's not case sensitive.
>
> OK, i will add ASCII too.
>
> > Is polib trying to auto-detect the encoding, or is it using the charset
> > encoding metadata from the Gettext header?
>
> polib will use the charset encoding metadata from the Gettext Header, then if
> it is not an supported encoding of python, then it will use UTF-8 instead.
>
> > The Python standard encodings do
> > seem to include UTF-8, are you just saying that it will try the other encodings
> > first?
>
> Yeah, i am saying that polib will try the other encodings first, if it is an
> supported encoding and set in charset encoding metadata from the Gettext
> Header, like iso-8859-1, iso-8859-15, ascii, latin1, gbk, etc.
>
> >>> I think we could get away with accepting any POT file where charset has one of
> >>> the above three values (plus UTF8, ASCII) without warning, and accepting
> >>> anything else with a warning, but treating it as UTF-8 anyway.
>
> For example, if the charset encoding metadata from the Gettext is set to gbk,
> polib will decode po file with gbk but not UTF-8, since gbk is supported by
> python. So i just want to make sure that we do want to force UTF-8 in this
> situation.

No, using the encoding from the Gettext header, or failing that the auto-detected encoding, should be fine. We don't need to force UTF-8 if polib can do better than that; it's just that UTF-8 seemed like a reasonable compromise. Does polib tell you when it has to use auto-detection, rather than the Gettext header? If so, we could generate a warning in that case. Otherwise I don't think a warning is needed.
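The fallback behaviour discussed above (trust the header charset when Python supports it, otherwise fall back to UTF-8) can be sketched with the standard codecs module; resolve_encoding is an illustrative name, not polib's actual API:

```python
import codecs

# Sketch of the behaviour described above: use the charset from the
# Gettext header if Python supports it, otherwise fall back to UTF-8.
def resolve_encoding(header_charset, fallback="utf-8"):
    try:
        codecs.lookup(header_charset)  # raises LookupError if unknown
        return header_charset
    except (LookupError, TypeError):
        return fallback
```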
Added case-insensitive validation for PO Content-Type header. If the character set on this header does not have one of the following values (case-insensitive), the server will assume UTF-8 and return a warning message to the clients: UTF8, UTF-8, CHARSET, ASCII.
In About Fedora/f11/About_Fedora.pot, I have changed the line
Content-Type: application/x-xml2pot; charset=UTF-8\n
to
Content-Type: application/x-xml2pot; charset=BBB\n
Both the python client (1.3.3) and the mvn client (API timestamp 20120127-0946, but server API timestamp 20120127-0944) let this slip through.
what version of the mvn client and server were you using? 1.5-SNAPSHOT?
Hi Ding, Yes, the python client only shows a warning message about the wrong charset value. Do you think the python client should stop processing? Or should we continue to push the content but change the charset value to UTF-8? Which behaviour is better for you?
We really should be checking/fixing the charset at the point where it is used for reading or writing, so I've moved it to the PO classes which are used by client and server. The charset is now checked by the Maven client before uploading, and an unsupported charset will cause an error. Also, when writing gettext files, the used encoding (UTF-8) is written into the gettext header: https://github.com/zanata/zanata/commit/eaae69522de72b9fe9c140cb25c4cdb5c1d9dbe3 (master) James, the original report said unsupported encoding should cause an error, so Ding is right; we should abort the push, not just log a warning. Also when writing out PO/POT files (on pull) the Python client should modify the Content-Type charset to specify UTF-8, since that's what we are writing. See the above commit for what I did in the Java code.
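A minimal Python sketch of the header rewrite described above, preserving the mime type while forcing the charset to UTF-8 (normalise_content_type is a hypothetical name, not the actual Zanata code):

```python
# Hypothetical sketch (not the actual Zanata code) of rewriting the
# Content-Type header when writing a PO file: keep the mime type but
# force the charset to UTF-8, since that is the encoding we write.
def normalise_content_type(content_type, charset="UTF-8"):
    mime = content_type.split(";")[0].strip() or "text/plain"
    return "%s; charset=%s" % (mime, charset)
```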
Also removed server-side check: https://github.com/zanata/zanata/commit/c16ae7af44c5e10203a1fc7f2b9615cd784dcf6f
Specify the Content-Type charset as UTF-8 when writing out PO files: https://github.com/zanata/zanata-python-client/commit/a4b37148e44afb488d8a8db18ab32d179379d6a1
Please make sure you preserve the mime type. You should be able to use something virtually the same as I used in Java.
Thanks Sean, I have modified the python client code to preserve the mime type: https://github.com/zanata/zanata-python-client/commit/607ff4a2730a9ac1eb431e140958a25f90b24f11
James, the program works as expected, but it does not give the reason when the .po file declares an unsupported encoding.
Maven client verified with 1.5-SNAPSHOT. Note that 1.4-SNAPSHOT will not have this fix.
Hi Ding, I think Sean has modified the log message to give the reason when an unsupported encoding is detected: https://github.com/zanata/zanata-python-client/commit/b9eff21275201cfa05f8eb410dad3b8b561186e4 Could you help verify that?
VERIFIED with maven: 1.5-SNAPSHOT python: 1.3.3-11-g607f
*** Bug 694720 has been marked as a duplicate of this bug. ***
zanata-python-client-1.3.4-1.fc16.noarch
While pushing nds.po I am getting the error:
"error: Unsupported encoding; please change the Content-Type charset (UTF-8 recommended)"
nds.po contains "Content-Type: text/plain; charset=ISO-8859-1\n"
Hi Pravin, Yes, the behaviour of zanata-python-client is correct. The Zanata server only supports UTF-8 encoding, so you need to change the charset in the header of the PO file to UTF-8.
No problem, I will update the encoding. I have not played much with ISO-8859-1 and UTF-8 compatibility, but if both are fully compatible, I think it would be better to update the encoding while pushing files to Zanata. Instead of popping up an error message, a message like "Enhanced encoding of "xyz.po" from "X" to "UTF-8" for better compatibility" would be better.
Hi Pravin, Thanks a lot for your suggestion. I have just pushed version 1.3.4 of zanata-python-client to stable, so I will keep the current error message for now and close this for the next release. I will change the error message along the lines of your suggestion when I package 1.3.5.
I just found an example of a non-UTF-8 PO file: ja.po in GNU tar.
PO header:
"Project-Id-Version: GNU tar 1.25\n"
"Report-Msgid-Bugs-To: bug-tar\n"
"POT-Creation-Date: 2011-03-12 11:53+0200\n"
"PO-Revision-Date: 2010-11-08 17:57+0900\n"
"Last-Translator: Masahito Yamaga <ma>\n"
"Language-Team: Japanese <translation-team-ja.net>\n"
"Language: ja\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=EUC-JP\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=0;\n"
============================================
It is easy, though, to convert it to a UTF-8 PO file with the following commands:
cp ja.po ja.po.orig
iconv -f EUC-JP -t UTF-8 -o ja.po ja.po.orig
sed -i -e 's/charset=EUC-JP/charset=UTF-8/' ja.po
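For reference, here is a Python equivalent of the iconv + sed steps above (a sketch; convert_po_to_utf8 is an illustrative helper, not part of any client):

```python
import os
import tempfile

def convert_po_to_utf8(path, source_encoding):
    """Re-encode a PO file to UTF-8 and fix its charset declaration:
    a Python sketch of the iconv + sed commands shown above."""
    with open(path, "r", encoding=source_encoding) as f:
        text = f.read()
    text = text.replace("charset=%s" % source_encoding, "charset=UTF-8")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

# Demo on a throwaway file written in EUC-JP.
demo = os.path.join(tempfile.mkdtemp(), "ja.po")
with open(demo, "w", encoding="EUC-JP") as f:
    f.write('"Content-Type: text/plain; charset=EUC-JP\\n"\n'
            'msgid "tar"\nmsgstr "日本語"\n')
convert_po_to_utf8(demo, "EUC-JP")
```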