Bug 748727 - US31 As a translator I want the appropriate character encoding for my language to be used so that the content is saved in the correct encoding format
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Zanata
Classification: Retired
Component: Component-Maven
Version: 1.4-SNAPSHOT
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: James Ni
QA Contact: Ding-Yi Chen
URL: https://community.rallydev.com/slm/ra...
Whiteboard:
Duplicates: 694720 758630
Depends On:
Blocks: 748696 756230 zanata-1.5.0
 
Reported: 2011-10-25 07:51 UTC by Sean Flanigan
Modified: 2013-07-10 07:15 UTC (History)

Fixed In Version: 1.3.4
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-04-05 06:35:05 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 758630 1 None None None 2021-01-20 06:05:38 UTC
Red Hat Bugzilla 982891 0 unspecified CLOSED Default encoding of PO files should be UTF-8 in Windows 2021-02-22 00:41:40 UTC

Internal Links: 758630 982891

Description Sean Flanigan 2011-10-25 07:51:51 UTC
(Description copied from US31.)

This may not be a frequent problem any more. Earlier, a number of translators often lost their translations if the file headers or translation editors overwrote the character encoding for the file. This was a common case for the initscripts file in Fedora. The check can be a preventive one.

Note: POT/PO/XML/Properties import should check for suitable encoding (probably UTF-8).

Make sure browser renders Zanata and webtrans in Unicode.


Clients should ensure that encoding is correct. If wrong encoding is used, an error should be shown and nothing should be pushed.

Comment 1 James Ni 2011-10-27 03:13:43 UTC
polib 0.7.0 includes a function called detect_encoding. In zanata-python-client, the PO file is loaded with the default settings, which means polib auto-detects the encoding, so the reading side should be fine.

On the writing side, polib doesn't generate headers automatically, but when saving a PO file it should check that the encoding in the header matches the encoding actually used (in most cases UTF-8). At the same time, we should make sure we put the correct encoding into the header.

So I will try to add an extra check to polib 0.7.0 that the header and the actual encoding match.
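
A minimal sketch of such a header-versus-content check, using only the standard library rather than polib. The function names and the regular expression are hypothetical and are not taken from the polib pull request.

```python
import re

# Matches the charset parameter of a gettext Content-Type header line.
CHARSET_RE = re.compile(rb"charset=([A-Za-z0-9_.:-]+)")


def declared_charset(po_bytes):
    """Return the charset named in the Content-Type header, or None."""
    match = CHARSET_RE.search(po_bytes)
    return match.group(1).decode("ascii") if match else None


def header_matches_content(po_bytes):
    """True if the whole file decodes cleanly with its declared charset."""
    charset = declared_charset(po_bytes)
    if charset is None:
        return False
    try:
        po_bytes.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name, or bytes that are invalid in that codec.
        return False
    return True
```

A clean decode is only a necessary condition: single-byte codecs such as iso-8859-1 accept any byte sequence, so this check can catch a wrong "ascii" or "UTF-8" declaration but not every mismatch.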

Comment 2 Runa Bhattacharjee 2011-11-30 10:16:29 UTC
*** Bug 758630 has been marked as a duplicate of this bug. ***

Comment 3 Sean Flanigan 2011-12-07 08:14:27 UTC
A survey of Google Code Search finds real-world POT files with these charset headers:

CHARSET
utf-8
iso-8859-1
iso-8859-15
ascii
latin1
gbk
de_DE
"" (empty string)
utf8
iso-8859-2
CP1252

(de_DE and "" look like mistakes to me.)

A survey of all the POT files in the src directory on my computer (a collection of random open source projects) only finds these:

CHARSET 
UTF-8
utf-8


I think we could get away with accepting any POT file where charset has one of the above three values (plus UTF8, ASCII) without warning, and accepting anything else with a warning, but treating it as UTF-8 anyway.  Even if the POT file declares a different charset, it probably only contains plain ASCII or UTF-8 characters anyway.  (Maybe Latin-1 with some toolchains.)

Or if we want to be safer, we could reject POT files which have other charset values.

Comment 4 James Ni 2011-12-13 03:39:33 UTC
Hi, 

For the writing side, I have submitted a pull request to polib and am waiting for a response from the polib author.

For the reading side, I added a check for the charset in the header entry of POT files, which gives a warning if the charset value is not one of these:

[CHARSET, UTF-8, utf-8, utf8, UTF8, ascii]

For PO files, if the charset value equals 'CHARSET', I will also give a warning, since this value is not acceptable for PO files.

One thing I am concerned about is that polib uses Python's codecs library to detect the encoding of the POT file. If the encoding is valid, polib will try to decode the file with that encoding. Please check the list of encodings supported by Python.
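
The reading-side rule described above could be sketched roughly as follows. This is an illustration only: the names are hypothetical, and the comparison is made case-insensitive here instead of listing each spelling.

```python
# Charset values accepted without a warning, compared case-insensitively.
ACCEPTED = {"utf-8", "utf8", "ascii"}
# Placeholder that xgettext leaves in templates; fine in POT, not in PO.
PLACEHOLDER = "charset"


def charset_warning(charset, is_template):
    """Return a warning string for a suspicious charset, or None if it is fine."""
    value = charset.strip().lower()
    if value in ACCEPTED:
        return None
    if value == PLACEHOLDER and is_template:
        return None
    kind = "POT" if is_template else "PO"
    return "unexpected charset %r in %s file" % (charset, kind)
```

With this sketch, `charset_warning("CHARSET", is_template=True)` is silent, while the same value in a PO file produces a warning.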

http://docs.python.org/library/codecs.html#standard-encodings

Based on the list, "iso-8859-1, iso-8859-15, ascii, latin1, gbk, iso-8859-2, CP1252" are valid encodings for Python, so polib will decode the POT file with these encodings, not UTF-8. If we want to force decoding as UTF-8, we could probably use polib.pofile('name', encoding="UTF-8")
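
To illustrate why the choice of codec matters, here is a small standard-library sketch (the sample string is made up). Decoding UTF-8 bytes with a valid but wrong codec does not fail; it silently yields garbled text.

```python
# Hypothetical sample: UTF-8 bytes for a translated string.
raw = "café".encode("utf-8")

# Decoding with a valid but wrong codec raises no error; it yields mojibake.
as_latin1 = raw.decode("iso-8859-1")

# Decoding with the encoding that was actually used round-trips the text.
as_utf8 = raw.decode("utf-8")

print(repr(as_latin1))  # 'cafÃ©'
print(repr(as_utf8))    # 'café'
```

So a file that declares iso-8859-1 but really contains UTF-8 would be read without any error, and the damage would only show up in the translated strings.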

Comment 5 Sean Flanigan 2011-12-13 07:19:26 UTC
Hi James,

We should accept ASCII too, so I hope it's not case sensitive.

Is polib trying to auto-detect the encoding, or is it using the charset encoding metadata from the Gettext header?  The Python standard encodings do seem to include UTF-8, are you just saying that it will try the other encodings first?

Comment 6 Sean Flanigan 2011-12-13 07:27:41 UTC
James, I assume this is the pull request? 

https://bitbucket.org/izi/polib/pull-request/2/add-check-for-charset-in-content-type-of

Assuming the pull request is accepted, how far apart are the polib releases? Perhaps we should patch the polib Fedora package in the meantime.

Comment 7 James Ni 2011-12-13 08:46:26 UTC
Hi Sean,

> We should accept ASCII too, so I hope it's not case sensitive.

OK, I will add ASCII too.

> Is polib trying to auto-detect the encoding, or is it using the charset
> encoding metadata from the Gettext header?  

polib will use the charset encoding metadata from the Gettext header; if that is not an encoding supported by Python, it will use UTF-8 instead.

>The Python standard encodings do
> seem to include UTF-8, are you just saying that it will try the other encodings
> first?

Yeah, I am saying that polib will try the other encodings first if they are supported and set in the charset encoding metadata of the Gettext header, like iso-8859-1, iso-8859-15, ascii, latin1, gbk, etc.

>>>I think we could get away with accepting any POT file where charset has one of
>>>the above three values (plus UTF8, ASCII) without warning, and accepting
>>>anything else with a warning, but treating it as UTF-8 anyway.

For example, if the charset encoding metadata in the Gettext header is set to gbk, polib will decode the PO file as gbk rather than UTF-8, since gbk is supported by Python. So I just want to make sure that we do want to force UTF-8 in this situation.
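
The selection rule described in this comment (use the header charset when Python has a codec for it, otherwise UTF-8) can be approximated with the standard codecs module. This is a sketch of the rule as stated here, not polib's actual code, and the function name is hypothetical.

```python
import codecs


def effective_encoding(header_charset, fallback="utf-8"):
    """Return the header charset if Python has a codec for it, else the fallback."""
    try:
        codecs.lookup(header_charset)
    except LookupError:
        # Not a codec Python knows, e.g. the "CHARSET" placeholder.
        return fallback
    return header_charset
```

Under this rule "gbk" and "iso-8859-1" are used as declared, while "CHARSET" or "BBB" fall back to UTF-8.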

Comment 8 James Ni 2011-12-13 08:48:23 UTC
(In reply to comment #6)
> James, I assume this is the pull request? 
> 
> https://bitbucket.org/izi/polib/pull-request/2/add-check-for-charset-in-content-type-of
> 
> Assuming the pull request is accepted, how far apart are the polib releases?
> Perhaps we should patch the polib Fedora package in the meantime.

OK, I will ask Ding for help with patching the polib Fedora package.

Comment 9 Sean Flanigan 2011-12-13 23:57:17 UTC
(In reply to comment #7)
> Hi Sean,
> 
> > We should accept ASCII too, so I hope it's not case sensitive.
> 
> OK, i will add ASCII too.
> 
> > Is polib trying to auto-detect the encoding, or is it using the charset
> > encoding metadata from the Gettext header?  
> 
> polib will use the charset encoding metadata from the Gettext Header, then if
> it is not an supported encoding of python, then it will use UTF-8 instead. 
> 
> >The Python standard encodings do
> > seem to include UTF-8, are you just saying that it will try the other encodings
> > first?
> 
> Yeah, i am saying that polib will try the other encodings first, if it is an
> supported encoding and set in charset encoding metadata from the Gettext
> Header, like iso-8859-1, iso-8859-15, ascii, latin1, gbk, etc. 
> 
> >>>I think we could get away with accepting any POT file where charset has one of
> >>>the above three values (plus UTF8, ASCII) without warning, and accepting
> >>>anything else with a warning, but treating it as UTF-8 anyway.
> 
> For example, if the charset encoding metadata from the Gettext is set to gbk,
> polib will decode po file with gbk but not UTF-8, since gbk is supported by
> python. So i just want to make sure that we do want to force UTF-8 in this
> situation.

No, using the encoding from the Gettext header, or failing that the auto-detected encoding, should be fine.  

We don't need to force UTF-8 if polib can do better than that; it's just that UTF-8 seemed like a reasonable compromise.

Does polib tell you when it has to use auto-detection, rather than the Gettext header?  If so, we could generate a warning in that case.  Otherwise I don't think a warning is needed.

Comment 10 Carlos Munoz 2011-12-14 01:35:32 UTC
Added case-insensitive validation for PO Content-Type header. If the character set on this header does not have one of the following values (case-insensitive), the server will assume UTF-8 and return a warning message to the clients:
UTF8, UTF-8, CHARSET, ASCII.

Comment 11 Ding-Yi Chen 2012-01-31 07:10:00 UTC
In About Fedora/f11/About_Fedora.pot, I have changed the line:
Content-Type: application/x-xml2pot; charset=UTF-8\n
to 
Content-Type: application/x-xml2pot; charset=BBB\n

Both python client (1.3.3) and 
mvn client (API timestamp is 20120127-0946, but server API timestamp is 20120127-0944)
let this slip through.

Comment 12 Sean Flanigan 2012-02-01 00:01:53 UTC
What version of the mvn client and server were you using?  1.5-SNAPSHOT?

Comment 13 James Ni 2012-02-01 02:11:56 UTC
Hi Ding,

Yes, the Python client only shows a warning message about the wrong charset value. Do you think the Python client should stop processing? Or should we continue pushing the content but change the charset value to UTF-8? Which behaviour would you prefer?

Comment 14 Sean Flanigan 2012-02-01 02:45:08 UTC
We really should be checking/fixing the charset at the point where it is used for
reading or writing, so I've moved it to the PO classes which are used by client
and server.

The charset is now checked by the Maven client before uploading, and an
unsupported charset will cause an error.  Also, when writing gettext files, the
used encoding (UTF-8) is written into the gettext header:
https://github.com/zanata/zanata/commit/eaae69522de72b9fe9c140cb25c4cdb5c1d9dbe3
(master)

James, the original report said unsupported encoding should cause an error, so Ding is right; we should abort the push, not just log a warning.

Also when writing out PO/POT files (on pull) the Python client should modify the Content-Type charset to specify UTF-8, since that's what we are writing.  See the above commit for what I did in the Java code.

Comment 15 Sean Flanigan 2012-02-01 02:56:57 UTC
Also removed server-side check: https://github.com/zanata/zanata/commit/c16ae7af44c5e10203a1fc7f2b9615cd784dcf6f

Comment 16 James Ni 2012-02-01 05:41:30 UTC
Set the Content-Type charset to UTF-8 when writing out PO files:
https://github.com/zanata/zanata-python-client/commit/a4b37148e44afb488d8a8db18ab32d179379d6a1

Comment 17 Sean Flanigan 2012-02-02 01:10:20 UTC
Please make sure you preserve the mime type.  You should be able to use something virtually the same as I used in Java.
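
One way to change only the charset parameter while leaving the MIME type untouched is sketched below. It is not the code from either commit; the function name and regular expression are hypothetical, and the input is assumed to be the Content-Type header value without the trailing \n escape.

```python
import re

# The charset parameter and its current value, in any letter case.
CHARSET_PARAM = re.compile(r"(charset=)[^\s;]*", re.IGNORECASE)


def with_utf8_charset(content_type):
    """Replace only the charset parameter; keep the MIME type as it is."""
    new_value, count = CHARSET_PARAM.subn(r"\g<1>UTF-8", content_type)
    if count == 0:
        # No charset parameter present: append one after the MIME type.
        return content_type.rstrip("; ") + "; charset=UTF-8"
    return new_value
```

For instance, `with_utf8_charset("application/x-xml2pot; charset=BBB")` gives `"application/x-xml2pot; charset=UTF-8"`, so the xml2pot MIME type survives the rewrite.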

Comment 18 James Ni 2012-02-02 05:49:02 UTC
Thanks Sean, I have modified the Python client code to preserve the MIME type:
https://github.com/zanata/zanata-python-client/commit/607ff4a2730a9ac1eb431e140958a25f90b24f11

Comment 19 Ding-Yi Chen 2012-02-03 01:07:11 UTC
James, the program works as expected, yet it does not give the reason when the .po file declares an unsupported encoding.

Comment 20 Ding-Yi Chen 2012-02-03 01:19:55 UTC
maven client verified with 1.5-SNAPSHOT.
Note that 1.4 Snapshot will not have this fix.

Comment 21 James Ni 2012-02-03 03:04:49 UTC
Hi, Ding, 

I think Sean has modified the log to give the reason when an unsupported encoding is detected:
https://github.com/zanata/zanata-python-client/commit/b9eff21275201cfa05f8eb410dad3b8b561186e4

Could you help me verify that?

Comment 22 Ding-Yi Chen 2012-02-03 05:39:32 UTC
VERIFIED with maven: 1.5-SNAPSHOT
python:  1.3.3-11-g607f

Comment 23 Sean Flanigan 2012-02-13 01:28:50 UTC
*** Bug 694720 has been marked as a duplicate of this bug. ***

Comment 24 Pravin Satpute 2012-03-30 09:24:53 UTC
zanata-python-client-1.3.4-1.fc16.noarch

While pushing nds.po I am getting the error "error: Unsupported encoding; please change the Content-Type charset (UTF-8 recommended)"

nds.po contains "Content-Type: text/plain; charset=ISO-8859-1\n"

Comment 25 James Ni 2012-03-30 10:04:25 UTC
Hi Pravin,

Yes, the behaviour of zanata-python-client is correct. The Zanata server only supports UTF-8 encoding, so you need to change the charset in the header of the PO file to UTF-8.
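
The abort-on-push behaviour can be pictured with a sketch like the one below. The exception class, function name and accepted set are assumptions made for illustration; only the error text is taken from the client output quoted in comment 24.

```python
import re

# Assumed set of charsets that may be pushed; compared in lower case.
SUPPORTED = {"utf-8", "utf8", "ascii"}


class UnsupportedEncodingError(Exception):
    """Raised to stop a push before anything is sent (hypothetical name)."""


def check_before_push(content_type):
    """Raise if the Content-Type header declares an unsupported charset."""
    match = re.search(r"charset=([^\s;]+)", content_type, re.IGNORECASE)
    charset = match.group(1).lower() if match else ""
    if charset not in SUPPORTED:
        raise UnsupportedEncodingError(
            "Unsupported encoding; please change the Content-Type charset "
            "(UTF-8 recommended)"
        )
```

Called with "text/plain; charset=ISO-8859-1", as in nds.po, this raises before any content would be uploaded.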

Comment 26 Pravin Satpute 2012-03-30 10:42:00 UTC
No problem, I will update the encoding.

I have not played much with ISO-8859-1 and UTF-8 compatibility, but if both are 100% compatible, I think it would be better to update the encoding while pushing files to Zanata.

Instead of showing an error message, a message like "Enhanced encoding of "xyz.po" from "X" to "UTF-8" for better compatibility" would be better.

Comment 27 James Ni 2012-04-05 06:38:50 UTC
Hi Pravin

Thanks a lot for your suggestion. I just pushed version 1.3.4 of zanata-python-client to stable. I will keep the current error message and close this bug as NEXTRELEASE, but I will change the error message along the lines you suggested when I package 1.3.5.

Comment 28 Ding-Yi Chen 2012-04-20 05:20:29 UTC
I just found an example of a non-UTF-8 PO file:

ja.po in GNU tar

PO Header:
"Project-Id-Version: GNU tar 1.25\n"
"Report-Msgid-Bugs-To: bug-tar\n"
"POT-Creation-Date: 2011-03-12 11:53+0200\n"
"PO-Revision-Date: 2010-11-08 17:57+0900\n"                         
"Last-Translator: Masahito Yamaga <ma>\n"             
"Language-Team: Japanese <translation-team-ja.net>\n"        
"Language: ja\n"                 
"MIME-Version: 1.0\n"  
"Content-Type: text/plain; charset=EUC-JP\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=0;\n"                           
============================================

It is easy, though, to convert it to a UTF-8 PO file with the following command:

cp ja.po ja.po.orig && iconv -f EUC-JP -t UTF-8 -o ja.po ja.po.orig && sed -i -e 's/charset=EUC-JP/charset=UTF-8/' ja.po
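
The same three steps (backup, re-encode, fix the header) can also be done in Python, which may help where iconv or sed is not available. This is a rough standard-library equivalent with a hypothetical function name; it assumes the header spells the charset exactly as passed in.

```python
import shutil


def convert_po_to_utf8(path, source_encoding):
    """Back up path, then rewrite it as UTF-8 with a matching charset header."""
    # Keep the original next to the file, like the cp step.
    shutil.copyfile(path, path + ".orig")
    # Decode with the old encoding, like iconv -f.
    with open(path, "rb") as handle:
        text = handle.read().decode(source_encoding)
    # Update the declared charset, like the sed step.
    text = text.replace("charset=" + source_encoding, "charset=UTF-8")
    # Write the result back as UTF-8, like iconv -t.
    with open(path, "wb") as handle:
        handle.write(text.encode("utf-8"))
```

Usage would be `convert_po_to_utf8("ja.po", "EUC-JP")`. The file is read completely before it is rewritten, so a decoding error leaves ja.po unchanged.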

