Bug 1032340 (Client_pull_non_utf8)

Summary: RFE: Client should be able to pull non utf8 encoding .po files
Product: [Retired] Zanata
Reporter: Ding-Yi Chen <dchen>
Component: Component-Maven
Assignee: Michelle Kim <mkim>
Status: CLOSED UPSTREAM
QA Contact: Zanata-QA Mailing List <zanata-qa>
Severity: low
Priority: unspecified
Version: 3.1
CC: mfabian, sflaniga, zanata-bugs
Keywords: screened
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2015-07-29 03:28:17 UTC
Type: Bug
Bug Depends On: 1032333

Description Ding-Yi Chen 2013-11-20 02:12:52 UTC
Description of problem:
Some existing projects have .po files that are not UTF-8 encoded.
For example, ja.po in GNU Tar 1.26 is in EUC-JP.

Some systems do not use a UTF-8 locale, so Zanata should be able to output such files in their original encoding.


Version-Release number of selected component (if applicable):
org.zanata:zanata-maven-plugin:3.1.2

How reproducible:
Always

Steps to Reproduce:
1. Create a project and version for GNU Tar
2. mvn -e zanata:push
3. mvn -e zanata:push -Dzanata.pushType=trans -Dzanata.locales=ja
4. mvn -e zanata:pull -Dzanata.pullType=trans -Dzanata.locales=ja


Expected results:
ja.po should be in charset=EUC-JP


Additional info:

Comment 1 Sean Flanigan 2013-11-25 00:52:12 UTC
Why is it so important to output with the original encoding?  The Gettext tools (e.g. msgfmt, which converts .po to .mo) can handle UTF-8 PO files, as long as they are encoded correctly and carry the correct header.
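
As a rough illustration of "encoded correctly, with the correct header", the sketch below checks whether the bytes of a PO file can be decoded with the charset its header declares. The function names are hypothetical and this is not how msgfmt itself validates files; it only assumes the usual `Content-Type: text/plain; charset=...` header line.

```python
import re
from typing import Optional

# The PO header stores the charset in a line such as
# "Content-Type: text/plain; charset=EUC-JP\n"
CHARSET_RE = re.compile(rb"charset=([A-Za-z0-9_.:-]+)")


def declared_charset(raw: bytes) -> Optional[str]:
    """Return the charset named in the PO header, or None if absent."""
    match = CHARSET_RE.search(raw)
    return match.group(1).decode("ascii") if match else None


def header_matches_bytes(raw: bytes) -> bool:
    """True if the file decodes cleanly with its own declared charset."""
    charset = declared_charset(raw)
    if charset is None:
        return False
    try:
        raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name, or bytes that are invalid in that charset
        return False
    return True
```

This is a necessary check rather than a sufficient one: a single-byte charset such as ISO-8859-1 accepts any byte sequence, so a clean decode does not prove the declaration is right.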

Comment 2 Ding-Yi Chen 2013-11-25 02:05:29 UTC
But that does not help translators who work offline with a non-UTF-8 encoding.

That said, glibc should be able to display UTF-8 .mo files properly on, say,
ja_JP.eucjp systems.

I have therefore downgraded the severity to "low".

Comment 3 Jens Petersen 2013-11-25 02:21:32 UTC
I guess no current projects in zanata use non-utf8 encodings.

But at least in the open source world there are still projects
using non-utf8 po files so it would be nice to support it.
Forcing upstream projects to change their encoding unilaterally
seems a bit heavy handed, even if it brings them into the 21st Century. :)

How it should be best handled I don't know... if not in the server
then perhaps the client could convert the encoding if it could refer
to the original files locally?  Tricky perhaps?
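
A minimal sketch of the client-side idea above, assuming the client still has the original file locally and that the pulled file is UTF-8: read the charset from the original's header, rewrite the header of the pulled file, and re-encode. The function names are hypothetical; nothing here is part of the Zanata client.

```python
import re

# Matches the charset declaration in text form, keeping the "charset=" prefix
CHARSET_TEXT_RE = re.compile(r"(charset=)([A-Za-z0-9_.:-]+)")


def charset_of_original(original_raw: bytes) -> str:
    """Read the charset declared in the local original; default to UTF-8."""
    match = re.search(rb"charset=([A-Za-z0-9_.:-]+)", original_raw)
    return match.group(1).decode("ascii") if match else "UTF-8"


def reencode_pulled_po(pulled_utf8: bytes, target_charset: str) -> bytes:
    """Convert a UTF-8 PO file to target_charset and update its header."""
    text = pulled_utf8.decode("utf-8")
    # Only the first declaration (the header entry) is rewritten
    text = CHARSET_TEXT_RE.sub(
        lambda m: m.group(1) + target_charset, text, count=1
    )
    # Raises UnicodeEncodeError if a translation uses characters
    # that the target charset cannot represent
    return text.encode(target_charset)
```

One tricky part is visible in the last line: a translation entered through a Unicode-aware editor may contain characters that EUC-JP or Shift JIS cannot represent, so the conversion can fail and the client would need a policy for that case.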

Comment 4 Sean Flanigan 2013-11-25 07:07:38 UTC
(In reply to Jens Petersen from comment #3)
> I guess no current projects in zanata use non-utf8 encodings.
> 
> But at least in the open source world there are still projects
> using non-utf8 po files so it would be nice to support it.
> Forcing upstream projects to change their encoding unilaterally
> seems a bit heavy handed, even if it brings them into the 21st Century. :)

Note that this bug is only about what encoding Zanata generates on output.  The question of whether Zanata can parse a Shift JIS file (for instance) is bug 1032333.

So, just to clarify, would this affect anyone other than a translator who:

1. Uses a non-Unicode locale for the operating system and
2. Edits .po files using a plain text editor (or other plain text tools) which uses the default, non-Unicode locale?

Any gettext-aware editor (like Lokalize) or gettext-aware tool (like gettext-tools or translate toolkit) would use the encoding declared in the header.  (Unless they have a shortcoming like our bug 1032333, I suppose.)

However, ordinary text processing tools would probably use the platform default encoding, in which case the difference between UTF-8 and Shift JIS would matter.
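
The difference between the two kinds of tool can be sketched as follows. Both functions are hypothetical stand-ins: the first mimics a gettext-aware tool that trusts the header, the second a plain text tool that decodes with whatever the platform default happens to be (passed in explicitly here rather than read from the locale).

```python
import re


def decode_with_header(raw: bytes) -> str:
    """Decode using the charset declared in the PO header (UTF-8 if none)."""
    match = re.search(rb"charset=([A-Za-z0-9_.:-]+)", raw)
    charset = match.group(1).decode("ascii") if match else "utf-8"
    return raw.decode(charset)


def decode_with_platform_default(raw: bytes, platform_encoding: str) -> str:
    """Decode while ignoring the header, as a plain text tool would.

    Undecodable bytes are replaced rather than raising, which is roughly
    what a user sees as garbled text in an editor.
    """
    return raw.decode(platform_encoding, errors="replace")
```

With a UTF-8 file on a Shift JIS system, the first function recovers the translations while the second yields garbled text for every non-ASCII string, which is the case the question above is trying to isolate.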

Comment 5 Sean Flanigan 2013-11-25 07:14:20 UTC
(In reply to Jens Petersen from comment #3)
> How it should be best handled I don't know... if not in the server
> then perhaps the client could convert the encoding if it could refer
> to the original files locally?  Tricky perhaps?

One problem would be how to decide whether to output Unicode or something else.  If the file was originally imported to Zanata from a non-Unicode file, we could record that fact and then output using the same encoding.  

But perhaps we should only use the original encoding if the user requests it, in which case we need another generation option for the pull command (or the PO download page).  And more options need more testing...

After all, letting Zanata normalise from random prehistoric encodings into Unicode could be a good thing in many cases.
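
The decision discussed in this comment could look roughly like the sketch below: record the encoding at import time, default to UTF-8 on output, and use the recorded encoding only when the user asks for it. `DocumentRecord` and the `keep_original` flag are hypothetical names for illustration; no such pull option exists in the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentRecord:
    """Hypothetical per-document metadata stored at import time."""
    name: str
    # Charset of the file as originally pushed; None if it was Unicode
    # or if nothing was recorded
    original_charset: Optional[str] = None


def output_charset(doc: DocumentRecord, keep_original: bool) -> str:
    """Pick the charset for generated output.

    UTF-8 stays the default so that files are normalised unless the
    user explicitly opts in to the original encoding.
    """
    if keep_original and doc.original_charset:
        return doc.original_charset
    return "UTF-8"
```

Keeping UTF-8 as the default means existing users see no change, and the extra option only adds test cases for the opt-in path.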

Comment 7 Damian Jansen 2015-07-14 00:20:19 UTC
Reassigned to PM

Comment 8 Zanata Migrator 2015-07-29 03:28:17 UTC
Migrated; check JIRA for bug status: http://zanata.atlassian.net/browse/ZNTA-281