Bug 225576 - RFE: handle non-ascii filenames in archive properly
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: unzip
Version: rawhide
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Petr Stodulka
QA Contact: Ben Levenson
Reported: 2007-01-31 13:47 UTC by Dmitry Butskoy
Modified: 2015-11-27 09:33 UTC (History)
Doc Type: Bug Fix
Last Closed: 2015-11-25 11:17:39 UTC

Attachments
support for non-latin1 filenames in archive (5.92 KB, patch)
2007-01-31 13:47 UTC, Dmitry Butskoy
sample output with unzip on archive with CP737-encoded filenames (9.31 KB, text/plain)
2007-06-19 08:03 UTC, Ariel T. Glenn
support for not-latin1 filenames for version 6.0 (5.87 KB, patch)
2011-07-19 16:33 UTC, Dmitry Butskoy
The patch (hypothetical) for 6.10b (2.79 KB, patch)
2011-07-25 16:06 UTC, Dmitry Butskoy
screenshot of File Roller (30.03 KB, image/png)
2015-06-01 21:09 UTC, Mikhail
screenshot 1 (161.19 KB, image/png)
2015-11-23 15:17 UTC, Mikhail
screenshot 2 (114.33 KB, image/png)
2015-11-23 15:17 UTC, Mikhail
screenshot 3 (158.57 KB, image/png)
2015-11-23 16:35 UTC, Mikhail
archive for testing (292.00 KB, application/zip)
2015-11-23 19:54 UTC, Mikhail
fix print issue (12.61 KB, patch)
2015-11-25 11:14 UTC, Petr Stodulka
screencast (4.17 MB, application/ogg)
2015-11-25 14:25 UTC, Mikhail

Description Dmitry Butskoy 2007-01-31 13:47:01 UTC
Users very often put files into archives whose names contain non-ASCII
characters, i.e. they just name the file in their local language and then zip it.

Currently, "unzip" assumes that the so-called "OEM encoding" is always CP850 and
that the locale's encoding is CP1252. (Unless, of course, the zip archive was
created under Linux; in that case the names seem to be stored transparently.)

Because of this, to obtain the correct filename under, say, a Russian locale, we
need to run:  "unzip -l file.zip | iconv -f CP1252 -t CP850 | iconv -f CP866",
because CP866 is what "Russian" win32 locales actually use when writing zip files.
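
To make the byte-level effect of that pipeline concrete, here is a minimal Python sketch. It only models the conversions named above; the helper names and the sample file name are made up for illustration and are not part of unzip, whose real code path is C.

```python
# Sketch of the mangling described above and of the iconv pipeline that undoes it.
# Helper names are illustrative assumptions, not unzip functions.

def unzip_mangle(stored: bytes) -> bytes:
    """Model of the current behaviour: stored bytes are read as CP850
    and written out as CP1252."""
    return stored.decode("cp850").encode("cp1252")


def undo_mangle(printed: bytes, real_oem: str = "cp866") -> str:
    """Equivalent of: iconv -f CP1252 -t CP850 | iconv -f <real_oem>."""
    original_bytes = printed.decode("cp1252").encode("cp850")
    return original_bytes.decode(real_oem)


# Name as a Russian win32 zipper would store it (sample value for the sketch).
stored_name = "Черта".encode("cp866")
printed_name = unzip_mangle(stored_name)      # Latin-looking garbage bytes
recovered = undo_mangle(printed_name)         # back to the Cyrillic name
```

The round trip works here because every CP850 character produced from Cyrillic CP866 letters also exists in CP1252; for bytes that map to CP850 box-drawing characters the first step would fail, which is one more reason the fixed CP850-->CP1252 assumption is fragile.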

Therefore, instead of the "CP850-->CP1252" conversion, which suits latin1 users
only, we need a more intelligent approach.

I've created a patch which tries to solve this issue. The idea is to inspect the
current locale (under which "unzip" was invoked) and to determine the actual
conversion needed.

Obtaining the target encoding is trivial (using nl_langinfo(3) and friends).
To guess which "OEM" encoding was actually used by the win32 system, we inspect the
language/country part of the LC_ALL/LANG environment and derive the needed CPxxx
from it. There are two ways to do this: either just use a table (i.e.
"en"-->"CP850", "ru"-->"CP866", "ja"-->"CP932" etc.), or first get the "non-UTF"
encoding for this locale and use a table with "ISO-8859-1"-->"CP850",
"ISO-8859-5"-->"CP866", "EUC-JP"-->"CP932" ... Currently I prefer the second way.

The patch attached was successfully tested under ru_RU locales.
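
For readers who want to see the second (two-step) lookup spelled out, the following Python sketch shows the idea. Both tables are tiny illustrative excerpts built from the pairs mentioned above; they are assumptions for the example, not the tables from the attached patch, and the function name is hypothetical.

```python
# Two-step guess: language -> legacy (non-UTF) charset -> OEM code page.
# Tables are illustrative excerpts only, not the patch's real tables.
LANG_TO_LEGACY = {
    "en": "ISO-8859-1",
    "ru": "ISO-8859-5",
    "ja": "EUC-JP",
}

LEGACY_TO_OEM = {
    "ISO-8859-1": "CP850",
    "ISO-8859-5": "CP866",
    "EUC-JP": "CP932",
}


def guess_oem(locale_name: str, default: str = "CP850") -> str:
    """Guess the OEM code page from a locale name such as 'ru_RU.UTF-8'."""
    # Drop the codeset (".UTF-8"), the modifier ("@euro") and the country ("_RU").
    lang = locale_name.split(".")[0].split("@")[0].split("_")[0].lower()
    legacy = LANG_TO_LEGACY.get(lang)
    # Unknown languages fall back to the historical CP850 assumption.
    return LEGACY_TO_OEM.get(legacy, default)
```

For example, guess_oem("ru_RU.UTF-8") yields "CP866", while an unknown locale such as "C" keeps the old CP850 behaviour.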

Please note that I'm ready to do any further work on this if needed.

Comment 1 Dmitry Butskoy 2007-01-31 13:47:01 UTC
Created attachment 147018
support for non-latin1 filenames in archive

Comment 2 Dmitry Butskoy 2007-01-31 13:51:13 UTC
Of course, the basic idea rests on the assumption that the Fedora host and the
zip-creator host are both in the same country... :) 

Comment 3 Ariel T. Glenn 2007-06-19 07:42:31 UTC
I have a related encoding issue; filenames are encoded in CP737; the function
Ext_ASCII_TO_Native() mangles the name so that when it's printed out (unzip -l)
or used to create the file, the name is not in any valid encoding anymore, and
so convmv can't be used on it, etc.

If I comment out the function call then I can run convmv -f CP737 -t utf8 (my
locale: el_GR.UTF8) and it works fine.  This patch unfortunately does not work
for me (applied to unzip-5.52-4.fc7.src.rpm).  It produces unrecognizable output.

Attachment follows with sample output...

Comment 4 Ariel T. Glenn 2007-06-19 08:03:33 UTC
Created attachment 157351
sample output with unzip on archive with CP737-encoded filenames

four runs.  in order: 

output from /usr/bin/unzip as shipped;
output from unzip with Dmitry's patch; 
output from unzip with Ext_ASCII_TO_Native() call commented out;
output from unzip with the call commented out, piped to iconv, successful.

maybe we could have a switch to allow the user to turn that call off; that's
better than the present state of affairs (having to run WinZip under wine, or
capture the list of names with zipnote and rename them manually).

Comment 5 Dmitry Butskoy 2007-06-20 14:28:36 UTC
It seems that my patch is just a bit incomplete.

Currently, for your locale ("el_GR") the patch determines the non-UTF charmap as
"ISO-8859-7" and chooses the corresponding "CP869", whereas you need "CP737".

It would be nice if you could write a bit more about ISO-8859-7 and el_GR, and
I'll change my patch according to your information.

Anyway, it is impossible to handle all locales at once -- I simply don't have
all the needed info about them. I've tried to handle the ones I know (presumably
the most widely used); then people will report bugs (like this one) and I or the
"unzip" maintainer will add support for new locales too.

BTW, maybe you know some resource where all the "xx_XX --> CPxxx" maps can be
found?


Comment 6 Ariel T. Glenn 2007-06-20 21:13:03 UTC
The problem is that sometimes I will want cp737 -> utf8, and sometimes iso8859-7
-> utf-8, and sometimes something else, depending on the source of the files. 
For that matter, unzip should not assume (without an option to override) that the
files to be unzipped necessarily map to the user's locale.  If you really want
to do encoding conversion, let the user specify it from the command line, as with
iconv and convmv.  Perhaps code could be borrowed from iconv.

I don't know of a good list of cp -> utf-8.  But a couple sources that might be
useful to you are here:
http://nlso-objects.sourceforge.net/languagedata.php?pageindex=5
and the comprehensive iana list is here: 
http://www.iana.org/assignments/character-sets

But let me make one more pitch for *not converting*; we already have tools that
will do conversion after the fact, something that will work for all users right
out of the box.


Comment 7 Dmitry Butskoy 2007-06-21 17:25:58 UTC
> The problem is that sometimes I will want cp737 -> utf8, and sometimes
> iso8859-7 -> utf-8, and sometimes something else,
> depending on the source of the files. 

AFAIK the Zip file standard is only capable of supporting a single language at 
a time, by using a single OEM code page for it. Now I've found a list of OEM
code pages:

        ' 437 OEM - United States 
        ' 737 OEM - Greek (formerly 437G) 
        ' 775 OEM - Baltic
        ' 850 OEM - Multilingual Latin I 
        ' 852 OEM - Latin II 
        ' 855 OEM - Cyrillic (primarily Russian) 
        ' 857 OEM - Turkish
        ' 858 OEM - Multilingual Latin I + Euro symbol 
        ' 860 OEM - Portuguese
        ' 861 OEM - Icelandic
        ' 862 OEM - Hebrew
        ' 863 OEM - Canadian - French
        ' 864 OEM - Arabic
        ' 865 OEM - Nordic
        ' 866 OEM - Russian
        ' 869 OEM - Modern Greek 
        ' 874 ANSI/OEM - Thai (an extension of TIS-620 / ISO 8859-11) 
        ' 932 ANSI/OEM - Japanese, Shift-JIS 
        ' 936 ANSI/OEM - Simplified Chinese (PRC, Singapore) 
        ' 949 ANSI/OEM - Korean (Unified Hangeul Code) 
        ' 950 ANSI/OEM - Traditional Chinese (Taiwan; Hong Kong SAR, PRC)

It seems that when you create a zip file, all filenames are stored in one of
these encodings.

The problem is that there are two "Greek" code pages: CP737 and CP869. Is there
some way to guess which one at run-time?

> we already have tools that will do conversion after the fact

From the command line -- yes. But the main reason for writing this patch is
wrong filenames in the file-roller window. (Some sites provide information to
users only as zip files; when a user follows such a link, the browser downloads
the zip file, then file-roller is invoked, which invokes unzip.)

Unfortunately it is impossible to patch all the applications that can invoke
unzip (either to handle the encoding themselves or to invoke unzip with some
"new" cmdline options), hence we should solve this issue in unzip itself.

But maybe some environment variable could help? (i.e. put UNZIP_OEM=CP737
somewhere in /etc/profile.d/* ...) And do nothing if this variable is not set...
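
A minimal Python sketch of how such an opt-in variable could behave follows. UNZIP_OEM is only the name proposed in this comment (no released unzip reads it as far as this thread shows), the function names are hypothetical, and a UTF-8 target is assumed just to keep the example short.

```python
import codecs
import os


def oem_from_env(environ=None):
    """Return the code page named by the proposed UNZIP_OEM variable, or None."""
    environ = os.environ if environ is None else environ
    name = environ.get("UNZIP_OEM", "").strip()
    if not name:
        return None                 # variable not set: do nothing
    try:
        codecs.lookup(name)         # ignore names we cannot convert from
    except LookupError:
        return None
    return name


def convert_name(raw: bytes, environ=None) -> bytes:
    """Recode a stored file name to UTF-8 only when UNZIP_OEM asks for it."""
    oem = oem_from_env(environ)
    if oem is None:
        return raw                  # untouched, as suggested above
    return raw.decode(oem, errors="replace").encode("utf-8")
```

With the variable unset the bytes pass through unchanged, so existing callers such as file-roller would see no difference until an administrator opts in.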

Comment 8 Ariel T. Glenn 2007-06-22 00:11:43 UTC
Even if fileroller can't utilize them, I think unzip should have command line
switches.  For file roller and other apps, it would be nice if unzip had a
default behavior.  The problem is that there isn't a good choice of default. 
All of cp737, cp869, and iso8859-7 are different encodings.  I guess you could
test for the existence of characters that are in the unpermitted range of one of
these and rule it out that way, but that will only work in some of the cases.
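
The elimination test described above can be sketched in Python as follows. It relies on Python's own codec tables for the three encodings, and additionally treats control characters as implausible in a file name, which is an assumption of this sketch rather than something proposed in the thread.

```python
CANDIDATES = ("cp737", "cp869", "iso8859_7")


def plausible_encodings(raw: bytes, candidates=CANDIDATES):
    """Return the candidate encodings that cannot be ruled out for raw."""
    survivors = []
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue                # a byte is unassigned in this code page
        if text.isprintable():      # control characters: unlikely in a name
            survivors.append(encoding)
    return survivors
```

Bytes 0x80 0x81 are capital alpha and beta in CP737 but unassigned in CP869 and control codes in ISO-8859-7, so only CP737 survives; a plain ASCII name, on the other hand, is valid in all three and nothing can be ruled out, which is the "only in some of the cases" limitation.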

Were you thinking of providing a way for the user to select the encoding from a
list in the fileroller gui, and then set the environment var based on that?


Comment 9 Dmitry Butskoy 2007-06-22 11:57:08 UTC
> I think unzip should have command line switches.
I agree, but that is a task for the upstream unzip team. Otherwise different
distros will have different flags for this, as usual :(

> The problem is that there isn't a good choice of default. 
In the case of Greek or for other locales too? (Just to decide whether some
"auto-guessing" could be useful at least for some wide-spread locales).

Comment 10 Simos Xenitellis 2007-06-22 15:43:28 UTC
First of all, a core difference between Windows and Linux is that in Windows you
can configure a basic legacy encoding; when a filename or text is not valid
UTF-16, it is assumed to be of that legacy 8-bit encoding and auto-converted.
Due to this, people end up with content of the legacy encoding.
I do not know whether it would make sense to replicate this functionality. It
should solve a big part of the migration issues. I think for the implementation
of this, it would require some changes in glibc (???). I am not sure if WinZip
works because it is smart about encodings or because it benefits from the basic
legacy setting in Windows. It appears 7zip for Windows cannot automatically fix
the encoding, but this problem might not be related (perhaps 7zip forces utf-8,
bypassing any autoconversion).

Secondly, the issue we are trying to fix with zip filenames is similar to the
ID3 tags in music files. Here as well, 8-bit encodings are being used to create
content. In addition to this, CDDB databases store song metadata in a variety of
encodings and do not try to keep sanity of the encoding (they can't solve the
problem). For example see
http://bugzilla.mugshot.org/show_bug.cgi?id=724
The way this should be solved is at the point of entry of the text of dubious
encoding.

The way forward that I see is to use a library that deals with the issue of
autoconverting text fragments that are not valid UTF-8 sequences to proper UTF-8
text. Apparently, such a library exists,
http://trific.ath.cx/software/enca/

I think that unzip should indeed have an additional command line option that one
can use to invoke the "autofixing" of the encoding, if required. If unzip finds
the "libenca" library, it will request to fix the filename.
Obviously, normal users of unzip that are not aware of the parameters will not
be affected.

file-roller, the KDE counterpart, etc, would need to make a small change in the
code, to add the extra command line parameter for unzip.

In general I see more complaints on ID3 tags with the wrong encoding than bad
Zip filenames, prompting for a generic library solution.

If it is too much effort to go for the library solution (or it's difficult to
get the unzip developers to add the option), I think it would be ok to keep the
cut-down solution proposed here.

Comment 11 Dmitry Butskoy 2007-06-25 15:33:02 UTC
"Enca" library does not work well on very short text fragments (i.e., short
filenames). A 4- or 5-byte filename may well be converted wrongly.

Unzip supports the "UNZIP" environment variable, which allows specifying
additional command line options. This way no modifications are needed for file
roller and friends, but the unzip authors should implement an option anyway...

Once the option appears, we just need an /etc/profile.d/unzip.sh profile
which specifies something like UNZIP="-<option> CP737", or even computes the
appropriate codepage from the current locale settings.

Any thoughts?

Comment 12 Simos Xenitellis 2007-06-25 16:17:30 UTC
Enca, provided that its developers welcome the idea, can be used as a basis for
the purposes of matching the encoding. In the domain we are looking now (Zip
filenames), and future domains (ID3 tags), one can make simplifications when
matching the encoding.

Therefore, in the case of figuring out the encoding of a small text fragment,
the library would need to take into account the locale of the system.

If the ZIP file has several files in it, then one option would be to delegate
the task of encoding detection to a level higher (file-roller application),
because file-roller can extract all file names first and detect the correct
encoding. 

In any case, the course of direction to sort out this issue depends on the
developer(s) that will undertake the task. Depending on resources, the priority
task could be to cover the biggest chunk of encoding problems, such as the
legacy encodings used in Windows.

Would it make sense to write up a blueprint on this?

Comment 13 Dmitry Butskoy 2007-06-26 12:47:38 UTC
> in the case of figuring out the encoding of a small text fragment,
> the library would need to take into account the locale of the system.

But this way enca itself does not seem to be needed -- for most locales, we
can determine the appropriate codepage just from the locale of the system. It is
exactly what I do in the patch...

But in general, hinting the language to enca does not help much -- see
"enca -l languages". E.g. for Russian, a short filename can be detected as
either IBM866, or CP1251, or KOI8-R ...
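
The ambiguity is easy to reproduce. The Python sketch below decodes one four-byte name under the three Russian candidates; the sample bytes and the function name are chosen for this illustration only.

```python
RUSSIAN_CANDIDATES = ("cp866", "cp1251", "koi8_r")


def letter_only_readings(raw: bytes):
    """Decode raw under each candidate; keep readings made only of letters."""
    readings = {}
    for encoding in RUSSIAN_CANDIDATES:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        if text.isalpha():
            readings[encoding] = text
    return readings


# Four bytes that spell a real word if the source was CP1251.
sample = bytes([0xF2, 0xE5, 0xF1, 0xF2])
readings = letter_only_readings(sample)
```

All three encodings turn these bytes into four Cyrillic letters, each a different string, so validity alone cannot pick the right one; only statistics over longer text could, and file names are rarely long enough.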

> Would it make sense to write up a blueprint on this?
Perhaps. What do you mean exactly?


Comment 14 Simos Xenitellis 2007-06-26 15:07:36 UTC
> > Would it make sense to write up a blueprint on this?
> Perhaps. What do you mean exactly?

Apparently I meant specification. Sorry. A "blueprint" is launchpad lingo and I
created such a thing to follow the progress at 
https://blueprints.launchpad.net/unzip/+spec/unzip-detect-filename-encoding
Please see the references for some extra information.

What's missing is some Wiki page that will host the specification. Since the
initiative is from this community, it should be good to host on
http://fedoraproject.org/wiki/
What's a good place to put such a page with the specification?
Please start the page and I'll help out with the specification.
If there is no such suitable location, please tell me so we can find some other
location.

The specification should include
1. We put the change in "unzip" and not the applications that make use of
"unzip" because unzip is the source of the problem.
2. We do not make the autodetection of the encoding of the filenames a default
feature (at least not yet), so that all those utilities that use unzip will
continue to exhibit the same expected behaviour
3. We add a command line option to unzip that would enable the attempt to
autoconvert the filename encoding to UTF-8.
4. Tools such as file-roller would need a simple patch; if the system encoding
is UTF-8, use the special command line option in unzip when converting filenames.
5. We need to contact the makers of unzip et al, at
http://www.info-zip.org/pub/infozip/ Without their agreement, we cannot push
this upstream, and we are stuck. In an extreme case, we might have to maintain a
distribution-specific patch.
6. The code that does the autodetection should be "portable" (easy to
self-compile on its own) so that it can be used with test cases that we are
going to make. We do not know yet what are the common initial encodings that
will work with most languages.

We shall advertise the specification to interested parties and may get some
input, then implement. Currently you are happy to implement this; please tell me
if the process looks too slow or you want to change something.

Comment 15 Dmitry Butskoy 2007-06-26 15:16:45 UTC
Simos,

Could you please repeat all your thoughts on fedora-devel-list?
I think this discussion should be moved there, so that more people can participate.

Comment 16 Simos Xenitellis 2007-06-26 19:16:14 UTC
Dmitry, while writing the blueprint I found that AltLinux has a similar patch,
which is now also used in Ubuntu,
https://bugs.launchpad.net/debian/+source/unzip/+bug/10979

Can you have a look at it and provide some comments?


Comment 17 Ariel T. Glenn 2007-06-26 19:51:49 UTC
Just a note that I am happy to put some coding time into this, with the caveat
that I cannot do cross-platform work, only code and testing on linux platforms.

I'd want to see two sets of command line options though: one would be the
"autoattempt" switch, and the other would allow the user to fully specify which
codepage/character set the filenames are likely in, in case the automated guess
fails. 



Comment 18 Simos Xenitellis 2007-06-26 21:06:22 UTC
@Ariel: By all means, go ahead and try out what you have in mind. I think we
have covered all cases of prior work on this. It is nice to have a good look at
the patch that has been added in Ubuntu.

Comment 19 Dmitry Butskoy 2007-06-27 10:28:07 UTC
for comment #16 :
> AltLinux has a similar patch,

I know about it. This patch adds two options "-I" and "-O" to specify "ISO" and
"OEM" encodings respectively. But I doubt whether it is useful for end users.

The destination charset can be determined from the current locale by
"nl_langinfo(CODESET)" (It might be either UTF8 or some legacy 8bit encoding).
Then we need only one option, to specify the source charset (both OEM and ISO --
just cause unzip to recode filenames properly).
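
As a rough illustration of the "destination from the locale, source from one option" idea, here is a Python sketch for Unix-like systems. Python's locale.nl_langinfo(locale.CODESET) wraps the same C call mentioned above; the function names and the errors="replace" policy are assumptions of the sketch.

```python
import locale


def destination_charset() -> str:
    """Charset of the current locale, as reported by nl_langinfo(CODESET)."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")   # adopt LC_ALL / LC_CTYPE / LANG
    except locale.Error:
        pass                                    # unsupported locale: stay in "C"
    return locale.nl_langinfo(locale.CODESET)


def recode_name(raw: bytes, source_charset: str, dest_charset=None) -> bytes:
    """Recode a stored name from the user-given source charset to the locale's."""
    dest = dest_charset or destination_charset()
    # Characters the destination cannot represent become "?" instead of failing.
    return raw.decode(source_charset).encode(dest, errors="replace")
```

Only source_charset has to come from the user (or from a guess); the destination never needs an option of its own.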

for comment #17 :
> I'd want to see two sets of command line options though
The autodetection could be selected simply by using "auto" in place of the
encoding name (e.g. "-F CP737", "-F CP866", "-F auto" etc.)

Anyway, to avoid a mess of extra options across different distributions, the
choice of the new option(s) must be coordinated with upstream...

BTW, the current unzip beta (unzip60c) has the ability to recode text files
(using iconv()), see the "-a" option. It seems that unzip upstream is ready to
implement filename recoding as well.


Comment 20 Andrew Zabolotny 2007-11-04 11:47:01 UTC
One more note, there are two environment variables that affect the
interpretation of the file name encoding on current system:

G_FILENAME_ENCODING.   This environment variable can be set to a comma-separated
list of character set names. GLib assumes that filenames are encoded in the
first character set from that list rather than in UTF-8. The special token
"@locale" can be used to specify the character set for the current locale.

G_BROKEN_FILENAMES.  If this environment variable is set, GLib assumes that
filenames are in the locale encoding rather than in UTF-8. G_FILENAME_ENCODING
takes priority over G_BROKEN_FILENAMES.
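
The precedence rules quoted above fit in a few lines. This Python sketch implements only what the two descriptions say (first entry of the list, "@locale", and the priority of G_FILENAME_ENCODING); the function name is hypothetical and the locale charset is passed in by the caller.

```python
def filename_charset(environ, locale_charset: str) -> str:
    """Charset to assume for on-disk file names, following the GLib rules above."""
    value = environ.get("G_FILENAME_ENCODING")
    if value:
        first = value.split(",")[0].strip()     # only the first entry matters here
        return locale_charset if first == "@locale" else first
    if "G_BROKEN_FILENAMES" in environ:
        return locale_charset                   # names are in the locale encoding
    return "UTF-8"                              # GLib's default assumption
```

If unzip honoured the same variables, the charset it writes names in would match what GTK applications expect to read back.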

I understand that this is somehow gtk-specific, but perhaps it would be good if
zip would honor these variables to avoid unsynchronized behaviour of different
programs.


Comment 21 Simos Xenitellis 2007-11-05 01:49:04 UTC
Thanks for this addition Andrew.

There is more documentation at
http://library.gnome.org/devel/glib/unstable/glib-Character-Set-Conversion.html#file-name-encodings

Per http://live.gnome.org/GuideForISVs
G_BROKEN_FILENAMES appears to have become obsolete.

Comment 22 Nadav Kavalerchik 2007-11-09 19:29:06 UTC
I have this issue with Hebrew filenames inside zip archives I get from Windows
XP systems.
But...
it appears that gunzip does decompress and open stored files with Hebrew
filenames that are encoded in iso-8859-8 (or windows-1255) correctly.


Comment 23 Nadav Kavalerchik 2007-11-09 23:00:37 UTC
i've patched unzip60c latest development sources with the attached "support for
non-latin1 filenames in archive" and successfully opened zip files that came
from Windows XP, including archived filenames that were made up of windows-1255
or iso-8859-8 characters.
I had to make sure LANG was set to he_IL.utf8.
Using the cli unzip, all went well.
Using KDE's Ark or GNOME's File-Roller I got latin1 file names inside the GUI
view _BUT_ it was extracting properly after all.

Comment 24 Nadav Kavalerchik 2007-11-09 23:44:53 UTC
little bash script to open zip files with hebrew encoded filenames that include
spaces. ( using the patched unzip60c !!! from last comment #23)

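#!/bin/bash
# Extract a zip archive into the directory that contains it.
# Needs the patched unzip60c, which picks the codepage from the locale.
export LANG=he_IL.utf8                # make sure the child unzip sees the locale
FULLNAME="$1"
PATHPART=$(dirname -- "$FULLNAME")    # "." when the archive has no path part
/usr/bin/unzip "$FULLNAME" -d "$PATHPART"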


Comment 25 gordone 2007-11-10 06:44:52 UTC
I'm one of the developers working on UnZip 6.0 and saw the mail to us.  I've
only scanned this thread quickly so far, so definitely tell me if I'm not
following something.

First, Info-ZIP has put together an extension to the Zip standard (AppNote) that
allows storing and restoring UTF-8 paths.  This allows zipping paths on, say,
Windows and unzipping them on maybe Unix.  The approach took a few months to
coordinate but WinZip and PKWARE have agreed to our approach and it is in the
latest AppNote update.  In brief, the zip converts the local paths to UTF-8 and
stores them in an extra field.  The unzip can read the UTF-8 and convert to the
local character set.  This avoids translations between character sets and
dependence on the utilities to do that.  It also is completely automatic.  Betas
Zip 3.0f (published) and UnZip 6.0d (not published yet) support this and some
testing has been done.  Note that beta Zip 3.0g with mostly minor bug fixes is
close to posting and Zip 3.0 is getting close to going out the door, but UnZip
may have a little work left.
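
For illustration, reading that extra field could look like the Python sketch below. The layout used here (header ID 0x7075, one version byte, a CRC-32 of the ordinary stored name, then the UTF-8 name) is how the AppNote describes the Info-ZIP Unicode Path Extra Field as far as I recall it, so please treat the constants as assumptions to be checked against the AppNote; the function name is made up.

```python
import struct
import zlib

UPATH_ID = 0x7075   # Info-ZIP Unicode Path Extra Field ("up"), per the AppNote


def unicode_path(extra: bytes, stored_name: bytes):
    """Return the UTF-8 path from an entry's extra data, or None if absent/stale."""
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        data = extra[pos + 4 : pos + 4 + size]
        pos += 4 + size
        if header_id != UPATH_ID or len(data) < 5:
            continue                            # some other extra block
        version, name_crc = struct.unpack_from("<BI", data)
        # The CRC ties the UTF-8 name to the ordinary name field; a mismatch
        # means the entry was renamed by a tool unaware of this extra field.
        if version == 1 and name_crc == zlib.crc32(stored_name) & 0xFFFFFFFF:
            return data[5:].decode("utf-8")
    return None
```

In Python's zipfile module the raw extra data of an entry is available as ZipInfo.extra, which is what one would pass as the first argument.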

Second, it looks like much of this discussion may still be worth looking at for
implementation.  Non-Unicode archives and tools will continue to exist and
allowing UnZip to handle non-Unicode archives is definitely useful.  It looks
like this change only impacts UnZip, so Zip 3.0 probably can stay on schedule
for release maybe in a month.  Further, if it takes some time to get UnZip 6.0
released this patch may be all there is for now.

I've quickly scanned the attached 5.52 patch.  A question is the need to detect
if the appropriate libraries exist on the system as we support various Unix
platforms.  For UnZip 6.0 we probably can use the new configure script.  I'm not
sure how the group will go on this, decide against it, patch the old code, patch
the new code, or patch both.  It's also possible we may just post the (possibly
updated) patches but not include them in the main code.

Though UnZip 6.0d is still being worked, it may be stable enough to post public
soon, but I can't say when that will happen.  It's possible for individuals to
get access to our internal betas, but we strictly control those bug-ridden
things until they're tested enough for public posting.

Comment 26 Ivana Varekova 2008-03-25 15:53:18 UTC
This issue should be solved by the upstream maintainers, so thank you for your
work. Dmitry, could you send your patch upstream and discuss it there?

Comment 27 Dmitry Butskoy 2008-03-25 15:59:33 UTC
I don't have enough time right now to work on this further. OTOH I hope that
upstream is already aware of it...

Comment 28 xandry 2010-03-13 20:23:15 UTC
(In reply to comment #26)
> This issue should be solved by upstream maintainers so thank you for your work
> Dmitry could you send your patch to upstream and discuss it there.    

The problem is not solved. To be precise, the described problems still exist. They are solved by applying the patch. This was checked on the upstream version, unzip 6.0.

Comment 29 gordone 2010-03-14 08:21:03 UTC
Please be more specific as to what is not solved.

As the conversion between character sets (code pages) generally requires the from and to character sets (code pages) to be identified, which can be difficult to impossible in some situations, Info-ZIP chose to go with storing UTF-8 encodings that then could be converted to the destination character set without knowledge of the from character set.  This approach has been picked up by PKWare and WinZip and is reflected in the latest AppNote (Zip standard).  New archives should be created with this information stored to allow filenames (and file comments if supported) to be converted to other character sets.

However, this does not address older archives without UTF-8 information.  As far as I know, the preference in the Zip developer community has been to require users to update to newer tools that support UTF-8 filename storage rather than support storing and converting specific character set encodings.  That said, Info-ZIP has been looking at implementing this change to support older archives, but given other priorities on the queue (adding the latest compression methods, etc.) we haven't gotten to it or even come to consensus within the group to support it.  There are also issues relying on outside libraries such as iconv to perform character set conversions, though they seem workable.

So please be more specific as to what you want done.

(Also, due to other things going on lately, Info-ZIP development has gone from slow to almost nonexistent lately.  It should pick up to slow again shortly.)

Comment 30 xandry 2010-03-14 09:51:44 UTC
I simply want the unpacked archives to come out in an encoding that can actually be read. Without this patch I get diamonds instead of Cyrillic characters. 

Example:
$ unzip Tracktor\ Bowling\ -\ Черта.zip 
Archive:  Tracktor Bowling - Черта.zip
 extracting: Tracktor Bowling - �����/Tracktor Bowling - ���.mp3  
 extracting: Tracktor Bowling - �����/folder.jpg  
 extracting: Tracktor Bowling - �����/Tracktor Bowling - �����.m3u

After applying the patch:
$ unzip Tracktor\ Bowling\ -\ Черта.zip 
Archive:  Tracktor Bowling - Черта.zip
 extracting: Tracktor Bowling - Черта/Tracktor Bowling - Сны.mp3  
 extracting: Tracktor Bowling - Черта/folder.jpg  
 extracting: Tracktor Bowling - Черта/Tracktor Bowling - Черта.m3u

Alternative solution is:
convmv -f iso8859-1 -t cp850 -r --notest --nosmart .
convmv -f cp866 -t utf8 -r --notest --nosmart .

What additional information is needed?

Comment 31 Vasiliy Glazov 2010-04-06 07:43:51 UTC
Bug still not fixed.

Comment 32 Vasiliy Glazov 2010-04-06 08:47:52 UTC
We, the Russian Fedora team (https://fedoraproject.org/wiki/Ru_RU/Russian_Fedora), ask you to include the existing patch in the Fedora rpm build rather than wait for upstream.

Comment 33 Dmitry Butskoy 2011-07-19 16:33:52 UTC
Created attachment 513838
support for not-latin1 filenames for version 6.0

The same patch as of 2007, but for the current unzip-6.0

Comment 34 gordone 2011-07-19 23:51:13 UTC
Please look at the -I and -O options in the UnZip 6.10b beta.  These come from a previous patch submission.  If these do what you need, let us know.  If not, please suggest improvements or, if they are hopeless, then alternative code.

We might also be interested in other solutions if they make life easier for the user.

It looks like we're getting close to releasing Zip 3.1 and UnZip 6.10, maybe in the next couple months for Zip and maybe September for UnZip.  It would be good to get all this worked out before the final release candidate for UnZip goes out maybe late next month.  Here's your chance to impact a release that may be out shortly.  (Sorry Info-ZIP doesn't do that more often.)

Comment 35 Dmitry Butskoy 2011-07-20 12:33:43 UTC
The main problem with the '-I'/'-O' design is that users need to specify the input and output encodings *manually*. That is inconvenient in itself. Moreover, when unzip is invoked from, say, file-roller, there is no way to specify such options at all.

My patch takes another approach -- it tries to guess the codepage automatically from the current Linux locale (from the language setting).

In most cases the actual codepage corresponds to the language the user is working in. Hence we can avoid manual user intervention for specifying codesets.

Certainly the best idea would be to have both the '-I'/'-O' options (for manual setting) and an automagic fallback when '-I'/'-O' is not specified. Feel free to use my patch to implement the final solution.

Comment 36 gordone 2011-07-21 00:00:18 UTC
Do you think you can redo your patch against UnZip 6.10b?  There have been changes to those files since UnZip 6.00 and we need to focus on getting the betas ready to go out.  Your changes need to work with the -I/-O feature.

Actually I remember a user proposing yet another approach using a character recognition library.  At some point that might be the better approach, but when we looked at it a couple years ago I think the library was not generally available on many platforms and we decided to hold off on implementing that approach.

Comment 37 Dmitry Butskoy 2011-07-21 13:15:18 UTC
Well,

6.10b already has some logic for auto-detecting of OEM_CP.
Not very correct logic, but as an initial step... :)

Now, unix/unix.c:init_converstions_charsets() obtains the locale charset and then tries to determine OEM_CP from it. 

But most modern systems now use UTF-8 as locale charset. Hence, you always get "UTF-8", which is not informative at all for our needs (in the 6.10b you always use "CP866" when utf-8 etc.)

The true way is to obtain local _language_, not charset -- ie. when LANG=ru_RU.UTF-8 you get "ru_RU". Then, using some table with pairs like
"<lang> -- <typical_OEM_CP_for_this_lang>"
we should determine the most correct OEM_CP .

For example:
...
{ "en_US", "CP850" },
{ "en_GB", "CP850" },
{ "ru_RU", "CP866" },
...
and so on.
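
A small Python sketch of that lookup is shown here. It takes the locale name from the usual LC_ALL / LC_CTYPE / LANG precedence and tries the full xx_XX name before the bare language; the bare "ru" entry and the function names are assumptions added for the example, and the table is nowhere near complete.

```python
# Illustrative excerpt of a "<lang> -- <typical OEM code page>" table.
LANG_TO_OEM = {
    "en_US": "CP850",
    "ru_RU": "CP866",
    "ru": "CP866",      # assumed short-name fallback, for illustration only
}


def current_language(environ) -> str:
    """Locale name without codeset/modifier, e.g. 'ru_RU' for 'ru_RU.UTF-8'."""
    for variable in ("LC_ALL", "LC_CTYPE", "LANG"):    # standard precedence
        value = environ.get(variable)
        if value:
            return value.split(".")[0].split("@")[0]
    return "C"


def oem_for_language(language: str, table=LANG_TO_OEM):
    """Look up the full name first, then the bare language; None if unknown."""
    if language in table:
        return table[language]
    return table.get(language.split("_")[0])
```

Returning None for unknown languages leaves the decision (keep the old default, or do no conversion) to the caller.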

I can provide an initial patch for this (if it is needed), but for the full "lang<-->OEM" table some additional research should be done by somebody else.

> Actually I remember a user proposing yet another approach using a character
> recognition library.
Yes, there is "enca" library.

But for correct character recognition the inspected data should be long enough (several tens of bytes) and should consist of proper phrases in the language (for the statistical analysis). In practice, file names are often short (e.g. one word of a few bytes) and are often acronyms. Hence character recognition does not seem applicable here.

Comment 38 gordone 2011-07-23 03:53:11 UTC
Anything you can give us to start with should save us time, and it really comes down to selecting what we can do in the time we get, so saving us time puts this higher on the list.  Also, you probably know more about this than me, though that may change by the time the patch you provide is researched, wired in, tested, and documented.  Oh, any additional description of the patch's operation would be helpful.

Feel free to correct any issues with the current -I/-O code as well, if you can.

Would appreciate you doing what you can, including creating any needed tables and putting any mappings you know.  I suspect once the framework is there others will want to add mappings to it.

Agree that character recognition using relatively short file paths probably would not be reliable enough to be useful.

Note that UTF-8 is still the current zip community approach to converting paths.  Specifically, most modern zip utilities tend to either (1) store the UTF-8 path directly (and should then set the new UTF-8 path flag added a few years ago to the zip standard) or (2)

Comment 39 gordone 2011-07-23 03:58:17 UTC
Not sure what happened there.  Continuing...

or (2) store the local path and use the UTF-8 extra fields.  This approach is more for dealing with less capable utilities and older archives.
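
As a rough illustration of the two markers mentioned here, the sketch below tests for them in an entry's header data. The constants are taken from the ZIP format specification (general purpose bit 11 and the Info-ZIP Unicode Path extra field ID 0x7075); the helper names `zip_name_is_utf8` and `zip_has_unicode_path` are hypothetical and are not part of UnZip:

```c
#include <stddef.h>
#include <stdint.h>

/* Values from the ZIP format specification (APPNOTE). */
#define ZIP_GPBF_UTF8      0x0800u /* general purpose bit 11: name/comment are UTF-8 */
#define ZIP_EXTRA_UPATH_ID 0x7075u /* Info-ZIP Unicode Path extra field header ID */

/* Hypothetical helper: does the entry declare its stored path as UTF-8? */
int zip_name_is_utf8(uint16_t general_purpose_flags)
{
    return (general_purpose_flags & ZIP_GPBF_UTF8) != 0;
}

/* Hypothetical helper: scan an extra-field block for the Unicode Path field.
 * Each field is a 2-byte ID and a 2-byte data size (little-endian), then data. */
int zip_has_unicode_path(const unsigned char *extra, size_t extra_len)
{
    size_t pos = 0;

    while (pos + 4 <= extra_len) {
        unsigned id   = extra[pos]     | (extra[pos + 1] << 8);
        unsigned size = extra[pos + 2] | (extra[pos + 3] << 8);

        if (size > extra_len - pos - 4)
            return 0;               /* truncated field: stop scanning */
        if (id == ZIP_EXTRA_UPATH_ID)
            return 1;
        pos += 4 + size;
    }
    return 0;
}
```

Only when neither check succeeds is the stored name in some unknown local code page, which is the case the lang<-->codeset guessing is meant for.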

We would prefer, for backward compatibility, that this auto detection require an option to enable.  If you can live with that, feel free to suggest an option letter or we'll do that.

Comment 40 Dmitry Butskoy 2011-07-25 16:06:47 UTC
Created attachment 515107 [details]
The patch (hypothetical) for 6.10b

Well,

I've created this patch based on the idea of a lang<-->codeset table.

Certainly the table is not complete. Moreover, for some locales it might be necessary to check the full language name (as in xx_XX) instead of the short one (xx).

I am not sure whether it is useful to make such auto-detection available only when explicitly enabled -- normally we should spare our users from running unzip each time with '-O its_cp', or from having to put 'alias unzip="unzip -I its_cp"' into their .bashrc -- it seems much better to enable auto-detection by default...

Comment 41 gordone 2011-07-26 00:43:36 UTC
Got the patch.

Typically we don't like making changes that impact default behavior in a way that is not backward compatible, unless what Zip or UnZip are doing now is considered broken to the point of being nearly useless.  Not sure about this situation.  It seems that this change should do the same thing where it makes sense and only change things when it doesn't.

Anyway, we'll probably need to discuss this in the development group, but this change may make it into the next UnZip public beta, which we are starting to prepare now.  Might go out in a couple weeks.

Comment 42 Mikhail 2015-06-01 21:06:34 UTC
Bug still not fixed. Why?

Comment 43 Mikhail 2015-06-01 21:09:45 UTC
Created attachment 1033515 [details]
screenshot of File Roller

Comment 44 Ivan Romanov 2015-06-11 10:15:57 UTC
There is a modern patch to fix this.
https://bugzilla.redhat.com/show_bug.cgi?id=890188
I think this task can be closed, and hopefully the libnatspec patch will be applied.

Comment 45 Petr Stodulka 2015-11-23 14:54:56 UTC
Oh, I missed this bug. It's already patched in rawhide; an update for released Fedora versions isn't planned. Alternatively, on a released Fedora you can try unzip from the copr repository, where the patch is applied as well:

# dnf copr enable pstodulk/unzip

Comment 46 Mikhail 2015-11-23 15:16:42 UTC
(In reply to pstodulk from comment #45)
> Oh, I miss this bug. It's already patched  in rawhide - update of released
> fedoras isn't planned. Alternatively for released fedora you can try unzip
> from copr repository, where patch is applied as well:
> 
> # dnf copr enable pstodulk/unzip

I installed unzip from "pstodulk/unzip", but it does not help :(

Comment 47 Mikhail 2015-11-23 15:17:14 UTC
Created attachment 1097704 [details]
screenshot 1

Comment 48 Mikhail 2015-11-23 15:17:46 UTC
Created attachment 1097705 [details]
screenshot 2

Comment 49 Petr Stodulka 2015-11-23 15:21:23 UTC
I see, you didn't use the -I parameter. See the help.

Comment 50 Mikhail 2015-11-23 16:35:13 UTC
(In reply to pstodulk from comment #49)
> I see, you didn't use -I parameter. See help

$ unzip -I 866 otchety.zip 

With the "-I 866" option the extracted file names are right, but in the console I still see '??????????????'.

Comment 51 Mikhail 2015-11-23 16:35:36 UTC
Created attachment 1097765 [details]
screenshot 3

Comment 52 Mikhail 2015-11-23 16:37:49 UTC
(In reply to pstodulk from comment #49)
> I see, you didn't use -I parameter. See help

And what about graphical utilities such as file-roller and Midnight Commander? It is impossible to set the "-I" option for them. :(

Comment 53 Mikhail 2015-11-23 16:43:21 UTC
(In reply to Mikhail from comment #50)
> (In reply to pstodulk from comment #49)
> > I see, you didn't use -I parameter. See help
> 
> $ unzip -I 866 otchety.zip 
> 
> After applying "I 866" option extracted files names are right, but in
> console still saw '??????????????'.

Very interesting: I get correct file names when extracting even without the -I option, but I can't get correct file names when I list them with the -l option.

Comment 54 Petr Stodulka 2015-11-23 19:48:03 UTC
Hmmm...do you mean the list of filenames inside the archive, or just the uncompressed files on the system? If it is the first case, can you provide/upload the archive for testing?

Comment 55 Mikhail 2015-11-23 19:54:08 UTC
(In reply to pstodulk from comment #54)
> Hmmm...do you mean list of filenames inside archive or just uncompressed
> files on system? If it is the first case, can you provide/upload the archive
> for testing?

Yes. The uncompressed files on the system are OK (without the -I option), but the list of filenames inside the archive still has this issue.

Comment 56 Mikhail 2015-11-23 19:54:52 UTC
Created attachment 1097835 [details]
archive for testing

Comment 57 Petr Stodulka 2015-11-23 21:23:55 UTC
Thanks Mikhail,
that's still an issue from my point of view, and it should be fixed. I will look at it.

Comment 58 Petr Stodulka 2015-11-25 11:14:25 UTC
Created attachment 1098716 [details]
fix print issue

Comment 59 Petr Stodulka 2015-11-25 11:17:39 UTC
A new build in copr will be completed soon. You can try it. Btw, for this you should use the -O parameter instead of -I, but unzip can guess it correctly for CP866.
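
For readers unfamiliar with what such a code page conversion involves, here is a small sketch using the standard iconv(3) interface to turn a CP866-encoded name into UTF-8. It is an illustration only, not the code from the attached patch; the function name `cp866_to_utf8` is hypothetical, and it assumes the system's iconv knows the "CP866" charset name (glibc does):

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert a NUL-terminated CP866 name to UTF-8.
 * Returns 0 on success (out is NUL-terminated), -1 on any failure,
 * including an output buffer that is too small. */
int cp866_to_utf8(const char *in, char *out, size_t out_size)
{
    iconv_t cd;
    char *inp = (char *)in;   /* iconv() takes a non-const pointer */
    char *outp = out;
    size_t in_left = strlen(in);
    size_t out_left;
    size_t rc;

    if (out_size == 0)
        return -1;
    out_left = out_size - 1;  /* reserve room for the terminating NUL */

    cd = iconv_open("UTF-8", "CP866");
    if (cd == (iconv_t)-1)
        return -1;

    rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';
    return 0;
}
```

The same conversion has to be applied both when creating files on disk and when printing names in a listing, otherwise the two outputs can disagree.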

Comment 60 Mikhail 2015-11-25 14:25:00 UTC
Thanks, I see that the issue is fixed for console unzip and Midnight Commander, please see my attached video.

But it is still not fixed for file-roller.

Comment 61 Mikhail 2015-11-25 14:25:47 UTC
Created attachment 1098839 [details]
screencast

Comment 62 Petr Stodulka 2015-11-27 09:20:34 UTC
That's a file-roller problem now. It was already reported - see bug 1177950

Comment 63 Mikhail 2015-11-27 09:25:18 UTC
(In reply to pstodulk from comment #62)
> That's problem of file-roller now. It was already reported - see bug 1177950

I can't read and subscribe to this issue :(


====================================================
You are not authorized to access bug #1177950.

Most likely the bug has been restricted for internal development processes and we cannot grant access.

If you are a Red Hat customer with an active subscription, please visit the Red Hat Customer Portal for assistance with your issue

If you are a Fedora Project user and require assistance, please consider using one of the mailing lists we host for the Fedora Project.

Comment 64 Petr Stodulka 2015-11-27 09:33:23 UTC
Ah. Sorry, I missed that. So you can report it against Fedora rawhide if you want.

