Description of problem:

Since RHL 8.0 there has been the problem of migrating filenames from ISO-8859-* encodings to UTF-8; otherwise, filenames with non-ASCII characters are not accessible to some applications. Convmv does this easily, and it should definitely go in before the next release. It is very small (~20 KB) and needs only Perl modules that are already included.

$ rpm -q --requires convmv
/usr/bin/perl
perl >= 0:5.008
perl(Cwd)
perl(Encode)
perl(File::Basename)
perl(File::Compare)
perl(File::Find)
perl(Getopt::Long)
perl(Unicode::Normalize)
perl(bytes)
perl(utf8)
rpmlib(CompressedFileNames) <= 3.0.4-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1

Maybe it should also be provided as an erratum for Psyche and Shrike. I can attach my spec file if you like.

-- What is the right component for package requests?
distribution
This can be done pretty simply with a shell wrapper around iconv; I don't think we need a special program for this ATM.
This is not about file contents but file names, and it is not simple enough to do with shell + iconv. (Can you write it within 5 minutes? Show it!) The UTF-8 default in Psyche was a real nightmare, as many files were not accessible with file managers. Renaming them manually in an xterm was a real pain. And I'm sure not everybody has migrated yet.
for file in $(find foodir -type f)
do
    base=$(basename $file | iconv -f foo -t UTF-8)
    dir=$(dirname $file)
    mv $file $dir/$base
done

is the simple variant someone has posted internally at some point. Assigning to someone who was debating what to do about this at some point.
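For comparison, a more defensive take on the same loop might look like the sketch below. It is neither the script posted internally nor convmv's code: `recode_name`, `recode_tree` and the `dry_run` flag are names made up for illustration, and Python is used only to keep it short. It assumes a filesystem that stores names as raw bytes. It works on byte paths so that spaces and odd bytes survive, walks bottom-up so a directory is renamed only after its contents, and refuses to overwrite an existing name. It does not try to detect already-converted names or to rewrite symlink targets.

```python
import os


def recode_name(name: bytes, src_encoding: str) -> bytes:
    """Re-encode one filename from src_encoding to UTF-8."""
    return name.decode(src_encoding).encode("utf-8")


def recode_tree(root: bytes, src_encoding: str, dry_run: bool = True):
    """Rename entries under root; return the list of (old, new) byte paths.

    Hypothetical helper: with dry_run=True nothing on disk is changed.
    """
    renames = []
    # Bottom-up walk: children are handled before their parent directory,
    # so the paths we build stay valid while we rename.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            new_name = recode_name(name, src_encoding)
            if new_name == name:
                continue  # pure ASCII names come out unchanged
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(dirpath, new_name)
            if os.path.lexists(new_path):
                continue  # never clobber an existing entry
            renames.append((old_path, new_path))
            if not dry_run:
                os.rename(old_path, new_path)
    return renames
```

Calling `recode_tree(b"foodir", "iso-8859-1")` would only list the planned renames; passing `dry_run=False` performs them.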
See ftp://people.redhat.com/hp/recode-files.c
ftp://people.redhat.com/hp/recode-files.c is not world-readable.

Things a conversion tool should be aware of:
- already converted files
- symlinks

And there should be a hint in the release notes that such a thing exists.
Sorry, I chmod'd it now. I don't think we'll add it to the release notes unless we add it to the distribution; right now it's just something people can try out if they want and comment on whether to include. Changing symlink targets... hmm. Sounds hard but possible. I don't know of any way to reliably handle already-converted files (this is the same problem as "handle a filesystem with filenames of mixed encoding" - and there's simply no reliable way to autodetect filename encodings, so I don't see how you can handle this reliably).
I don't think it's very likely that valid UTF-8 strings make sense when interpreted with other encodings. (Maybe it's completely irrelevant in practice.) And it would be better not to recode a name than to double-encode it.
Created attachment 93196: convmv perl utility

It's really readable; you should have a look.
> I don't think it's very likely that valid UTF-8 strings make sense when
> interpreted with other encodings

It's extremely likely, because all possible bytes are valid in the Latin-* encodings and other 8-bit encodings... The perl script does look interesting, we should consider it for sure. I was just posting the C program since it's what we were discussing previously.
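The point about 8-bit encodings accepting any byte can be illustrated with a few lines of Python. This is only a sketch: the example filename is made up, and it relies on Python's `iso-8859-1` codec, which maps every byte value to a character. It shows that decoding UTF-8 bytes as Latin-1 never fails, so a blind second conversion double-encodes the name.

```python
# A UTF-8 encoded filename as it would sit on disk ("café.txt").
utf8_name = "caf\u00e9.txt".encode("utf-8")

# Python's iso-8859-1 codec maps every byte 0x00-0xFF to a character,
# so decoding never raises, whatever the bytes really were.
misread = utf8_name.decode("iso-8859-1")

# Running a "Latin-1 to UTF-8" conversion on it again double-encodes the name.
double_encoded = misread.encode("utf-8")

print(repr(misread))
print(double_encoded)
```

The decode step gives no error to hang a check on, which is why a converter needs some other test before recoding a name.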
Valid doesn't mean it also makes sense. I myself haven't seen any UTF-8 that looked useful when viewed as Latin-*. And I can only reiterate that it's better to leave files untouched than to double-encode them.
Can you define "makes sense" in terms of a computer algorithm ;-) This would be an unsolved problem in AI. Especially since filenames need not be any kind of dictionary word or combination thereof.
If filenames contain non-ASCII characters, they are mostly dictionary based. But it really doesn't matter that much, see my last sentence above. ;)
Havoc, ISO8859-* encodings and most other encodings really cannot be "guessed" very well, but UTF-8 can be "guessed" *very* reliably, and convmv does that. Convmv can also convert from and to NFC and NFD, which is important for Mac OS X interoperability. It is also important that convmv does efficient checks for invalid and insufficient charsets. Though I wrote convmv for UTF-8 locale migration, it turned out that these days it's mainly used for Samba repository conversions when people migrate to Samba3 or change the "unix charset" option.
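The two ideas in this comment, recognising names that are already UTF-8 and converting between NFC and NFD, can be sketched in Python as follows. This is not taken from convmv, whose checks may well differ; `is_probably_utf8`, `to_nfc` and `to_nfd` are hypothetical helper names. The guess rests on the fact that UTF-8 multi-byte sequences have a strict structure that text in an 8-bit encoding rarely satisfies by accident.

```python
import unicodedata


def is_probably_utf8(name: bytes) -> bool:
    """Guess whether a filename is already UTF-8.

    Pure ASCII names say nothing either way, so they return False here.
    """
    if all(b < 0x80 for b in name):
        return False
    try:
        name.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


def to_nfc(name: bytes) -> bytes:
    """Normalise a UTF-8 filename to NFC (precomposed characters)."""
    return unicodedata.normalize("NFC", name.decode("utf-8")).encode("utf-8")


def to_nfd(name: bytes) -> bytes:
    """Normalise a UTF-8 filename to NFD (base character + combining marks)."""
    return unicodedata.normalize("NFD", name.decode("utf-8")).encode("utf-8")
```

A converter could skip any name for which `is_probably_utf8` returns true, which errs on the side of leaving a file untouched rather than double-encoding it.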
I think this is useful but the bug is just in limbo while assigned to me; we need someone who will in fact do the work to package/maintain.
inside or outside RH?
Outside is fine, though at the moment outside would have to be done via fedora.us (hopefully this is changing soon...) Inside is fine too of course.
convmv is available in extras. I still think it should be in core.
REOPENED status has been deprecated. ASSIGNED with keyword of Reopened is preferred.
Core/Extras are merging, solving this problem.