Red Hat Bugzilla – Bug 668282
PackageKit yum backend uses incorrect encoding for dynamic category names, makes them show up with '?' characters in KPackageKit
Last modified: 2014-01-21 18:17:27 EST
Created attachment 472457 [details]
screenshot for application
Description of problem:
Rendering for Group name in 'Add Remove Software', where gpk-application showing those name correctly (data is UTF-8).
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. run kpackagekit in Punjabi (pa_IN)
2. Add/Remove Software -> Categories->Application (or any other)
showing ???? characters instead of actual text
actual Punjabi text must be shown
Click on History, it is showing text properly
Created attachment 472458 [details]
gpk application is showing same text
> Rendering for Group name in 'Add Remove Software', where gpk-application
> showing those name correctly (data is UTF-8).
Already known, discussed on #fedora-kde on IRC.
It's a bug in PackageKit's yum backend, it's handling the encoding incorrectly. Richard Hughes is already aware of the issue, it should get fixed shortly.
> gpk application is showing same text
That's not the same text. gnome-packagekit uses the hardcoded groups directly from the PackageKit core, KPackageKit uses dynamic categories which are returned by the backend.
The advantage of dynamic categories is that you get the full list of categories from comps, not just a subset.
The big is fixed by this:
Author: Richard Hughes <firstname.lastname@example.org>
Date: Tue Jan 4 12:39:37 2011 +0000
yum: Ensure the category data is valid UTF8
:100755 100755 0edce6d... 84985af... M backends/yum/yumBackend.py
Basically, yum isn't always emitting valid utf8 for category names for some locales. I'll push a new PK release out in a few weeks that fixes this. If someone wants to build an update for F14 and submit it to bohdi, you have my blessing, but I've got a TODO list longer than my arm this week. Sorry.
Thanks, I'm taking care of this.
PackageKit-0.6.11-2.fc14 has been submitted as an update for Fedora 14.
PackageKit-0.6.11-2.fc14 has been pushed to the Fedora 14 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing update PackageKit'. You can provide feedback for this update here: https://admin.fedoraproject.org/updates/PackageKit-0.6.11-2.fc14
I tried by update packages:
PackageKit and kpackagekit, but still I am getting "????" characters only.
(going to try again by rebooting machine)
PackageKit-0.6.11-2.fc14 has been pushed to the Fedora 14 stable repository. If problems still persist, please make note of it in this bug report.
rhughes: Your fix does not work!
I backported the upstream commit you pointed me to, it got pushed, I upgraded to it myself, but after a reboot, I still get the same old question marks in KPackageKit.
In comment #7, A S Alam also says the fix doesn't work for him. (FWIW, next time give negative karma, so you'd have compensated that overzealous tester who gave it a +1 without actually testing the bug. I should have canceled the push, but since you said you were "going to try again by rebooting machine" and you didn't post anything after that, I assumed that a reboot fixed it for you.)
(In reply to comment #9)
> I backported the upstream commit you pointed me to, it got pushed, I upgraded
> to it myself, but after a reboot, I still get the same old question marks in
Hmm. If that's the case, I apologize, but it certainly fixed it for me. Did you try a "pkcon refresh" after installing the update?
I've just done one now, it didn't help.
(In reply to comment #9)
> In comment #7, A S Alam also says the fix doesn't work for him. (FWIW, next
> time give negative karma, so you'd have compensated that overzealous tester who
> gave it a +1 without actually testing the bug. I should have canceled the push,
> but since you said you were "going to try again by rebooting machine" and you
> didn't post anything after that, I assumed that a reboot fixed it for you.)
Next time I will take care and provide -1 karma. (this time I was trying to overcome another issue (lokalize crash and junk charater) and for that I did various random things, so I was not sure about my results whether those are correct or not. After that upgrade to rawhide to work with lokalize, so not update with 'reboot' result).
Don't worry, we didn't actually break anything, the only harm done is the unnecessary update. And it's mostly my fault (but I blame the policies: with the old system, I'd just have kept karma automatism disabled and decided myself whether the feedback was sufficient or not; with the new system, if I do that, I have to wait a full week to push my updates no matter how good the feedback was, so there's strong pressure to enable karma automatism with all its brokenness).
Ping? Any chance we can get a fix for this which actually works?
Hmm, I can't reproduce. Is there a good way to reproduce this on a F15 system? Thanks.
I have the same problem in French.
Looking around, I tried to play with QTextCodec, without success.
Another little issue is that Categories are displayed in English, whilst Groups are displayed in French, but with special characters displayed as "?".
I confirm this bug for Russian language too.
Created attachment 495417 [details]
I confirm this bug for Russian language.
Any update on this? I can confirm this on F15 + Czech language as well
This also affects other strings, e.g. update details. It does not happen with the zif backend, so this is definitely an issue in the yum backend.
*** Bug 676488 has been marked as a duplicate of this bug. ***
The duplicate Bug 676488 was accepted as a release blocker, so I guess this one should be marked as a blocker now.
Okay, so there's something odd going on. In yumBackend.py we have (line 705):
name = grp.nameByLang(self.lang)
So it should mean we can return valid UTF8 by making it:
name = _to_unicode(grp.nameByLang(self.lang))
...like we do in so many other places. But alas, this seems not to work, and packagekitd still emits lots of text 'Son et vid�o' was not valid UTF8! and refuses (correctly) to pass this on to the clients.
So, if we debug this a little, we find that the clause:
if not isinstance(name, unicode)
...fails in _to_unicode(), so it basically looks like yum is sending us a string that purports to be unicode, but contains invalid encoding.
We can't even do:
name = grp.nameByLang(self.lang)
name = unicode(name, 'utf-8', errors='replace')
As python (again, correctly) refuses to decode unicode like that.
To double check, if we do:
... we get "Error Value: 'ascii' codec can't encode character u'\xe9' in position 23: ordinal not in range(128)"
So, unless I'm misunderstanding something, or unless yum is emitting this as valid unicode (but UCS-2 or UTF-16, which would be very odd) I think this is probably a yum / python bug.
I'll reassign it for comments.
> So it should mean we can return valid UTF8 by making it:
> name = _to_unicode(grp.nameByLang(self.lang))
nameByLang() returns unicode strings, so _to_unicode is a no-op here.
name.encode('UTF8') should return UTF8 encoded string.
(In reply to comment #25)
> nameByLang() returns unicode strings, so _to_unicode is a no-op here.
> name.encode('UTF8') should return UTF8 encoded string.
Do we know what encoding it is already? If .encode() is what I should be doing, can you suggest a patch to https://gitorious.org/packagekit/packagekit/blobs/master/backends/yum/yumBackend.py#line122 to make it work on this kind of string too please? Thanks.
Python unicode string is internally UCS2 or UCS4 (configurable at build time), we don't want to touch that directly.
_to_unicode() is fine, it just converts non-unicode string to unicode, interpreting it as UTF8. The bug is probably somewhere else- 'unicode' being converted to 'gchar*' incorrectly.
dupe was an acceptedblocker, so this should be too.
> So, if we debug this a little, we find that the clause:
> if not isinstance(name, unicode)
> ...fails in _to_unicode(), so it basically looks like yum is sending us a
> string that purports to be unicode, but contains invalid encoding.
I'll try and help, as I have to deal with this confusion a lot in yum. What Zdeněk said is correct ... if the thing you have is "isinstance(name, unicode)" then you should think of it as a unicode object, it "can't" be bad they are all the same etc.
> To double check, if we do:
> print name.decode('utf-8')
> ... we get "Error Value: 'ascii' codec can't encode character u'\xe9' in
> position 23: ordinal not in range(128)"
One, _very_, confusing thing is that both str() and unicode() objects have .decode() and .encode() methods ... and coming from the C world I tend to think of "utf-8" as the natural type which should be encoded to python's internal "unicode" object, but that's exactly the opposite of what you should do.
This also means you can do:
"⅓" (fine a bytes object with utf-8 in it)
"⅓".decode('utf-8') (also fine, but now a unicode object)
"⅓".decode('utf-8').decode('utf-8') (raises: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2153' in position 0: ordinal not in range(128)).
...which is nice.
Which is why we use to_unicode() and to_utf8() everywhere in yum.
The last bit, which is likely where the error is coming from is that (because python defaults to the ascii codec):
"⅓" + "a" (fine)
"a" + "⅓" (fine)
"⅓".decode('utf-8') + "a" (fine)
"a" + "⅓".decode('utf-8') (fine)
"⅓".decode('utf-8') + "⅓".decode('utf-8') (fine)
"⅓".decode('utf-8') + "⅓" (raises)
"a".decode('utf-8') + "⅓" (raises)
So this means to_unicode is exactly the wrong thing to use in PackageKit-yum, it should be using to_utf8 instead, right?
> it should be using to_utf8 instead, right?
Likely yes. Python does not mix utf8 and unicode well. Possible solutions are:
a) Whenever some API returns python unicode object, convert it to utf8 and pretend unicode does not exist.
b) Convert everything to unicode. 'yum' uses this.
to_utf8 and to_unicode are wrappers that skip conversion if already done.
Python then understands utf8 when implicitly converting strings to unicode, so
'unicode + utf8' works as expected. It's effectively b) but without the need to call to_unicode explicitly. It has issues too. http://lists.baseurl.org/pipermail/yum-devel/2011-March/008034.html
If most of the code expects utf8, using a) is probably best.
Created attachment 529473 [details]
Something like this?
I think you need to call _to_utf8 on ALL strings returned by yum. I've also seen question marks in at least update descriptions and titles of update Bugzilla links.
Oh, and update RPM changelogs, too.
Richard is apparently on vacation this week, so we need someone else to finish up his patch and fix this.
Fedora Bugzappers volunteer triage team
Created attachment 529928 [details]
UNTESTED patch (should be more complete than Richard's)
This (proof-of-concept) patch looks very huge, but it basically boils down to 4 things:
1. adds the _to_utf8 function from Richard's patch,
2. adds the missing _to_utf8 call added by Richard's patch,
3. does a s/_to_unicode/_to_utf8/g – AFAICT, all those calls are on strings returned by yum, before passing them to GLib API. GLib wants UTF-8, not UTF-16/32.
4. removes the no longer used _to_unicode function.
WARNING: For all I know, this might screw everything up royally. I think this should be 1. proofread by somebody more familiar with Python and 2. tested.
Created attachment 529967 [details]
UNTESTED patch that adds more robustness on top of Kevin's fix
Looks decent. It seems like there's a few assumptions in the original code that you're porting that might not bear out in practice. Here's a slightly revised patch that uses python-kitchen's to_unicode and to_bytes methods. Also untested but it should be more robust in corner cases. (One of which, passing an object to to_unicode/to_utf8 is sprinkled quite liberally throughout this patch :-)
The get_update_detail() function looks like it would benefit from a rewrite with a determination to not mix unicode string and byte str types together but I didn't want to get too invasive with a patch this close to the deadline.
The assumption I'm making is that the APIs that we're using here must all be utf-8 encoded bytes (neither python unicode type nor other encodings (like shift-js, big5, or latin-1) would work). With that in mind, we need to be more pedantic about just what data we're passing through. (Making sure to convert byte strings which are not utf-8 encoded into utf-8.)
Also, constructions like:
except Exception, e:
self.error('%s' % _to_unicode(e))
wouldn't do the right thing under the old code. (the "e" variable would be an exception object so the _to_unicode()/_to_utf8() function in yumBackend.py would have passed that object through to the string format unchanged. That could have raised an exception depending on the user's locale settings, the exception message, and whether that message was a unicode string or a byte str.) Porting to the kitchen to_bytes() function fixes that as to_bytes() will convert an object to a str representation of the object.
Note that my changes are just to help prevent issues that were present in the original code and in the simple port by Kevin don't suddenly raise tracebacks at a later point. The actual fixing of this particular bug looks like it would be fixed via the logical changes that Kevin's patch makes. This is just a more robust implementation of those changes.
There are ways that someone who knows the code can optimize this better. If you know that a function will always be given wither a unicode string or a byte str of type utf8, you can simply call to_bytes() on it instead of doing to_bytes(to_unicode()). Similarly, if my blanket assumption about all APIs being called here needing byte str of encoding utf8, you can optimize aways some of the function calls.
Note, in case it wasn't apparent in my last comment, if my version of the patch is used, the package will need to grow a dep on python-kitchen.
Re-assigning to Nils, who we believe is looking after PK while hughsie is away. Nils, can you please take a look at this one ASAP? It's blocking the Final RC tomorrow. Thanks!
I wasn't aware that Richard is on vacation this week, but hey ;-).
I'm a bit wary of introducing a new dependency at this point (and without consulting Richard), so I'll rather go with a slightly pimped version of Kevin's patch, i.e. let it accept any object that can be converted to (byte) strings or unicode, then convert it to UTF-8 encoded strings and return that. We can always let it use the kitchen module instead later on.
Here's the pimped version of _to_utf8() I plan to commit:
--- 8< ---
'''convert practically anything to a utf-8-encoded byte string'''
# convert to unicode object
if isinstance(txt, str):
txt = txt.decode('utf-8', errors='replace')
if not isinstance(txt, basestring):
# try to convert non-string objects like exceptions
# if txt.__unicode__() exists, or txt.__str__() returns ASCII
txt = unicode(txt)
# if txt.__str__() exists
txt = str(txt).decode('utf-8', errors='replace')
# no __str__(), __unicode__() methods, use representation
txt = unicode(repr(txt))
# return encoded as UTF-8
return txt.encode('utf-8', errors='replace')
--- >8 ---
I've noticed that the packagekit python module also uses unicode objects instead of UTF-8 encoded strings, but for the moment I'll assume this is in order and works.
I still don't understand why we need to convert strings which are already UTF-8 byte strings to Unicode first, only to convert them back to UTF-8 right afterwards. (For strings which are in some non-UTF-8 byte encoding, txt.decode('utf-8', errors='replace') won't work anyway.)
(In reply to comment #42)
> I still don't understand why we need to convert strings which are already UTF-8
> byte strings to Unicode first, only to convert them back to UTF-8 right
> afterwards. (For strings which are in some non-UTF-8 byte encoding,
> txt.decode('utf-8', errors='replace') won't work anyway.)
At the beginning of the function, we don't know whether an str object is encoded in UTF-8 or something else. Attempting to decode it as UTF-8 and replacing characters which aren't with those funny question marks is really the best thing we can do at this point with the information we have.
I'd rather not "optimize" that function as you seem to suggest -- if we'd simply return the same str object it might be still encoded in something else than UTF-8, decoding it to unicode (with errors='replace') and re-encoding it again ensures that regardless what we give back, it's UTF-8.
Hmm, it seems simply using UTF-8 encoded strs everywhere breaks other stuff (which expects unicode objects). Need to find all places where the python side passes on string data to the C-side and convert there.
Having discussed this with Toshio on IRC, I think an easier way might be to use or copy kitchen.text.converters.getwriter() and wrap sys.stdout with it in yumBackend. I'll work on that tomorrow.
We really could do with this before tomorrow. *Today* is the scheduled day for the RC compose. We need all blockers addressed to do the RC compose. Tomorrow is delaying the schedule. thanks!
nils: any news on this today? Thanks!
Sorry for the silence. Unfortunately I ran into problems yesterday which took me the whole day to tackle. I seem to have found a solution in the evening -- I saw some pretty umlauts in kpackagekit instead of question marks -- but by that time my brain felt like a bowl of grits and I wasn't confident enough to commit and build.
Meanwhile I've done a few tests on F-15 and F-16 and I'll build patched packages shortly.
PackageKit-0.6.19-3.fc16 has been submitted as an update for Fedora 16.
Update from comment #49 was fixed this bug.
I've added karma.
Can people please check it doesn't regress GNOME too?
Fedora Bugzappers volunteer triage team
I'm not sure. But my gpk-application with new PackageKit packages works fine.
Groups of packages shown correct (screenshot with LANG=ru_RU.UTF-8 http://itmages.ru/image/view/314379/5440ea34).
I try install, remove packages. All works fine.
The fix should be in TC3:
if people can check and verify TC3 behaves correctly, we can set this to VERIFIED. thanks!
the bug has two reports that the update fixes the bug, and the update is in TC3, so I'm going to set this VERIFIED.
PackageKit-0.6.19-3.fc16 has been pushed to the Fedora 16 stable repository. If problems still persist, please make note of it in this bug report.
*** Bug 750001 has been marked as a duplicate of this bug. ***