Bug 668282

Summary: PackageKit yum backend uses incorrect encoding for dynamic category names, makes them show up with '?' characters in KPackageKit
Product: [Fedora] Fedora Reporter: A S Alam <aalam>
Component: PackageKitAssignee: Nils Philippsen <nphilipp>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: low    
Version: 16CC: a.badger, alekcejk, atigro, awilliam, balajig81, ffesti, hughsient, i18n-bugs, james.antill, jonathan, kevin, ltinkl, maxamillion, me, mshao, muertosfx, pahan, pierre.juhen, pmatilai, rdieter, rhughes, richard, rvitale, smparrish, tla, zpavlas
Target Milestone: ---Keywords: i18n, Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: AcceptedBlocker
Fixed In Version: PackageKit-0.6.19-3.fc16 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-10-29 02:30:59 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Bug Depends On:    
Bug Blocks: 713568    
Attachments:
Description Flags
screenshot for application
none
gpk application is showing same text
none
KPackageKit screenshot
none
test patch
none
UNTESTED patch (should be more complete than Richard's)
none
UNTESTED patch that adds more robustness on top of Kevin's fix none

Description A S Alam 2011-01-09 14:03:26 EST
Created attachment 472457 [details]
screenshot for application

Description of problem:
Rendering for Group name in 'Add Remove Software', where gpk-application showing those name correctly (data is UTF-8).

Version-Release number of selected component (if applicable):
kpackagekit-0.6.3.2-2.fc14.x86_64

How reproducible:
Everytime

Steps to Reproduce:
1. run kpackagekit in Punjabi (pa_IN)
2. Add/Remove Software -> Categories->Application (or any other)
3.
  
Actual results:
showing ???? characters instead of actual text

Expected results:
actual Punjabi text must be shown 

Additional info:
Click on History, it is showing text properly
Comment 1 A S Alam 2011-01-09 14:06:31 EST
Created attachment 472458 [details]
gpk application is showing same text
Comment 2 Kevin Kofler 2011-01-09 16:49:23 EST
> Rendering for Group name in 'Add Remove Software', where gpk-application
> showing those name correctly (data is UTF-8).

Already known, discussed on #fedora-kde on IRC.

It's a bug in PackageKit's yum backend, it's handling the encoding incorrectly. Richard Hughes is already aware of the issue, it should get fixed shortly.

> gpk application is showing same text

That's not the same text. gnome-packagekit uses the hardcoded groups directly from the PackageKit core, KPackageKit uses dynamic categories which are returned by the backend.

The advantage of dynamic categories is that you get the full list of categories from comps, not just a subset.
Comment 3 Richard Hughes 2011-01-10 04:33:30 EST
The big is fixed by this:

commit 7a92f842830e3ea9122463fe279f0b42150cbd63
Author: Richard Hughes <richard@hughsie.com>
Date:   Tue Jan 4 12:39:37 2011 +0000

    yum: Ensure the category data is valid UTF8

:100755 100755 0edce6d... 84985af... M  backends/yum/yumBackend.py

Basically, yum isn't always emitting valid utf8 for category names for some locales. I'll push a new PK release out in a few weeks that fixes this.  If someone wants to build an update for F14 and submit it to bohdi, you have my blessing, but I've got a TODO list longer than my arm this week. Sorry.
Comment 4 Kevin Kofler 2011-01-12 01:24:40 EST
Thanks, I'm taking care of this.
Comment 5 Fedora Update System 2011-01-12 02:19:19 EST
PackageKit-0.6.11-2.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/PackageKit-0.6.11-2.fc14
Comment 6 Fedora Update System 2011-01-13 13:00:16 EST
PackageKit-0.6.11-2.fc14 has been pushed to the Fedora 14 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update PackageKit'.  You can provide feedback for this update here: https://admin.fedoraproject.org/updates/PackageKit-0.6.11-2.fc14
Comment 7 A S Alam 2011-01-13 22:29:49 EST
I tried by update packages: 
PackageKit and kpackagekit, but still I am getting "????" characters only.

(going to try again by rebooting machine)
Comment 8 Fedora Update System 2011-01-17 15:53:41 EST
PackageKit-0.6.11-2.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 9 Kevin Kofler 2011-01-20 16:37:04 EST
rhughes: Your fix does not work!

I backported the upstream commit you pointed me to, it got pushed, I upgraded to it myself, but after a reboot, I still get the same old question marks in KPackageKit.

In comment #7, A S Alam also says the fix doesn't work for him. (FWIW, next time give negative karma, so you'd have compensated that overzealous tester who gave it a +1 without actually testing the bug. I should have canceled the push, but since you said you were "going to try again by rebooting machine" and you didn't post anything after that, I assumed that a reboot fixed it for you.)
Comment 10 Richard Hughes 2011-01-20 18:00:00 EST
(In reply to comment #9)
> I backported the upstream commit you pointed me to, it got pushed, I upgraded
> to it myself, but after a reboot, I still get the same old question marks in
> KPackageKit.

Hmm. If that's the case, I apologize, but it certainly fixed it for me. Did you try a "pkcon refresh" after installing the update?
Comment 11 Kevin Kofler 2011-01-20 18:40:48 EST
I've just done one now, it didn't help.
Comment 12 A S Alam 2011-01-21 01:19:39 EST
(In reply to comment #9)
> 
> In comment #7, A S Alam also says the fix doesn't work for him. (FWIW, next
> time give negative karma, so you'd have compensated that overzealous tester who
> gave it a +1 without actually testing the bug. I should have canceled the push,
> but since you said you were "going to try again by rebooting machine" and you
> didn't post anything after that, I assumed that a reboot fixed it for you.)

Next time I will take care and provide -1 karma. (this time I was trying to overcome another issue (lokalize crash and junk charater) and for that I did various random things, so I was not sure about my results whether those are correct or not. After that upgrade to rawhide to work with lokalize, so not update with 'reboot' result).
Comment 13 Kevin Kofler 2011-01-21 03:07:28 EST
Don't worry, we didn't actually break anything, the only harm done is the unnecessary update. And it's mostly my fault (but I blame the policies: with the old system, I'd just have kept karma automatism disabled and decided myself whether the feedback was sufficient or not; with the new system, if I do that, I have to wait a full week to push my updates no matter how good the feedback was, so there's strong pressure to enable karma automatism with all its brokenness).
Comment 14 Kevin Kofler 2011-03-05 13:13:06 EST
Ping? Any chance we can get a fix for this which actually works?
Comment 15 Richard Hughes 2011-03-07 04:32:17 EST
Hmm, I can't reproduce. Is there a good way to reproduce this on a F15 system? Thanks.
Comment 16 Pierre Juhen 2011-03-07 16:42:20 EST
I have the same problem in French.

Looking around, I tried to play with QTextCodec, without success.

Another little issue is that Categories are displayed in English, whilst Groups are displayed in French, but with special characters displayed as "?".
Comment 17 Alexei Panov 2011-04-27 12:37:00 EDT
I confirm this bug for Russian language too.
Comment 18 Alexei Panov 2011-04-28 02:40:15 EDT
Created attachment 495417 [details]
KPackageKit screenshot
Comment 19 muertos 2011-04-28 10:47:43 EDT
I confirm this bug for Russian language.
Comment 20 Lukáš Tinkl 2011-05-27 11:50:13 EDT
Any update on this? I can confirm this on F15 + Czech language as well
Comment 21 Kevin Kofler 2011-10-20 03:32:54 EDT
This also affects other strings, e.g. update details. It does not happen with the zif backend, so this is definitely an issue in the yum backend.
Comment 22 Kevin Kofler 2011-10-20 03:34:04 EDT
*** Bug 676488 has been marked as a duplicate of this bug. ***
Comment 23 Kevin Kofler 2011-10-20 03:35:05 EDT
The duplicate Bug 676488 was accepted as a release blocker, so I guess this one should be marked as a blocker now.
Comment 24 Richard Hughes 2011-10-20 04:51:29 EDT
Okay, so there's something odd going on. In yumBackend.py we have (line 705):

name = grp.nameByLang(self.lang)

So it should mean we can return valid UTF8 by making it:

name = _to_unicode(grp.nameByLang(self.lang))

...like we do in so many other places. But alas, this seems not to work, and packagekitd still emits lots of text 'Son et vid�o' was not valid UTF8! and refuses (correctly) to pass this on to the clients.

So, if we debug this a little, we find that the clause:

if not isinstance(name, unicode)

...fails in _to_unicode(), so it basically looks like yum is sending us a string that purports to be unicode, but contains invalid encoding.

We can't even do:

name = grp.nameByLang(self.lang)
name = unicode(name, 'utf-8', errors='replace')

As python (again, correctly) refuses to decode unicode like that.

To double check, if we do:

print name.decode('utf-8')

... we get "Error Value: 'ascii' codec can't encode character u'\xe9' in position 23: ordinal not in range(128)"

So, unless I'm misunderstanding something, or unless yum is emitting this as valid unicode (but UCS-2 or UTF-16, which would be very odd) I think this is probably a yum / python bug.

I'll reassign it for comments.
Comment 25 Zdeněk Pavlas 2011-10-20 09:03:27 EDT
> So it should mean we can return valid UTF8 by making it:
> name = _to_unicode(grp.nameByLang(self.lang))

nameByLang() returns unicode strings, so _to_unicode is a no-op here.
name.encode('UTF8') should return UTF8 encoded string.
Comment 26 Richard Hughes 2011-10-20 09:10:08 EDT
(In reply to comment #25)
> nameByLang() returns unicode strings, so _to_unicode is a no-op here.
> name.encode('UTF8') should return UTF8 encoded string.

Thanks Zdeněk,

Do we know what encoding it is already? If .encode() is what I should be doing, can you suggest a patch to https://gitorious.org/packagekit/packagekit/blobs/master/backends/yum/yumBackend.py#line122 to make it work on this kind of string too please? Thanks.

Richard
Comment 27 Zdeněk Pavlas 2011-10-20 11:39:53 EDT
Python unicode string is internally UCS2 or UCS4 (configurable at build time), we don't want to touch that directly.  

_to_unicode() is fine, it just converts non-unicode string to unicode, interpreting it as UTF8.  The bug is probably somewhere else- 'unicode' being converted to 'gchar*' incorrectly.
Comment 28 Adam Williamson 2011-10-20 12:11:56 EDT
dupe was an acceptedblocker, so this should be too.
Comment 29 James Antill 2011-10-20 12:18:28 EDT
> So, if we debug this a little, we find that the clause:
> 
> if not isinstance(name, unicode)
> 
> ...fails in _to_unicode(), so it basically looks like yum is sending us a
> string that purports to be unicode, but contains invalid encoding.

 I'll try and help, as I have to deal with this confusion a lot in yum. What Zdeněk said is correct ... if the thing you have is "isinstance(name, unicode)" then you should think of it as a unicode object, it "can't" be bad they are all the same etc.

> To double check, if we do:
> 
> print name.decode('utf-8')
> 
> ... we get "Error Value: 'ascii' codec can't encode character u'\xe9' in
> position 23: ordinal not in range(128)"

 One, _very_, confusing thing is that both str() and unicode() objects have .decode() and .encode() methods ... and coming from the C world I tend to think of "utf-8" as the natural type which should be encoded to python's internal "unicode" object, but that's exactly the opposite of what you should do.
 This also means you can do:

"⅓" (fine a bytes object with utf-8 in it)
"⅓".decode('utf-8') (also fine, but now a unicode object)
"⅓".decode('utf-8').decode('utf-8') (raises: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2153' in position 0: ordinal not in range(128)).

...which is nice.
 Which is why we use to_unicode() and to_utf8() everywhere in yum.

 The last bit, which is likely where the error is coming from is that (because python defaults to the ascii codec):

"⅓" + "a" (fine)
"a" + "⅓" (fine)
"⅓".decode('utf-8') + "a" (fine)
"a" + "⅓".decode('utf-8') (fine)
"⅓".decode('utf-8') + "⅓".decode('utf-8') (fine)

"⅓".decode('utf-8') + "⅓" (raises)
"a".decode('utf-8') + "⅓" (raises)
Comment 30 Kevin Kofler 2011-10-20 13:16:51 EDT
So this means to_unicode is exactly the wrong thing to use in PackageKit-yum, it should be using to_utf8 instead, right?
Comment 31 Zdeněk Pavlas 2011-10-21 03:15:44 EDT
> it should be using to_utf8 instead, right?

Likely yes.  Python does not mix utf8 and unicode well.  Possible solutions are:

a) Whenever some API returns python unicode object, convert it to utf8 and pretend unicode does not exist.

b) Convert everything to unicode.  'yum' uses this.

to_utf8 and to_unicode are wrappers that skip conversion if already done.

c) sys.setdefaultencoding('utf-8')

Python then understands utf8 when implicitly converting strings to unicode, so 
'unicode + utf8' works as expected.  It's effectively b) but without the need to call to_unicode explicitly.  It has issues too. http://lists.baseurl.org/pipermail/yum-devel/2011-March/008034.html

If most of the code expects utf8, using a) is probably best.
Comment 32 Richard Hughes 2011-10-21 07:03:51 EDT
Created attachment 529473 [details]
test patch

Something like this?
Comment 33 Kevin Kofler 2011-10-21 07:15:18 EDT
I think you need to call _to_utf8 on ALL strings returned by yum. I've also seen question marks in at least update descriptions and titles of update Bugzilla links.
Comment 34 Kevin Kofler 2011-10-21 07:15:38 EDT
Oh, and update RPM changelogs, too.
Comment 35 Adam Williamson 2011-10-24 12:52:52 EDT
Richard is apparently on vacation this week, so we need someone else to finish up his patch and fix this.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers
Comment 36 Kevin Kofler 2011-10-24 13:21:17 EDT
Created attachment 529928 [details]
UNTESTED patch (should be more complete than Richard's)

This (proof-of-concept) patch looks very huge, but it basically boils down to 4 things:
1. adds the _to_utf8 function from Richard's patch,
2. adds the missing _to_utf8 call added by Richard's patch,
3. does a s/_to_unicode/_to_utf8/g – AFAICT, all those calls are on strings returned by yum, before passing them to GLib API. GLib wants UTF-8, not UTF-16/32.
4. removes the no longer used _to_unicode function.


WARNING: For all I know, this might screw everything up royally. I think this should be 1. proofread by somebody more familiar with Python and 2. tested.
Comment 37 Toshio Ernie Kuratomi 2011-10-24 17:06:31 EDT
Created attachment 529967 [details]
UNTESTED patch that adds more robustness on top of Kevin's fix

Looks decent.  It seems like there's a few assumptions in the original code that you're porting that might not bear out in practice.  Here's a slightly revised patch that uses python-kitchen's to_unicode and to_bytes methods.  Also untested but it should be more robust in corner cases.  (One of which, passing an object to to_unicode/to_utf8 is sprinkled quite liberally throughout this patch :-)

The get_update_detail() function looks like it would benefit from a rewrite with a determination to not mix unicode string and byte str types together but I didn't want to get too invasive with a patch this close to the deadline.

The assumption I'm making is that the APIs that we're using here must all be utf-8 encoded bytes (neither python unicode type nor other encodings (like shift-js, big5, or latin-1) would work).  With that in mind, we need to be more pedantic about just what data we're passing through.  (Making sure to convert byte strings which are not utf-8 encoded into utf-8.)

Also, constructions like:

try:
    raise Exception('café')
except Exception, e:
  self.error('%s' % _to_unicode(e))

wouldn't do the right thing under the old code.  (the "e" variable would be an exception object so the _to_unicode()/_to_utf8() function in yumBackend.py would have passed that object through to the string format unchanged.  That could have raised an exception depending on the user's locale settings, the exception message, and whether that message was a unicode string or a byte str.)  Porting to the kitchen to_bytes() function fixes that as to_bytes() will convert an object to a str representation of the object.

Note that my changes are just to help prevent issues that were present in the original code and in the simple port by Kevin don't suddenly raise tracebacks at a later point.  The actual fixing of this particular bug looks like it would be fixed via the logical changes that Kevin's patch makes.  This is just a more robust implementation of those changes.

There are ways that someone who knows the code can optimize this better.  If you know that a function will always be given wither a unicode string or a byte str of type utf8, you can simply call to_bytes() on it instead of doing to_bytes(to_unicode()).  Similarly, if my blanket assumption about all APIs being called here needing byte str of encoding utf8, you can optimize aways some of the function calls.
Comment 38 Toshio Ernie Kuratomi 2011-10-24 17:44:03 EDT
Note, in case it wasn't apparent in my last comment, if my version of the patch is used, the package will need to grow a dep on python-kitchen.
Comment 39 Adam Williamson 2011-10-24 20:12:57 EDT
Re-assigning to Nils, who we believe is looking after PK while hughsie is away. Nils, can you please take a look at this one ASAP? It's blocking the Final RC tomorrow. Thanks!
Comment 40 Nils Philippsen 2011-10-25 05:17:01 EDT
I wasn't aware that Richard is on vacation this week, but hey ;-).

I'm a bit wary of introducing a new dependency at this point (and without consulting Richard), so I'll rather go with a slightly pimped version of Kevin's patch, i.e. let it accept any object that can be converted to (byte) strings or unicode, then convert it to UTF-8 encoded strings and return that. We can always let it use the kitchen module instead later on.
Comment 41 Nils Philippsen 2011-10-25 08:54:35 EDT
Here's the pimped version of _to_utf8() I plan to commit:

--- 8< ---
def _to_utf8(txt):
    '''convert practically anything to a utf-8-encoded byte string'''

    # convert to unicode object
    if isinstance(txt, str):
        txt = txt.decode('utf-8', errors='replace')
    if not isinstance(txt, basestring):
        # try to convert non-string objects like exceptions
        try:
            # if txt.__unicode__() exists, or txt.__str__() returns ASCII
            txt = unicode(txt)
        except UnicodeDecodeError:
            # if txt.__str__() exists
            txt = str(txt).decode('utf-8', errors='replace')
        except:
            # no __str__(), __unicode__() methods, use representation
            txt = unicode(repr(txt))

    # return encoded as UTF-8
    return txt.encode('utf-8', errors='replace')
--- >8 ---

I've noticed that the packagekit python module also uses unicode objects instead of UTF-8 encoded strings, but for the moment I'll assume this is in order and works.
Comment 42 Kevin Kofler 2011-10-25 10:21:04 EDT
I still don't understand why we need to convert strings which are already UTF-8 byte strings to Unicode first, only to convert them back to UTF-8 right afterwards. (For strings which are in some non-UTF-8 byte encoding, txt.decode('utf-8', errors='replace') won't work anyway.)
Comment 43 Nils Philippsen 2011-10-25 10:58:14 EDT
(In reply to comment #42)
> I still don't understand why we need to convert strings which are already UTF-8
> byte strings to Unicode first, only to convert them back to UTF-8 right
> afterwards. (For strings which are in some non-UTF-8 byte encoding,
> txt.decode('utf-8', errors='replace') won't work anyway.)

At the beginning of the function, we don't know whether an str object is encoded in UTF-8 or something else. Attempting to decode it as UTF-8 and replacing characters which aren't with those funny question marks is really the best thing we can do at this point with the information we have.

I'd rather not "optimize" that function as you seem to suggest -- if we'd simply return the same str object it might be still encoded in something else than UTF-8, decoding it to unicode (with errors='replace') and re-encoding it again ensures that regardless what we give back, it's UTF-8.
Comment 44 Nils Philippsen 2011-10-25 12:08:02 EDT
Hmm, it seems simply using UTF-8 encoded strs everywhere breaks other stuff (which expects unicode objects). Need to find all places where the python side passes on string data to the C-side and convert there.
Comment 45 Nils Philippsen 2011-10-25 13:16:29 EDT
Having discussed this with Toshio on IRC, I think an easier way might be to use or copy kitchen.text.converters.getwriter() and wrap sys.stdout with it in yumBackend. I'll work on that tomorrow.
Comment 46 Adam Williamson 2011-10-25 14:47:06 EDT
We really could do with this before tomorrow. *Today* is the scheduled day for the RC compose. We need all blockers addressed to do the RC compose. Tomorrow is delaying the schedule. thanks!
Comment 47 Adam Williamson 2011-10-26 17:41:16 EDT
nils: any news on this today? Thanks!
Comment 48 Nils Philippsen 2011-10-27 06:41:46 EDT
Sorry for the silence. Unfortunately I ran into problems yesterday which took me the whole day to tackle. I seem to have found a solution in the evening -- I saw some pretty umlauts in kpackagekit instead of question marks -- but by that time my brain felt like a bowl of grits and I wasn't confident enough to commit and build.

Meanwhile I've done a few tests on F-15 and F-16 and I'll build patched packages shortly.
Comment 49 Fedora Update System 2011-10-27 07:11:22 EDT
PackageKit-0.6.19-3.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/PackageKit-0.6.19-3.fc16
Comment 50 Alexei Panov 2011-10-27 08:47:53 EDT
Update from comment #49 was fixed this bug.
I've added karma.
Thanks!
Comment 51 Adam Williamson 2011-10-27 11:56:30 EDT
Can people please check it doesn't regress GNOME too?



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers
Comment 52 Alexei Panov 2011-10-27 12:10:58 EDT
I'm not sure. But my gpk-application with new PackageKit packages works fine.
Groups of packages shown correct (screenshot with LANG=ru_RU.UTF-8 http://itmages.ru/image/view/314379/5440ea34).
I try install, remove packages. All works fine.
Comment 53 Adam Williamson 2011-10-28 14:50:50 EDT
The fix should be in TC3:

http://dl.fedoraproject.org/pub/alt/stage/16.TC3/

if people can check and verify TC3 behaves correctly, we can set this to VERIFIED. thanks!
Comment 54 Adam Williamson 2011-10-28 18:56:00 EDT
the bug has two reports that the update fixes the bug, and the update is in TC3, so I'm going to set this VERIFIED.
Comment 55 Fedora Update System 2011-10-29 02:30:59 EDT
PackageKit-0.6.19-3.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 56 Kevin Kofler 2011-10-29 15:56:09 EDT
*** Bug 750001 has been marked as a duplicate of this bug. ***