Description of problem:
When running inspect_list_applications2 on SLES-11-SP4, it fails with:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 627: invalid continuation byte

Version-Release number of selected component (if applicable):
Libguestfs version 1.38.2
Libguestfs-Python 1.40.2

How reproducible:
100%

Steps to Reproduce:
1. Install standard SLES-11-SP4
2. Run inspect_list_applications2 on the root

Actual results:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 627: invalid continuation byte

Expected results:
List of applications

Additional info:
inspect_list_applications2 fails at:
* structs.c autogenerated file
* * guestfs_int_py_put_application2
* * * guestfs_int_py_fromstring (application2->app2_description);

On the following installed app: PackageKit
http://www.polarhome.com/service/man/?qf=packagekitd&tf=2&of=SuSE&sf=8

The following sequence in the description cannot be decoded as UTF-8:
'''
Backend: pisi S.A~Xağlar Onur <caglar.tr>
'''
Notice the accented 'G'

There are a number of ways to fix this:
1. If any of the functions in inspect_list_applications2 fail, do not fail the entire function.
2. Return the description as bytes, not as a string.
3. This is what I did: replace PyUnicode_FromString(str) with PyUnicode_DecodeLocale(str, "surrogateescape") in handle.c. It escapes invalid characters but keeps the result as a UTF-8 string, as intended. The output now contains "S.�ağlar Onur" instead.

Thoughts?

Sam
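For readers who want to see the failure outside libguestfs, here is a minimal Python sketch of the same decoding behaviour. The byte string `RAW` is an assumption reconstructed from the report (a lone 0xc7 byte followed by valid UTF-8), not the actual package description, and the helper names are hypothetical. It only illustrates how strict UTF-8 decoding differs from the "surrogateescape" error handler.

```python
# Hypothetical sample: bytes modelled on the report, NOT the real RPM data.
# 0xc7 is a lone Latin-style byte; 0xc4 0x9f is a valid UTF-8 sequence.
RAW = b"Backend: pisi S.\xc7a\xc4\x9flar Onur <caglar.tr>"


def decode_strict(raw: bytes) -> str:
    """Strict UTF-8 decoding; raises UnicodeDecodeError on invalid input."""
    return raw.decode("utf-8")


def decode_escaped(raw: bytes) -> str:
    """Map each undecodable byte to a lone surrogate (PEP 383)."""
    return raw.decode("utf-8", errors="surrogateescape")


try:
    decode_strict(RAW)
except UnicodeDecodeError as exc:
    print(f"strict decode failed: {exc}")

escaped = decode_escaped(RAW)
# ascii() keeps the output printable on any terminal encoding.
print(ascii(escaped))

# The escaped string round-trips back to the original bytes.
assert escaped.encode("utf-8", errors="surrogateescape") == RAW
```

With this handler the invalid byte is preserved as a surrogate code point rather than dropped, so no information is lost.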
?
Python bindings are broken with respect to handling improperly encoded strings. There is another BZ about this but I can't find it right now ...
Here we go: https://bugzilla.redhat.com/show_bug.cgi?id=1661871 See if the fix for that fixes this issue.
(In reply to Richard W.M. Jones from comment #3)
> Here we go:
> https://bugzilla.redhat.com/show_bug.cgi?id=1661871

This was a different issue. See this old series:
https://www.redhat.com/archives/libguestfs/2017-May/msg00076.html
(v1 is no longer needed; https://github.com/libguestfs/libguestfs/commit/0ee02e0117527b86a31b2a88a14994ce7f15571f was a better fix for it)
This has nothing to do with PyBytes_FromStringAndSize. The problem is that PyUnicode_FromString (which is the correct function to call in Python 3) fails on a package description string that contains bytes that are not valid UTF-8. I used PyUnicode_DecodeLocale(str, "surrogateescape") as a temporary fix.

Sam
More on the issue of "surrogateescape": https://www.python.org/dev/peps/pep-0383/

Sam
Gentle ping (new email)
Nothing's really going to happen until someone submits a fix for the Python bindings string handling.
(In reply to Richard W.M. Jones from comment #8)
> Nothing's really going to happen until someone submits a fix for
> the Python bindings string handling.

True, but what do you think the fix should be?

Treat app2_description as a UTF-8 string (even when it is not; it can be Latin-1 or anything else) and use PyUnicode_DecodeLocale(str, "surrogateescape") to skip errors? Or should we change app2_description to FBytes in generator/structs.ml and treat it as bytes?

In my opinion we should treat the description as UTF-8 and use surrogateescape, since the description should be a UTF-8 string and the bug is actually in PackageKit. (As can be seen, they fixed it in recent versions: https://github.com/hughsie/PackageKit/commit/7a92f842830e3ea9122463fe279f0b42150cbd63)

Sam
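To make the trade-off between the two options concrete, here is a rough Python sketch of what a caller would have to do in each case. The sample bytes and helper names are hypothetical and only modelled on the description in this report. It suggests that a bytes field pushes the decoding decision onto the caller, while a surrogate-escaped string decodes without error but cannot later be encoded as strict UTF-8.

```python
# Hypothetical bytes modelled on the description in this report.
RAW = b"S.\xc7a\xc4\x9flar Onur"


def display_from_bytes(raw: bytes) -> str:
    """'Bytes' option: the caller decodes leniently, e.g. for display."""
    return raw.decode("utf-8", errors="replace")


def is_strictly_encodable(text: str) -> bool:
    """Return True if text can be written out as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# 'String' option: the binding returns a surrogate-escaped str.
escaped = RAW.decode("utf-8", errors="surrogateescape")

print(ascii(display_from_bytes(RAW)))
print(is_strictly_encodable(escaped))
```

Under these assumptions, callers of the string variant would need to encode with errors="surrogateescape" (or "replace") before writing the description to a UTF-8 file or terminal.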
(In reply to Sam Eiderman from comment #9)
> (In reply to Richard W.M. Jones from comment #8)
> > Nothing's really going to happen until someone submits a fix for
> > the Python bindings string handling.
>
> True,
>
> But what do you think the fix should be?
> Treat app2_description as utf8 string (even when it is not, can be latin1 or
> anything else) and use PyUnicode_DecodeLocale(str, "surrogateescape") - to
> skip errors.
> Or should we change app2_description to FBytes in generator/structs.ml - and
> treat it as bytes?
>
> In my opinion we should treat the description as utf8 and use
> surrogateescape, since the description should be a utf8 string and the bug
> is actually in PackageKit.
> (As can be seen, they fixed it in recent versions:
> https://github.com/hughsie/PackageKit/commit/
> 7a92f842830e3ea9122463fe279f0b42150cbd63)
>
> Sam

Sorry, I think PyUnicode_DecodeLocale isn't appropriate (it uses the current locale's encoding instead of UTF-8). Maybe we should just return an empty string if PyUnicode_FromString fails?
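The locale concern and the empty-string fallback can both be sketched in Python. The bytes and function names below are hypothetical, and passing an explicit encoding stands in for "whatever the current locale happens to be". In the C API, PyUnicode_DecodeUTF8(str, size, "surrogateescape") looks like the locale-independent counterpart, though that call was not proposed in this thread and is mentioned here only as an assumption.

```python
# Hypothetical bytes modelled on the description in this report.
RAW = b"S.\xc7a\xc4\x9flar Onur"


def decode_with(raw: bytes, encoding: str) -> str:
    """Decode with an explicit encoding, standing in for a locale setting."""
    return raw.decode(encoding, errors="surrogateescape")


def decode_or_empty(raw: bytes) -> str:
    """Fallback variant: return '' when the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return ""


# The same bytes yield different strings depending on the encoding used,
# which is why a locale-dependent decode gives unpredictable results.
print(ascii(decode_with(RAW, "utf-8")))
print(ascii(decode_with(RAW, "latin-1")))

# The fallback never raises, but discards the whole description.
print(repr(decode_or_empty(RAW)))
```

The empty-string fallback is the simplest to reason about, at the cost of losing the description entirely for the affected package.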