Bug 1684004 - UnicodeError in inspect_list_applications2
Summary: UnicodeError in inspect_list_applications2
Keywords:
Status: NEW
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libguestfs
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Richard W.M. Jones
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-02-28 09:22 UTC by Sam
Modified: 2020-05-05 21:34 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:


Description Sam 2019-02-28 09:22:47 UTC
Description of problem:

When running inspect_list_applications2 on SLES-11-SP4, it fails with:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 627: invalid continuation byte


Version-Release number of selected component (if applicable):

Libguestfs version 1.38.2
Libguestfs-Python 1.40.2


How reproducible:
100%


Steps to Reproduce:
1. Install standard SLES-11-SP4
2. inspect_list_applications2 on the root

Actual results:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 627: invalid continuation byte


Expected results:
List of applications


Additional info:

inspect_list_applications2 fails at:
* structs.c autogenerated file
* * guestfs_int_py_put_application2
* * * guestfs_int_py_fromstring (application2->app2_description);

On the following installed app:
PackageKit
http://www.polarhome.com/service/man/?qf=packagekitd&tf=2&of=SuSE&sf=8

The following unicode sequence is giving utf-8 a problem:
'''
Backend: pisi
	       S.A~Xağlar Onur <caglar.tr>
'''
Notice the accented 'G'
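
The failure can be reproduced outside libguestfs with a minimal Python sketch. The byte string below is an assumption: it is a guess at the mixed-encoding name (a single-byte 0xc7 followed by valid UTF-8), not the actual bytes read from the SLES package database, so the reported position differs from the 627 above.

```python
# Hypothetical sample: 0xc7 is "Ç" as a single Latin-1/Latin-5 byte, while
# 0xc4 0x9f is the UTF-8 encoding of "ğ". The real description may differ.
raw = b"S.\xc7a\xc4\x9flar Onur"


def try_strict_decode(data):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, exc


text, err = try_strict_decode(raw)
if err is not None:
    # Prints the codec's own message, including the offending byte.
    print(err)
```

In UTF-8, 0xc7 announces a two-byte sequence, and the plain 'a' that follows is not a continuation byte, which would explain the wording of the error.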

There are a number of ways to fix this:
1. If any of the functions called by inspect_list_applications2 fails, do not fail the entire function
2. Return the description as bytes, not as a string
3. This is what I did: replace PyUnicode_FromString(str) with PyUnicode_DecodeLocale(str, "surrogateescape") in handle.c. It escapes the invalid bytes but still returns a string as intended. The output now contains "S.�ağlar Onur" instead.
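
The three options can be compared with a rough Python-level sketch. This is only an analogue of what the C bindings would do; the function names and the sample bytes are hypothetical and not part of libguestfs.

```python
# Hypothetical sample bytes: a stray 0xc7 followed by valid UTF-8.
raw = b"S.\xc7a\xc4\x9flar Onur"


def decode_each(descriptions):
    """Option 1: a bad entry yields None instead of failing the whole list."""
    result = []
    for data in descriptions:
        try:
            result.append(data.decode("utf-8"))
        except UnicodeDecodeError:
            result.append(None)
    return result


def describe_as_bytes(data):
    """Option 2: hand the raw bytes to the caller and let it decide."""
    return data


def describe_with_surrogateescape(data):
    """Option 3 analogue: undecodable bytes become lone surrogates."""
    return data.decode("utf-8", errors="surrogateescape")


escaped = describe_with_surrogateescape(raw)
# ascii() makes the escaped byte visible instead of relying on the terminal.
print(ascii(escaped))
```

Option 3 keeps every valid character and only the stray byte is escaped, whereas option 1 as sketched here drops the whole description.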

Thoughts?

Sam

Comment 1 Sam 2019-03-14 12:13:36 UTC
?

Comment 2 Richard W.M. Jones 2019-04-09 08:14:33 UTC
Python bindings are broken with respect to handling improperly encoded strings.
There is another BZ about this but I can't find it right now ...

Comment 3 Richard W.M. Jones 2019-04-09 08:15:41 UTC
Here we go:
https://bugzilla.redhat.com/show_bug.cgi?id=1661871

See if the fix for that fixes this issue.

Comment 4 Pino Toscano 2019-04-09 08:28:34 UTC
(In reply to Richard W.M. Jones from comment #3)
> Here we go:
> https://bugzilla.redhat.com/show_bug.cgi?id=1661871

This was a different issue.

See this old series:
https://www.redhat.com/archives/libguestfs/2017-May/msg00076.html
(v1 is no longer needed, https://github.com/libguestfs/libguestfs/commit/0ee02e0117527b86a31b2a88a14994ce7f15571f was a better fix for it)

Comment 5 Sam 2019-04-09 11:44:15 UTC
This has nothing to do with PyBytes_FromStringAndSize.

The problem is that PyUnicode_FromString (which is the correct function to call in python3) fails on a package description string that contains invalid unicode (actually utf-8) symbols. I used PyUnicode_DecodeLocale(str, "surrogateescape") as a temporary fix.

Sam

Comment 6 Sam 2019-04-09 11:53:18 UTC
More on the issue of "surrogateescape":

https://www.python.org/dev/peps/pep-0383/
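
A short sketch of the PEP 383 behaviour, using a hypothetical sample string rather than the real package description: the escaped string round-trips back to the original bytes, but it cannot be encoded strictly, which callers of the bindings would need to be aware of.

```python
# Hypothetical sample bytes: a stray 0xc7 followed by valid UTF-8.
raw = b"S.\xc7a\xc4\x9flar Onur"

# Decoding maps the undecodable byte 0xc7 to the lone surrogate U+DCC7.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original byte sequence.
restored = text.encode("utf-8", errors="surrogateescape")


def strict_encode_fails(s):
    """Report whether a strict UTF-8 encode rejects the string."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


print(restored == raw, strict_encode_fails(text))
```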

Sam

Comment 7 Sam Eiderman 2019-09-08 08:43:24 UTC
Gentle ping

(new email)

Comment 8 Richard W.M. Jones 2019-09-08 10:42:26 UTC
Nothing's really going to happen until someone submits a fix for
the Python bindings string handling.

Comment 9 Sam Eiderman 2019-09-08 11:47:05 UTC
(In reply to Richard W.M. Jones from comment #8)
> Nothing's really going to happen until someone submits a fix for
> the Python bindings string handling.

True,

But what do you think the fix should be?
Should we treat app2_description as a utf8 string (even when it is not; it can be latin1 or anything else) and use PyUnicode_DecodeLocale(str, "surrogateescape") to escape the errors?
Or should we change app2_description to FBytes in generator/structs.ml and treat it as bytes?

In my opinion we should treat the description as utf8 and use surrogateescape, since the description should be a utf8 string and the bug is actually in PackageKit.
(As can be seen, they fixed it in recent versions: https://github.com/hughsie/PackageKit/commit/7a92f842830e3ea9122463fe279f0b42150cbd63)

Sam

Comment 10 Sam Eiderman 2019-09-08 11:53:45 UTC
(In reply to Sam Eiderman from comment #9)
> (In reply to Richard W.M. Jones from comment #8)
> > Nothing's really going to happen until someone submits a fix for
> > the Python bindings string handling.
> 
> True,
> 
> But what do you think the fix should be?
> Treat app2_description as utf8 string (even when it is not, can be latin1 or
> anything else) and use PyUnicode_DecodeLocale(str, "surrogateescape") - to
> skip errors.
> Or should we change app2_description to FBytes in generator/structs.ml - and
> treat it as bytes?
> 
> In my opinion we should treat the description as utf8 and use
> surrogateescape, since the description should be a utf8 string and the bug
> is actually in PackageKit.
> (As can be seen, they fixed it in recent versions:
> https://github.com/hughsie/PackageKit/commit/
> 7a92f842830e3ea9122463fe279f0b42150cbd63)
> 
> Sam

Sorry, I think PyUnicode_DecodeLocale isn't appropriate (it uses the current locale instead of utf-8). Maybe we should just return an empty string if PyUnicode_FromString fails?
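
The two locale-independent alternatives could look roughly like this at the Python level. It is a sketch with hypothetical function names, not the bindings code; in C, PyUnicode_DecodeUTF8 takes an errors argument and might serve the same purpose as the second function, though nobody in this thread has proposed or tested that.

```python
# Hypothetical sample bytes: a stray 0xc7 followed by valid UTF-8.
raw = b"S.\xc7a\xc4\x9flar Onur"


def description_or_empty(data):
    """Fallback from this comment: give up on the field, not on the call."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return ""


def description_utf8_escaped(data):
    """Always decode as UTF-8, regardless of locale, escaping bad bytes."""
    return data.decode("utf-8", errors="surrogateescape")


print(repr(description_or_empty(raw)))
print(ascii(description_utf8_escaped(raw)))
```

The empty-string fallback is the simplest to reason about but discards the whole description. The escaped variant keeps the readable part at the cost of returning a string that cannot be encoded strictly.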

