Bug 1693751 - rpm.labelCompare expects strings, but RPM headers are bytes
Summary: rpm.labelCompare expects strings, but RPM headers are bytes
Status: MODIFIED
Alias: None
Product: Fedora
Classification: Fedora
Component: rpm
Version: rawhide
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Panu Matilainen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Keywords:
Depends On: 1693762 1693766 1693768 1693771 1693773 1693774 1693787 1722868 1693759 1693760 1693767 1693788 1713107
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-03-28 14:57 UTC by Panu Matilainen
Modified: 2019-07-01 08:48 UTC (History)
15 users (show)

(edit)
Clone Of:
(edit)
Last Closed:


Attachments (Terms of Use)

Description Panu Matilainen 2019-03-28 14:57:45 UTC
Description of problem:
With Python 3, the contents of RPM headers are returned as bytes and thus cannot be fed back to functions like rpm.labelCompare which expects strings.

Version-Release number of selected component (if applicable):
All released versions of rpm, including python3-rpm-4.14.2

How reproducible:
always

Steps to Reproduce:
import rpm
required_version = ('0', '4.8.0', '1')
transaction_set = rpm.TransactionSet()
db_result = transaction_set.dbMatch('name', 'rpm')
package = list(db_result)[0]
print(rpm.labelCompare((package['epoch'], package['version'], package['release']), required_version))

Actual results:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: argument 1, item 1 must be str or None, not bytes

Expected results:
the result of labelCompare() is printed

Additional info:
rpm.labelCompare() is just the tip of the iceberg, this type mismatch is present all over the rpm-python API and makes the bindings unusable and fundamentally incompatible with all the rpm-related scripts ever written.

This has been fixed upstream by making the python3 bindings return ALL string data as surrogate-escaped utf-8 string objects to make things "just work" for the normal case while allowing non-utf8 data to still be handled. For further rationale see the upstream commit message:

https://github.com/rpm-software-management/rpm/commit/84920f898315d09a57a3f1067433eaeb7de5e830 

This is the Fedora side counterpart of bug 1631292, to be used as a tracking bug for known incompatibilities and other possible blockers for making this change.

Comment 1 Panu Matilainen 2019-04-10 09:29:31 UTC
This change is now active in rawhide as of rpm-4.14.2.1-6.fc31.

As a temporary crutch to allow existing users to migrate to the new behavior, the strings returned by rpm will have a fake .decode() method attached that will issue the following warning pointing to this bug if used:

    UnicodeWarning: decode() called on unicode string, see https://bugzilla.redhat.com/show_bug.cgi?id=1693751

The evil crutch will be removed once critical users have been adapted to the new behavior. Thank you for your attention.

Comment 2 Jan Pazdziora 2019-04-15 10:09:45 UTC
Thanks for the warning message. It does not however seem to catch

  p_nevra = b"%s-%s-%s.%s" % (p["name"], p["version"], p["release"], p["arch"])

which failes with

  TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

Should the compatibility layer implement __bytes__ as well?

Comment 3 Jan Pazdziora 2019-04-15 10:10:59 UTC
I also wonder how decoding to str already in the python3-rpm will affect working with ancient rpms that may have ISO-8859-1 values, or in general random content there.

Comment 4 Panu Matilainen 2019-04-15 11:06:01 UTC
The returned string is surrogate-escaped so it can handle fairly arbitrary data, and in the unlikely event that you actually know the original encoding you *can* get back to that:

>>> import rpm
>>> str = u'älämölö'
>>> b = str.encode('iso-8859-1')
>>> b
b'\xe4l\xe4m\xf6l\xf6'
>>> h = rpm.hdr()
>>> h['group'] = b
>>> d = h['group']
>>> d
'\udce4l\udce4m\udcf6l\udcf6'
>>> t = bytes(d, 'utf-8', 'surrogateescape')
>>> t
b'\xe4l\xe4m\xf6l\xf6'
>>> t.decode('iso-8859-1')
'älämölö'

But in reality, unless the header has an explicit 'encoding' tag there's no way to know what that encoding is, and the only value that the encoding tag can be is 'utf-8' in packages that have been validated to consist only of utf-8 strings. In practise all API users I've looked into have just had hardcoded .decode('utf-8') calls in place anyway so they'd fail anyway.

As for __bytes__, we'd prefer keeping the evil compat trickery to absolute bare minimum, both in content- and time-wise. But lets see how if goes, if that's actually a common pattern then maybe it makes sense to support it too.

Comment 5 Pavel Raiskup 2019-04-19 17:32:06 UTC
JFTR, those who so far expected that rpm returns bytes are probably
already doing `header.decode('utf-8')`.  But since rpm changed this to
return 'str', any attempt to `str.decode('utf-8')` raise error for them.

So it will likely mean python traceback for anyone who used this interface
before (example [1] against python-rpkg, detected in [2]).

[1] https://pagure.io/rpkg/pull-request/439
[2] https://pagure.io/copr/copr/issue/677

Comment 6 Miro Hrončok 2019-04-22 11:10:43 UTC
The warning workaround is somewhat naive. This is the problem we are hitting in FedoraReview:

https://pagure.io/FedoraReview/issue/356

Here comes the problem:

>>> import rpm
>>> required_version = ('0', '4.8.0', '1')
>>> transaction_set = rpm.TransactionSet()
>>> db_result = transaction_set.dbMatch('name', 'rpm')
>>> package = list(db_result)[0]

package['version'] is now unicode string:

>>> package['version']
'4.14.2.1'

But the decoding is possible, with warning:


>>> package['version'].decode('utf-8')
__main__:1: UnicodeWarning: decode() called on unicode string, see https://bugzilla.redhat.com/show_bug.cgi?id=1693751
'4.14.2.1'


However anything else is still a problem:

>>> package['version'].split(b'.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be str or None, not bytes




I like the change to Unicode in general, however the "backwards compatibility hack" is not complete and this will break things.
This is a serious backward incompatible API change done in a minor version bump.

I for example have no idea how to write FedoraReview code that is both Fedora 29 and Fedora 31 compatible.
One idea is to write wrappers everywhere or to always decode (with the warning) first, both are horrible.


Is there a flag I can use to get this behavior on older Fedoras? See for example how python-ldap handles this:

https://www.python-ldap.org/en/latest/bytes_mode.html#the-bytes-mode

CCing Petr who might give more insight about more backwards compatible ways of handling this.

Comment 7 Miro Hrončok 2019-04-22 11:34:02 UTC
Other code that will break, comparisons:

  archs[0].lower() == b"noarch"

Comment 8 Robert-André Mauchin 2019-04-22 13:49:27 UTC
This breaks fedpkg too, I can't import srpm for example.

Comment 9 Panu Matilainen 2019-04-23 12:17:43 UTC
This is of course a drastic change and not something you'd even dream of doing outside major version bump normally, but in this case the API has been buggy all along and the change is needed to set things straight. There would be other ways to "fix" the API, but this is the only thing that makes sense going forward.

And sure the compat hack is incomplete, there's no way to make such a change work everywhere. The .decode() hack just happens to make a whole bunch of users "just work" with the warning, which is why it's there, and there are a thing or two that could be done to further help immediate compatibility, BUT: since these hacks also *introduce* some problems with compatibility detection, we'd like to get rid of it as soon as at all possible.

We could add some sort of flag for enabling headers to return bytes again. We could (and would like to) also introduce these things in older Fedora releases, just need to figure out how to go about it. Doing this in rawhide now is bit of a "see what breaks" exercise because there's no way to find out just by looking at sources (which I did)

Comment 10 Miro Hrončok 2019-04-23 12:27:25 UTC
I'd argue that rawhide should not be the place to land change unannounced and wait what happens.
IMHO this deserves some kind of proper communication - at last a HEADS UP e-mail to devel-announce, if not a Fedora 31 change proposal baked by FESCo.
FedoraReview didn't even get a depends on bugzilla report like some others did.


Either way, rather than having a flag that enables bytes behavior on rawhide, I'd appreciate a flag on EL7, Fedora 29 and Fedora 30 that can give me the new behavior, so we can finally stop using the (indeed horrible) bytest API. Is that reasonable to expect? How can I help with the "figure out how to go about it" part?

Comment 11 Panu Matilainen 2019-04-30 05:48:45 UTC
(In reply to Miro Hrončok from comment #10)
> I'd argue that rawhide should not be the place to land change unannounced
> and wait what happens.
> IMHO this deserves some kind of proper communication - at last a HEADS UP
> e-mail to devel-announce, if not a Fedora 31 change proposal baked by FESCo.
> FedoraReview didn't even get a depends on bugzilla report like some others
> did.

I generally agree about communicating such changes but this is ultimately a bugfix in a dark corner that few people use in practise, and those few affected were supposed to get a heads-up via the bug. Apologies for missing fedora-review, not sure what happened there, maybe accidentally skipped during mass-filing.

Anyway, this will soon get a wider audience with rpm 4.15 change proposal, consider this a test-drive for the feature: we're in fairly good shape now in that some of the most critical bits (anaconda and mock) have already been adjusted to the new behavior.

> Either way, rather than having a flag that enables bytes behavior on
> rawhide, I'd appreciate a flag on EL7, Fedora 29 and Fedora 30 that can give
> me the new behavior, so we can finally stop using the (indeed horrible)
> bytest API. Is that reasonable to expect? 

Landing changes into Fedora is not a problem (and planned, in a vague manner), EL7 is a different story entirely.

> How can I help with the "figure out how to go about it" part?

The devil is in the details.

If we add a forward-compat flag to older releases now, seems to me it actually just complicates the situation for callers because we dont really want such a flag in upstream so the presence (or lack of thereof) the flag doesn't mean a thing, and then it's also unlikely that we can introduce such a change into EL7 so people would still need to deal with both API styles. Not to mention all the other rpm versions in the wild (Suse, Mageia etc) - those broken Py3 bindings will be haunting people for quite some time still no matter what.

Concrete ideas on how to make this somehow sensible for the API users are most certainly welcome.

Comment 12 Miro Hrončok 2019-04-30 09:06:24 UTC
> Landing changes into Fedora is not a problem (and planned, in a vague manner), EL7 is a different story entirely.

If this is patchable on the EPEL python3-rpm package level (without the need to patch the RHEL7 rpm package itself), it should be good.

Comment 13 Panu Matilainen 2019-04-30 09:22:59 UTC
Ahem. Right. There is no python3-rpm in EL7 itself :D
In that case, no problem.

Comment 14 Miro Hrončok 2019-04-30 09:57:22 UTC
As a selfish maintainer of Fedroa/EPEL only software, this is the pseudo API I'd like to have everywhere:

    rpm.TransactionSet(text_type=str/bytes/(not set))


This is what the API would do:


On Fedora 30 and earlier:

 no text_type supplied: fallbacks to bytes, no change

 text_type=str: get's the new behavior, but no str.decode() method

 text_type=bytes: maintains the previous behavior


On Fedora 31 and newer:

 no text_type supplied: fallbacks to unicode strings with your custom str.decode() method (possibly fallbacks to bytes until 4.15 with DeprecationWarning)

 text_type=str: get's the new behavior, but no str.decode() method

 text_type=bytes: maintains the previous behavior, with DeprecationWarning

Comment 15 Panu Matilainen 2019-04-30 11:17:00 UTC
Thanks for the feedback. However the transaction set is not the right place, it has nothing to do with the data really and more importantly, headers do exist outside transaction sets. 

It'd have to be either per-header (which seems really painful), or a global module-level thing that only affects data returned from headers. So it'd probably have to be a "rpm.header_string_type" kind of thing with defaults similar to what you describe,  but it's still ugly and something upstream we wouldn't want upstream at all because it doesn't really make any sense there (having a real bytes-mode for the whole api might make some sense, but that'd be quite different from what the existing broken strings-as-bytes behavior is and wouldn't help compatibility)

Comment 16 Miro Hrončok 2019-04-30 11:24:21 UTC
> headers do exist outside transaction sets

I wasn't sure about this, although it crossed my mind.


> It'd have to be either per-header (which seems really painful)

Very painful.


> So it'd probably have to be a "rpm.header_string_type" kind of thing

Probably yes.


> something we wouldn't want upstream at all because it doesn't really make any sense there

I cannot help with that, ultimately, that's your choice.

Comment 17 Miro Hrončok 2019-05-27 08:42:25 UTC
Panu, what is the supported way to detect this feature early (e.g. without data)?

I want to do:


if <rpm returns unicode strings>:
    spec = rpm.spec
    hdr = rpm.hdr
else:
    class MyUnicodeSpecWrapper(rpm.spec):
        ...

    class MyUnicodeHdrWrapper(rpm.hdr):
        ...

    spec = MyUnicodeSpecWrapper
    hdr = MyUnicodeHdrWrapper

Comment 18 Panu Matilainen 2019-05-28 06:46:44 UTC
This will return True on versions where strings are strings and False on the broken python3-versions returning strings as bytes:

def rpm_is_sane():
    s = "aaa"
    h = rpm.hdr()
    h['name'] = s
    return (h['name'] == s)

This should work on any rpm version >= 4.8.0, if you need to support older, catch the TypeError from header instance creation and return True in that case (those old versions are limited to python2)

Comment 19 Panu Matilainen 2019-05-28 06:48:33 UTC
Oh and FWIW, once we have rpm 4.15 in rawhide I'll try to deal with bringing this to stable versions with some sort of compat flag. Details unknown for the time being, haven't had the bandwidth to think about it so far.

Comment 20 Miro Hrončok 2019-05-28 10:53:08 UTC
Thanks.


Note You need to log in before you can comment on or make changes to this bug.