Bug 1779194 - rpm.labelCompare expects strings, but RPM headers are bytes [NEEDINFO]
Summary: rpm.labelCompare expects strings, but RPM headers are bytes
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-dateutil
Version: 15.0 (Stein)
Hardware: x86_64
OS: All
low
low
Target Milestone: z2
: 15.0 (Stein)
Assignee: Lon Hohberger
QA Contact: nlevinki
URL:
Whiteboard:
Depends On: 1693751 1693762 1693766 1693768 1693773 1722868 1693759 1693760 1693767 1693771 1693774 1693787 1693788 1713107
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-12-03 13:47 UTC by David Hill
Modified: 2020-06-24 04:53 UTC (History)
24 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1693751
Environment:
Last Closed:
Target Upstream Version:
dbodnarc: needinfo? (lhh)


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4629561 None None None 2019-12-03 14:08:02 UTC

Description David Hill 2019-12-03 13:47:29 UTC
+++ This bug was initially created as a clone of Bug #1693751 +++

Description of problem:
With Python 3, the contents of RPM headers are returned as bytes and thus cannot be fed back to functions like rpm.labelCompare which expects strings.

Version-Release number of selected component (if applicable):
All released versions of rpm, including python3-rpm-4.14.2

How reproducible:
always

Steps to Reproduce:
import rpm
required_version = ('0', '4.8.0', '1')
transaction_set = rpm.TransactionSet()
db_result = transaction_set.dbMatch('name', 'rpm')
package = list(db_result)[0]
print(rpm.labelCompare((package['epoch'], package['version'], package['release']), required_version))

Actual results:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: argument 1, item 1 must be str or None, not bytes

Expected results:
the result of labelCompare() is printed

Additional info:
rpm.labelCompare() is just the tip of the iceberg, this type mismatch is present all over the rpm-python API and makes the bindings unusable and fundamentally incompatible with all the rpm-related scripts ever written.

This has been fixed upstream by making the python3 bindings return ALL string data as surrogate-escaped utf-8 string objects to make things "just work" for the normal case while allowing non-utf8 data to still be handled. For further rationale see the upstream commit message:

https://github.com/rpm-software-management/rpm/commit/84920f898315d09a57a3f1067433eaeb7de5e830 

This is the Fedora side counterpart of bug 1631292, to be used as a tracking bug for known incompatibilities and other possible blockers for making this change.

--- Additional comment from Panu Matilainen on 2019-04-10 09:29:31 UTC ---

This change is now active in rawhide as of rpm-4.14.2.1-6.fc31.

As a temporary crutch to allow existing users to migrate to the new behavior, the strings returned by rpm will have a fake .decode() method attached that will issue the following warning pointing to this bug if used:

    UnicodeWarning: decode() called on unicode string, see https://bugzilla.redhat.com/show_bug.cgi?id=1693751

The evil crutch will be removed once critical users have been adapted to the new behavior. Thank you for your attention.

--- Additional comment from Jan Pazdziora on 2019-04-15 10:09:45 UTC ---

Thanks for the warning message. It does not however seem to catch

  p_nevra = b"%s-%s-%s.%s" % (p["name"], p["version"], p["release"], p["arch"])

which failes with

  TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

Should the compatibility layer implement __bytes__ as well?

--- Additional comment from Jan Pazdziora on 2019-04-15 10:10:59 UTC ---

I also wonder how decoding to str already in the python3-rpm will affect working with ancient rpms that may have ISO-8859-1 values, or in general random content there.

--- Additional comment from Panu Matilainen on 2019-04-15 11:06:01 UTC ---

The returned string is surrogate-escaped so it can handle fairly arbitrary data, and in the unlikely event that you actually know the original encoding you *can* get back to that:

>>> import rpm
>>> str = u'älämölö'
>>> b = str.encode('iso-8859-1')
>>> b
b'\xe4l\xe4m\xf6l\xf6'
>>> h = rpm.hdr()
>>> h['group'] = b
>>> d = h['group']
>>> d
'\udce4l\udce4m\udcf6l\udcf6'
>>> t = bytes(d, 'utf-8', 'surrogateescape')
>>> t
b'\xe4l\xe4m\xf6l\xf6'
>>> t.decode('iso-8859-1')
'älämölö'

But in reality, unless the header has an explicit 'encoding' tag there's no way to know what that encoding is, and the only value that the encoding tag can be is 'utf-8' in packages that have been validated to consist only of utf-8 strings. In practise all API users I've looked into have just had hardcoded .decode('utf-8') calls in place anyway so they'd fail anyway.

As for __bytes__, we'd prefer keeping the evil compat trickery to absolute bare minimum, both in content- and time-wise. But lets see how if goes, if that's actually a common pattern then maybe it makes sense to support it too.

--- Additional comment from Pavel Raiskup on 2019-04-19 17:32:06 UTC ---

JFTR, those who so far expected that rpm returns bytes are probably
already doing `header.decode('utf-8')`.  But since rpm changed this to
return 'str', any attempt to `str.decode('utf-8')` raise error for them.

So it will likely mean python traceback for anyone who used this interface
before (example [1] against python-rpkg, detected in [2]).

[1] https://pagure.io/rpkg/pull-request/439
[2] https://pagure.io/copr/copr/issue/677

--- Additional comment from Miro Hrončok on 2019-04-22 11:10:43 UTC ---

The warning workaround is somewhat naive. This is the problem we are hitting in FedoraReview:

https://pagure.io/FedoraReview/issue/356

Here comes the problem:

>>> import rpm
>>> required_version = ('0', '4.8.0', '1')
>>> transaction_set = rpm.TransactionSet()
>>> db_result = transaction_set.dbMatch('name', 'rpm')
>>> package = list(db_result)[0]

package['version'] is now unicode string:

>>> package['version']
'4.14.2.1'

But the decoding is possible, with warning:


>>> package['version'].decode('utf-8')
__main__:1: UnicodeWarning: decode() called on unicode string, see https://bugzilla.redhat.com/show_bug.cgi?id=1693751
'4.14.2.1'


However anything else is still a problem:

>>> package['version'].split(b'.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be str or None, not bytes




I like the change to Unicode in general, however the "backwards compatibility hack" is not complete and this will break things.
This is a serious backward incompatible API change done in a minor version bump.

I for example have no idea how to write FedoraReview code that is both Fedora 29 and Fedora 31 compatible.
One idea is to write wrappers everywhere or to always decode (with the warning) first, both are horrible.


Is there a flag I can use to get this behavior on older Fedoras? See for example how python-ldap handles this:

https://www.python-ldap.org/en/latest/bytes_mode.html#the-bytes-mode

CCing Petr who might give more insight about more backwards compatible ways of handling this.

--- Additional comment from Miro Hrončok on 2019-04-22 11:34:02 UTC ---

Other code that will break, comparisons:

  archs[0].lower() == b"noarch"

--- Additional comment from Robert-André Mauchin on 2019-04-22 13:49:27 UTC ---

This breaks fedpkg too, I can't import srpm for example.

--- Additional comment from Panu Matilainen on 2019-04-23 12:17:43 UTC ---

This is of course a drastic change and not something you'd even dream of doing outside major version bump normally, but in this case the API has been buggy all along and the change is needed to set things straight. There would be other ways to "fix" the API, but this is the only thing that makes sense going forward.

And sure the compat hack is incomplete, there's no way to make such a change work everywhere. The .decode() hack just happens to make a whole bunch of users "just work" with the warning, which is why it's there, and there are a thing or two that could be done to further help immediate compatibility, BUT: since these hacks also *introduce* some problems with compatibility detection, we'd like to get rid of it as soon as at all possible.

We could add some sort of flag for enabling headers to return bytes again. We could (and would like to) also introduce these things in older Fedora releases, just need to figure out how to go about it. Doing this in rawhide now is bit of a "see what breaks" exercise because there's no way to find out just by looking at sources (which I did)

--- Additional comment from Miro Hrončok on 2019-04-23 12:27:25 UTC ---

I'd argue that rawhide should not be the place to land change unannounced and wait what happens.
IMHO this deserves some kind of proper communication - at last a HEADS UP e-mail to devel-announce, if not a Fedora 31 change proposal baked by FESCo.
FedoraReview didn't even get a depends on bugzilla report like some others did.


Either way, rather than having a flag that enables bytes behavior on rawhide, I'd appreciate a flag on EL7, Fedora 29 and Fedora 30 that can give me the new behavior, so we can finally stop using the (indeed horrible) bytest API. Is that reasonable to expect? How can I help with the "figure out how to go about it" part?

--- Additional comment from Panu Matilainen on 2019-04-30 05:48:45 UTC ---

(In reply to Miro Hrončok from comment #10)
> I'd argue that rawhide should not be the place to land change unannounced
> and wait what happens.
> IMHO this deserves some kind of proper communication - at last a HEADS UP
> e-mail to devel-announce, if not a Fedora 31 change proposal baked by FESCo.
> FedoraReview didn't even get a depends on bugzilla report like some others
> did.

I generally agree about communicating such changes but this is ultimately a bugfix in a dark corner that few people use in practise, and those few affected were supposed to get a heads-up via the bug. Apologies for missing fedora-review, not sure what happened there, maybe accidentally skipped during mass-filing.

Anyway, this will soon get a wider audience with rpm 4.15 change proposal, consider this a test-drive for the feature: we're in fairly good shape now in that some of the most critical bits (anaconda and mock) have already been adjusted to the new behavior.

> Either way, rather than having a flag that enables bytes behavior on
> rawhide, I'd appreciate a flag on EL7, Fedora 29 and Fedora 30 that can give
> me the new behavior, so we can finally stop using the (indeed horrible)
> bytest API. Is that reasonable to expect? 

Landing changes into Fedora is not a problem (and planned, in a vague manner), EL7 is a different story entirely.

> How can I help with the "figure out how to go about it" part?

The devil is in the details.

If we add a forward-compat flag to older releases now, seems to me it actually just complicates the situation for callers because we dont really want such a flag in upstream so the presence (or lack of thereof) the flag doesn't mean a thing, and then it's also unlikely that we can introduce such a change into EL7 so people would still need to deal with both API styles. Not to mention all the other rpm versions in the wild (Suse, Mageia etc) - those broken Py3 bindings will be haunting people for quite some time still no matter what.

Concrete ideas on how to make this somehow sensible for the API users are most certainly welcome.

--- Additional comment from Miro Hrončok on 2019-04-30 09:06:24 UTC ---

> Landing changes into Fedora is not a problem (and planned, in a vague manner), EL7 is a different story entirely.

If this is patchable on the EPEL python3-rpm package level (without the need to patch the RHEL7 rpm package itself), it should be good.

--- Additional comment from Panu Matilainen on 2019-04-30 09:22:59 UTC ---

Ahem. Right. There is no python3-rpm in EL7 itself :D
In that case, no problem.

--- Additional comment from Miro Hrončok on 2019-04-30 09:57:22 UTC ---

As a selfish maintainer of Fedroa/EPEL only software, this is the pseudo API I'd like to have everywhere:

    rpm.TransactionSet(text_type=str/bytes/(not set))


This is what the API would do:


On Fedora 30 and earlier:

 no text_type supplied: fallbacks to bytes, no change

 text_type=str: get's the new behavior, but no str.decode() method

 text_type=bytes: maintains the previous behavior


On Fedora 31 and newer:

 no text_type supplied: fallbacks to unicode strings with your custom str.decode() method (possibly fallbacks to bytes until 4.15 with DeprecationWarning)

 text_type=str: get's the new behavior, but no str.decode() method

 text_type=bytes: maintains the previous behavior, with DeprecationWarning

--- Additional comment from Panu Matilainen on 2019-04-30 11:17:00 UTC ---

Thanks for the feedback. However the transaction set is not the right place, it has nothing to do with the data really and more importantly, headers do exist outside transaction sets. 

It'd have to be either per-header (which seems really painful), or a global module-level thing that only affects data returned from headers. So it'd probably have to be a "rpm.header_string_type" kind of thing with defaults similar to what you describe,  but it's still ugly and something upstream we wouldn't want upstream at all because it doesn't really make any sense there (having a real bytes-mode for the whole api might make some sense, but that'd be quite different from what the existing broken strings-as-bytes behavior is and wouldn't help compatibility)

--- Additional comment from Miro Hrončok on 2019-04-30 11:24:21 UTC ---

> headers do exist outside transaction sets

I wasn't sure about this, although it crossed my mind.


> It'd have to be either per-header (which seems really painful)

Very painful.


> So it'd probably have to be a "rpm.header_string_type" kind of thing

Probably yes.


> something we wouldn't want upstream at all because it doesn't really make any sense there

I cannot help with that, ultimately, that's your choice.

--- Additional comment from Miro Hrončok on 2019-05-27 08:42:25 UTC ---

Panu, what is the supported way to detect this feature early (e.g. without data)?

I want to do:


if <rpm returns unicode strings>:
    spec = rpm.spec
    hdr = rpm.hdr
else:
    class MyUnicodeSpecWrapper(rpm.spec):
        ...

    class MyUnicodeHdrWrapper(rpm.hdr):
        ...

    spec = MyUnicodeSpecWrapper
    hdr = MyUnicodeHdrWrapper

--- Additional comment from Panu Matilainen on 2019-05-28 06:46:44 UTC ---

This will return True on versions where strings are strings and False on the broken python3-versions returning strings as bytes:

def rpm_is_sane():
    s = "aaa"
    h = rpm.hdr()
    h['name'] = s
    return (h['name'] == s)

This should work on any rpm version >= 4.8.0, if you need to support older, catch the TypeError from header instance creation and return True in that case (those old versions are limited to python2)

--- Additional comment from Panu Matilainen on 2019-05-28 06:48:33 UTC ---

Oh and FWIW, once we have rpm 4.15 in rawhide I'll try to deal with bringing this to stable versions with some sort of compat flag. Details unknown for the time being, haven't had the bandwidth to think about it so far.

--- Additional comment from Miro Hrončok on 2019-05-28 10:53:08 UTC ---

Thanks.

--- Additional comment from Ben Cotton on 2019-08-13 16:56:07 UTC ---

This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to '31'.

--- Additional comment from Ben Cotton on 2019-08-13 19:08:05 UTC ---

This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to 31.

Comment 1 David Hill 2019-12-03 13:49:32 UTC
We're hitting this in RHEL 8.1 / RHOSP15 and I can reproduce it.   We'll have to fix "rpm" in RHEL 8.1 too.

Comment 2 David Hill 2019-12-03 13:54:15 UTC
[root@undercloud-0-rhosp15 ~]# yum  > /dev/null
/usr/lib/python3.6/site-packages/dateutil/parser/_parser.py:70: UnicodeWarning: decode() called on unicode string, see https://bugzilla.redhat.com/show_bug.cgi?id=1693751
  instream = instream.decode()

Comment 4 David Hill 2019-12-03 14:11:43 UTC
[root@undercloud-0-rhosp15 ~]# rpm -qi rpm
Name        : rpm
Version     : 4.14.2
Release     : 25.el8
Architecture: x86_64
Install Date: Wed 06 Nov 2019 08:41:33 AM EST
Group       : System Environment/Base
Size        : 2117603
License     : GPLv2+
Signature   : RSA/SHA256, Wed 07 Aug 2019 10:15:05 AM EDT, Key ID 199e2f91fd431d51
Source RPM  : rpm-4.14.2-25.el8.src.rpm
Build Date  : Wed 07 Aug 2019 09:32:37 AM EDT
Build Host  : x86-vm-01.build.eng.bos.redhat.com
Relocations : (not relocatable)
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
Vendor      : Red Hat, Inc.
URL         : http://www.rpm.org/
Summary     : The RPM package management system
Description :
The RPM Package Manager (RPM) is a powerful command line driven
package management system capable of installing, uninstalling,
verifying, querying, and updating software packages. Each software
package consists of an archive of files along with information about
the package like its version, a description, etc.


Note You need to log in before you can comment on or make changes to this bug.