Bug 1629340

Summary: PackageKit update crashes at end of transaction with "TransactionItem state is not set: grub2-tools-1:2.02-57.fc29.x86_64"
Product: [Fedora] Fedora Reporter: Adam Williamson <awilliam>
Component: libdnf Assignee: rpm-software-management
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: unspecified    
Version: 29 CC: dmach, gmarr, jmracek, jonathan, klember, kusmabite, mblaha, mluscon, packaging-team-maint, pjones, prd-fedora, rdieter, rhughes, robatino, rpm-software-management, smparrish
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: AcceptedBlocker
Fixed In Version: libdnf-0.19.1-3.fc29 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1632527 (view as bug list) Environment:
Last Closed: 2018-09-20 22:35:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1517011    
Attachments:
Description Flags
backtrace of the crash (from pkcon) none
screencast of the bug happening with dnf 3.5.1 and libdnf 0.19.1 (clean Beta RC3 Workstation live install) none

Description Adam Williamson 2018-09-15 01:47:22 UTC
I wanted to test the DNF 3.5.1 update[1] to see if we should pull it into F29 Beta, so I built a live image containing those packages. That worked. I ran an install and boot of that live image. That worked. Then I tried running an offline update from the installed system. The update process got to 97% and then seemed to get stuck. I left it for over half an hour, then shut down and rebooted the system. Looking at the logs from the update boot, I see this:

Sep 14 17:56:38 localhost.localdomain packagekitd[639]: terminate called after throwing an instance of 'std::runtime_error'
Sep 14 17:56:38 localhost.localdomain packagekitd[639]:   what():  TransactionItem state is not set: grub2-tools-1:2.02-57.fc29.x86_64
Sep 14 17:56:38 localhost.localdomain systemd[1]: packagekit.service: Main process exited, code=killed, status=6/ABRT
Sep 14 17:56:38 localhost.localdomain systemd[1]: packagekit.service: Failed with result 'signal'.

It does not seem like the crash was actually captured by coredumpctl or abrt, unfortunately.

This error looks a lot like one that was claimed fixed in dnf a while ago:

https://bugzilla.redhat.com/show_bug.cgi?id=1603148 , marked dupe of:
https://bugzilla.redhat.com/show_bug.cgi?id=1601877

It also looks similar to a couple of other reports:

https://bugzilla.redhat.com/show_bug.cgi?id=1622449 (for which the commit claimed as a fix is in 0.19.1, so that shouldn't be the problem here)
https://bugzilla.redhat.com/show_bug.cgi?id=1599185 (an older report which has not apparently been followed up or closed)
https://bugzilla.redhat.com/show_bug.cgi?id=1608685 (another report involving grub from July, so may be a dupe of one of the others)

I will see if this reproduces on a second try, and also see if it happens on upgrade from the RC1 image (which had dnf 3.2.0).

[1]: https://bodhi.fedoraproject.org/updates/FEDORA-2018-f16a71bc92

Comment 1 Adam Williamson 2018-09-15 03:36:37 UTC
Happened again, exactly the same way, on the second try. Will now test RC1.

Note the update that seems to have the trouble is from https://koji.fedoraproject.org/koji/buildinfo?buildID=1143964 (which is on the system after install) to https://koji.fedoraproject.org/koji/buildinfo?buildID=1144121 (which is currently in updates-testing).

Comment 2 Adam Williamson 2018-09-15 04:09:07 UTC
Beta RC1 (with dnf 3.2.0 and libdnf 0.17.0) behaves just the same :(

Proposing this as a Beta blocker, as a violation of "The installed system must be able appropriately to install, remove, and update software with the default tool for the relevant software type in all release-blocking desktops (e.g. default graphical package manager). This includes downloading of packages to be installed/updated." - GNOME Software offline update is the 'default tool' for Workstation, and with current u-t packages, this seems to reliably fail and cause a hung update.

I'll do some tests with plain dnf (not g-s) next.

Comment 3 Adam Williamson 2018-09-15 05:33:28 UTC
Just booting the installed system fresh and doing 'dnf update grub2*' doesn't hit the problem (that transaction completes fine).

Comment 4 Adam Williamson 2018-09-15 19:45:34 UTC
Is it possible PackageKit needs a change along the lines of https://github.com/rpm-software-management/dnf/pull/1134 ?

Comment 5 Adam Williamson 2018-09-15 20:48:57 UTC
`pkcon update` (from a clean Workstation live install) crashes the same way.

Comment 6 Adam Williamson 2018-09-15 21:31:46 UTC
Created attachment 1483579
backtrace of the crash (from pkcon)

Comment 7 Adam Williamson 2018-09-16 15:53:28 UTC
Just 'pkcon update grub2-tools' (from the clean installed live image) also reproduces the crash.

Comment 8 Adam Williamson 2018-09-17 19:12:42 UTC
dmach, can you please have someone look at this on the libdnf end? PackageKit maintainer says it is crashing in swdb, and he has no idea how swdb works. Thanks! We need an evaluation/fix for this urgently as go/no-go is on Thursday.

Comment 9 Geoffrey Marr 2018-09-17 20:00:15 UTC
Discussed during the 2018-09-17 blocker review meeting: [1]

The decision to classify this bug as an "AcceptedBlocker" was made as it violates the following criteria:

"The installed system must be able appropriately to install, remove, and update software with the default tool for the relevant software type in all release-blocking desktops (e.g. default graphical package manager)" for Workstation, with current repo state.

[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2018-09-17/f29-blocker-review.2018-09-17-16.02.txt

Comment 10 Adam Williamson 2018-09-17 21:49:08 UTC
Note, it seems like the 'interesting' thing about grub2-tools is that it obsoletes old versions of...itself:

[adamw@adam pki-core ((pki-core-10.6.6-1.fc29) %)]$ rpm -q grub2-tools
grub2-tools-2.02-58.fc29.x86_64
[adamw@adam pki-core ((pki-core-10.6.6-1.fc29) %)]$ rpm -q --obsoletes grub2-tools
grub2-tools < 1:2.02-58.fc29
[adamw@adam pki-core ((pki-core-10.6.6-1.fc29) %)]$

I don't know why it does that. But that's the obvious suspect for the 'odd' condition that causes this.

Comment 11 Adam Williamson 2018-09-17 22:00:52 UTC
CCing pjones for the slightly odd thing grub2 does here. It's not really _wrong_ per se, and package managers should certainly cope with it, but it seems unusual and I can't see what the point of it is.

Comment 12 Adam Williamson 2018-09-17 22:15:53 UTC
Ah, so I think these obsoletes: showed up when there was a subpackage split in grub2.

The generic scenario is like this. Say you have package 'foo', version 1.0-1, and you want to split it into 'foo' and 'foo-extras' in 2.0-1. You want systems that already have 'foo' installed to get both 'foo' and 'foo-extras' when they update (to make sure they don't lose anything), but you don't want 'foo' to actually depend on 'foo-extras' going forward. In that case, what you have to do is make both foo-2.0-1 and foo-extras-2.0-1 obsolete the old foo, e.g.:

Obsoletes: foo < 2.0-1

That's a reason you'd have 'foo' obsolete an old version of itself. It seems grub2 did some splits like this in the past. I'd expect these obsoletes to be specifically versioned to cover the actual point in the package history where the splits occurred, and it seems like at first they actually were, but in https://src.fedoraproject.org/rpms/grub2/c/ecef1ed7b50ed05b65574c8b8815d7ae66e5a0a9 , for some reason, a lot of the Obsoletes: were rejigged and they all wound up being "< %{evr}" instead of "< (some specific version)".

If I'm right that this crash happens any time we are upgrading some package to a new version which Obsoletes the currently-installed version, then it is a real problem, because it will happen any time anyone is doing one of these subpackage splits properly.
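To make the suspected condition concrete, here is a minimal Python sketch of the "update obsoletes the installed version of the same package" check. It is a toy model only, not libdnf or rpm code: the `Package` and `Obsolete` classes and the `obsoletes_installed` helper are hypothetical names, and EVRs are compared as plain integer tuples instead of with rpm's real version comparison.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Toy EVR: (epoch, version segments, release number).
# Assumption: simple tuple ordering stands in for rpm's version comparison.
EVR = Tuple[int, Tuple[int, ...], int]


@dataclass(frozen=True)
class Obsolete:
    """Models a spec line of the form 'Obsoletes: name < EVR'."""
    name: str
    less_than: EVR


@dataclass
class Package:
    name: str
    evr: EVR
    obsoletes: List[Obsolete] = field(default_factory=list)


def obsoletes_installed(new: Package, installed: Package) -> bool:
    """True if any Obsoletes entry of 'new' matches the installed package."""
    return any(
        o.name == installed.name and installed.evr < o.less_than
        for o in new.obsoletes
    )


# grub2-tools 1:2.02-57 installed, 1:2.02-58 offered as the update.
# The version "2.02" is modelled as the segments (2, 2).
installed = Package("grub2-tools", (1, (2, 2), 57))
update = Package(
    "grub2-tools",
    (1, (2, 2), 58),
    [Obsolete("grub2-tools", (1, (2, 2), 58))],  # i.e. "< %{evr}"
)

# The update is both a plain upgrade of the installed package and an
# obsoleter of it, which is the 'odd' condition suspected above.
is_upgrade = update.name == installed.name and update.evr > installed.evr
is_obsoleted = obsoletes_installed(update, installed)
```

In this model both `is_upgrade` and `is_obsoleted` hold for the grub2-tools pair, so the installed package is reachable through two different relations in the same transaction.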

Comment 13 Adam Williamson 2018-09-17 23:03:04 UTC
So PackageKit's code here is really very simple and more or less amounts to getting the package ids to be updated using dnf_utils_find_package_ids, then calling hy_goal_upgrade_to on each one, then running the transaction. That's basically all it does. So if something like what was done in dnf needs to be done for this case, it feels to me like it ought to be done *in libdnf*.

Comment 14 Daniel Mach 2018-09-18 08:56:14 UTC
I believe this was fixed in libdnf-0.19.0:
a46e66c5 [transaction] Avoid adding duplicates via Transaction::addItem().

(Duplicates in the transaction/history meant that the rpm transaction callbacks set the state of only the first item; the remaining occurrences of the same item were left unchanged, which ended in the exception "TransactionItem state is not set".)
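The mechanism described above can be sketched as a small Python toy model. This is not libdnf code: the `Transaction` and `TransactionItem` classes here, their methods, and the `dedupe` switch are all hypothetical, and the assumption that the same package gets added twice (once as upgraded, once as obsoleted) is only an illustration. Whether this is the path actually hit in this bug is not established by the sketch.

```python
from typing import List, Optional


class TransactionItem:
    def __init__(self, nevra: str) -> None:
        self.nevra = nevra
        self.state: Optional[str] = None  # None models "state is not set"


class Transaction:
    def __init__(self, dedupe: bool) -> None:
        self.items: List[TransactionItem] = []
        self.dedupe = dedupe  # hypothetical switch: refuse duplicate items

    def add_item(self, nevra: str) -> TransactionItem:
        if self.dedupe:
            for item in self.items:
                if item.nevra == nevra:
                    return item  # reuse the existing item
        item = TransactionItem(nevra)
        self.items.append(item)
        return item

    def on_rpm_callback(self, nevra: str) -> None:
        # Only the first matching item gets its state set.
        for item in self.items:
            if item.nevra == nevra:
                item.state = "done"
                return

    def finish(self) -> None:
        # Any item left without a state aborts the transaction.
        for item in self.items:
            if item.state is None:
                raise RuntimeError(
                    f"TransactionItem state is not set: {item.nevra}"
                )


def run(dedupe: bool) -> str:
    t = Transaction(dedupe)
    nevra = "grub2-tools-1:2.02-57.fc29.x86_64"
    t.add_item(nevra)  # assumption: added once as the upgraded package
    t.add_item(nevra)  # assumption: added again as an obsoleted package
    t.on_rpm_callback(nevra)
    try:
        t.finish()
    except RuntimeError as exc:
        return str(exc)
    return "ok"
```

With `run(False)` the second copy of the item never receives a state and the final check raises; with `run(True)` the duplicate is never added and the transaction finishes.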

You mentioned an upgrade from dnf 3.2, which used libdnf 0.17.
The issue cannot be fixed as the upgrade is performed by the old dnf and libdnf.
If you upgrade dnf and libdnf first and then upgrade the rest, everything should work as expected.

To me, it's CLOSED/CURRENTRELEASE already.

Comment 15 Adam Williamson 2018-09-18 15:26:32 UTC
...can you please read more closely? I already explicitly pointed out that commit (and the bug report that led to it) and said it is *not* the problem, as this was tested with DNF 3.5.1 and libdnf 0.19.1. I *later* tested with DNF 3.2.0 and libdnf 0.17 to check whether the bug was new or not; it is not, the same bug happens in both.

You can reproduce this for yourself, quite easily, by getting the RC3 Workstation live:

https://kojipkgs.fedoraproject.org/compose/29/Fedora-29-20180916.0/compose/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-29_Beta-1.3.iso

installing it, and from the installed system, running 'pkcon update grub2-tools'. As you do so, you can check for yourself that it includes DNF 3.5.1 and libdnf 0.19.1.

Comment 16 Adam Williamson 2018-09-18 16:30:03 UTC
Created attachment 1484424
screencast of the bug happening with dnf 3.5.1 and libdnf 0.19.1 (clean Beta RC3 Workstation live install)

Here is a *screencast* of me reproducing the bug on Beta RC3 Workstation live with DNF 3.5.1 and libdnf 0.19.1.

Comment 18 Adam Williamson 2018-09-18 23:33:55 UTC
Fix works in a quick test here. I'm going to fire a build so we can put it in an RC4, as we're on a very tight time frame for Beta. If it turns out to fail CI or be bad in some other way, we can just throw away RC4.

Comment 19 Fedora Update System 2018-09-19 00:19:37 UTC
anaconda-29.24.3-1.fc29 dnf-3.5.1-1.fc29 dnf-plugins-core-3.0.3-1.fc29 libdnf-0.19.1-3.fc29 lorax-29.12-2.fc29 python-blivet-3.1.0-2.fc29 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2018-f16a71bc92

Comment 20 Fedora Update System 2018-09-20 16:16:23 UTC
anaconda-29.24.3-1.fc29, dnf-3.5.1-1.fc29, dnf-plugins-core-3.0.3-1.fc29, libdnf-0.19.1-3.fc29, lorax-29.12-2.fc29, python-blivet-3.1.0-2.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-f16a71bc92

Comment 21 Fedora Update System 2018-09-20 22:35:41 UTC
anaconda-29.24.3-1.fc29, dnf-3.5.1-1.fc29, dnf-plugins-core-3.0.3-1.fc29, libdnf-0.19.1-3.fc29, lorax-29.12-2.fc29, python-blivet-3.1.0-2.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.

Comment 22 Daniel Mach 2019-03-08 06:43:16 UTC
*** Bug 1599185 has been marked as a duplicate of this bug. ***

Comment 23 Daniel Mach 2019-03-08 06:44:10 UTC
*** Bug 1608685 has been marked as a duplicate of this bug. ***