Bug 1727424

Summary: libdnf 0.35.1 crashes with "Assertion `repoImpl->libsolvRepo == repo' failed"
Product: [Fedora] Fedora
Reporter: Adam Williamson <awilliam>
Component: libdnf
Assignee: Jaroslav Rohel <jrohel>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Priority: high
Version: rawhide
CC: bojan, dmach, fzatlouk, jmracek, jrohel, ksrot, mblaha, pkratoch, robatino, rpm-software-management, yaneti
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: All
OS: Linux
Whiteboard: openqa AcceptedBlocker
Fixed In Version: libdnf-0.35.1-2.fc30
Doc Type: If docs needed, set a value
Story Points: ---
Clone Of:
: 1730224
Last Closed: 2019-07-23 15:16:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Blocks: 1644937, 1730224

Description Adam Williamson 2019-07-06 04:36:53 UTC
We seem to be hitting multiple different (but probably related?) crashes in libdnf 0.35.1, when PackageKit uses libdnf. This is one of them.

This seems to be the crash hit by both a test which involves enrolling to a FreeIPA domain via Cockpit (which in turn uses PackageKit to install some necessary client packages), and a test which just tests installing updates for a GNOME system with GNOME Software (when refreshing available updates). The journal logs this:

Jul 05 18:08:07 client002.domain.local packagekitd[1660]: packagekitd: /builddir/build/BUILD/libdnf-0.35.1/libdnf/repo/Repo.cpp:1826: void repo_internalize_trigger(Repo*): Assertion `repoImpl->libsolvRepo == repo' failed.

And the backtrace I've got looks like this:

#0  0x00007fee774ee615 in raise () at /lib64/libc.so.6
#1  0x00007fee774d78d9 in abort () at /lib64/libc.so.6
#2  0x00007fee774d77a9 in _nl_load_domain.cold () at /lib64/libc.so.6
#3  0x00007fee774e6a56 in annobin_assert.c_end () at /lib64/libc.so.6
#4  0x00007fee68679a50 in repo_internalize_trigger(s_Repo*) (repo=0x7fee50493510) at /usr/src/debug/libdnf-0.35.1-1.fc31.x86_64/libdnf/repo/Repo.cpp:1821
#5  0x00007fee68679a50 in repo_internalize_trigger(s_Repo*) (repo=0x7fee50493510) at /usr/src/debug/libdnf-0.35.1-1.fc31.x86_64/libdnf/repo/Repo.cpp:1821
#6  0x00007fee685d186f in dnf_package_get_location(DnfPackage*) (pkg=<optimized out>) at /usr/src/debug/libdnf-0.35.1-1.fc31.x86_64/libdnf/hy-package.cpp:320
        s = 0x7fee38049e18
#7  0x00007fee685e2682 in dnf_package_get_filename(DnfPackage*) (pkg=0x7fee43b87f90) at /usr/src/debug/libdnf-0.35.1-1.fc31.x86_64/libdnf/dnf-package.cpp:128
        basename = 0x0
        priv = 0x7fee434542c0
#8  0x00007fee685e2fb0 in dnf_package_check_filename(DnfPackage*, gboolean*, GError**) (pkg=pkg@entry=0x7fee43b87f90, valid=valid@entry=0x7fee4fffec04, error=error@entry=0x7fee4fffed10)
    at /usr/src/debug/libdnf-0.35.1-1.fc31.x86_64/libdnf/dnf-package.cpp:658
        checksum_type_lr = <optimized out>
        checksum_valid = 0x0
        path = <optimized out>
        checksum = <optimized out>
        ret = 1
        checksum_type_hy = 32750
        fd = <optimized out>
#9  0x00007fee685eceef in dnf_transaction_depsolve(DnfTransaction*, HyGoal, DnfState*, GError**) (transaction=0x564cfd36dab0, goal=<optimized out>, state=<optimized out>, error=error@entry=0x7fee4fffed10)
    at /usr/src/debug/libdnf-0.35.1-1.fc31.x86_64/libdnf/dnf-transaction.cpp:992
        pkg = 0x7fee43b87f90
        i = 0
        priv = <optimized out>
        valid = 32750
        packages = 0x7fee43bfc020
#10 0x00007fee68787ef2 in pk_backend_transaction_run (job=0x564cfd600ac0, state=0x7fee50001b40, error=0x7fee4fffed10) at pk-backend-dnf.c:2534
        state_local = <optimized out>
        job_data = 0x564cfd5fd570
        ret = <optimized out>
        flags = <optimized out>
#11 0x00007fee68789b74 in pk_backend_install_packages_thread (job=0x564cfd600ac0, params=<optimized out>, user_data=<optimized out>) at pk-backend-dnf.c:3121
        state_local = <optimized out>
        pkg = <optimized out>
        job_data = 0x564cfd5fd570
        filters = 8
        ret = <optimized out>
        i = <optimized out>
        relations = 0x7fee43694f90
        package_ids = 0x7fee5c00b2f0
        sack = 0x7fee50004910
        error = 0x0
        hash = 0x7fee384fbb00
        __func__ = "pk_backend_install_packages_thread"
#12 0x0000564cfc4026ae in pk_backend_job_thread_setup (thread_data=0x564cfd6425e0) at pk-backend-job.c:726
        helper = 0x564cfd6425e0
#13 0x00007fee777e0962 in g_thread_proxy () at /lib64/libglib-2.0.so.0
#14 0x00007fee776824e2 in start_thread () at /lib64/libpthread.so.0
#15 0x00007fee775b2623 in clone () at /lib64/libc.so.6

This happens on both Fedora 30 (with the libdnf 0.35.1 update in updates-testing) and Rawhide. It is an obvious Beta blocker for Fedora 31, as it violates e.g. "The installed system must be able appropriately to install, remove, and update software with the default tool for the relevant software type in all release-blocking desktops (e.g. default graphical package manager)" - https://fedoraproject.org/wiki/Fedora_31_Beta_Release_Criteria#Installing.2C_removing_and_updating_software

See also the different but likely related crash Matt Fagnani reported at https://bugzilla.redhat.com/show_bug.cgi?id=1727343 .

Comment 1 Jaroslav Mracek 2019-07-07 11:42:40 UTC
Thank you very much for the report. Could you provide a simple reproducer? I just tried "pkcon install acpi" and it worked as expected. I would prefer a reproducer without Cockpit, or at least one with a description suitable for a person with no previous Cockpit experience. We are taking the issue seriously.

Comment 2 Adam Williamson 2019-07-07 16:08:02 UTC
Using the Cockpit reproducer would be tedious, as you need to set up a whole FreeIPA server (although if there's any other Cockpit operation which causes it to install packages, those might trigger it too). Running GNOME Software and trying to refresh available updates may do it. I do have the same crash here on my own Rawhide desktop, presumably from a background update refresh attempt.

I just fiddled about with it a bit here, and I was able to reproduce it by killing all running gnome-software processes, restarting packagekit, and running gnome-software. Just did it again and it crashed again, so that's 2 for 2. Try that?

Comment 4 Jaroslav Mracek 2019-07-08 12:30:54 UTC
The issue is very difficult to resolve without a reproducer. I created a patch, https://github.com/rpm-software-management/libdnf/pull/759 , that could theoretically help.

Could you please:
1. Try the patch and check whether it resolves the issue?
2. Try to reproduce the issue with libdnf-0.33 and libdnf-0.31?

Thanks a lot

Comment 5 Daniel Mach 2019-07-08 13:04:09 UTC
A scratch build with the patch:
https://koji.fedoraproject.org/koji/taskinfo?taskID=36130933

Comment 6 Adam Williamson 2019-07-08 17:33:53 UTC
Jaroslav: as noted on IRC, I put a reproducer that worked for me in my comment.

I'm pretty sure the bug didn't happen with 0.31, as that was the version previously in Rawhide and we weren't hitting this crash till 0.35 landed. I don't know about 0.33, as that never made it to a Rawhide compose. I will check that, and the scratch build.

Comment 7 Adam Williamson 2019-07-09 02:20:56 UTC
No, the scratch build doesn't help. My reproducer (killall gnome-software, systemctl restart packagekit, gnome-software) still crashes packagekitd, with the same assertion error.

I'll test with 0.33.
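
For anyone who wants to script the reproducer above, a minimal Python sketch follows. The three commands are the ones listed in this comment, and the assertion text is the one from the journal line in the description. The `journalctl -b --no-pager` call, the helper names, and starting gnome-software in the background are assumptions made for illustration. Running it needs a graphical session and enough privileges to restart packagekit.

```python
import subprocess

# Signature of the crash, as logged in the journal in the description.
ASSERTION_TEXT = "Assertion `repoImpl->libsolvRepo == repo' failed"

# The three reproducer steps from this comment, as argv lists.
REPRODUCER_STEPS = [
    ["killall", "gnome-software"],
    ["systemctl", "restart", "packagekit"],
    ["gnome-software"],
]


def find_assertion_lines(journal_text):
    """Return the journal lines that contain the libdnf assertion message."""
    return [line for line in journal_text.splitlines() if ASSERTION_TEXT in line]


def run_reproducer():
    """Run the steps; gnome-software is left running in the background (assumption)."""
    for argv in REPRODUCER_STEPS[:-1]:
        # killall exits non-zero when no process matches, so failures are ignored.
        subprocess.run(argv, check=False)
    return subprocess.Popen(REPRODUCER_STEPS[-1])


def read_boot_journal():
    """Read the current boot's journal as text (assumes journalctl is available)."""
    result = subprocess.run(
        ["journalctl", "-b", "--no-pager"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout
```

A possible workflow: call `run_reproducer()`, wait for GNOME Software to refresh, then check whether `find_assertion_lines(read_boot_journal())` returns any lines. A non-empty result suggests packagekitd hit the same assertion.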

Comment 8 Adam Williamson 2019-07-09 05:55:06 UTC
0.33 does not seem to have the bug. I'll triage more tomorrow. If I had to guess a suspect...maybe ce7d1f25681c42079c348328bdfae26eb23d3051 ?

Comment 9 Adam Williamson 2019-07-09 19:51:57 UTC
OK, I was one commit out :). Bisected down to this commit:

https://github.com/rpm-software-management/libdnf/commit/61a235c960b552640e73909c5bc52585c5a3f844

which, now I look at it, has this rather smoking gun-looking line:

https://github.com/rpm-software-management/libdnf/commit/61a235c960b552640e73909c5bc52585c5a3f844#diff-f5b2fb1705fa70e7aeb3eb12b877c6feR1419

...so, yeah.

Comment 10 Daniel Mach 2019-07-11 14:10:01 UTC
That line should be ok, the null pointer is an initial value that is overridden later on.

The problem occurs when there is a repo with enabled=0 and enabled_metadata=1; typically that is fedora-cisco-openh264 on Fedora.
The crash happens because the refcount of the repo object is decreased, so the repo gets deallocated and is then used afterwards.
We're still unable to identify the root cause - we have no idea why the refcount gets decreased for repos with enabled=0, enabled_metadata=1.
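
To check whether a given system has a repo matching the trigger condition above, something like the following Python sketch could be used. It is only an illustration: the function names are made up, the default path `/etc/yum.repos.d/*.repo` and the fallback values (enabled on, enabled_metadata off when the key is absent) are assumptions, and values that are not plain booleans would make `getboolean` raise.

```python
import configparser
import glob


def repos_with_metadata_only(repo_file_text):
    """Return ids of repos that have enabled=0 but enabled_metadata=1."""
    # interpolation=None leaves '%' characters in URLs untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_file_text)
    matches = []
    for repo_id in parser.sections():
        section = parser[repo_id]
        # Fallbacks are assumptions about the defaults when a key is missing.
        enabled = section.getboolean("enabled", fallback=True)
        metadata = section.getboolean("enabled_metadata", fallback=False)
        if not enabled and metadata:
            matches.append(repo_id)
    return matches


def scan_repo_dir(pattern="/etc/yum.repos.d/*.repo"):
    """Apply the check to every .repo file matching the pattern."""
    found = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            found.extend(repos_with_metadata_only(handle.read()))
    return found
```

Calling `scan_repo_dir()` returns the ids of any disabled repos that still have metadata enabled, which are the candidates for triggering this crash.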

I managed to fix that in PackageKit, simply by postponing the deallocation:
https://koji.fedoraproject.org/koji/taskinfo?taskID=36183157

But we're still trying to discover the root cause in libdnf and understand what's going on.
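
The failure mode described in this comment (a repo object released while something else still points at it) can be illustrated with a toy Python model. This is not libdnf code: the class and function names are hypothetical, the reference count is a plain integer, and the check only mirrors the shape of the assertion `repoImpl->libsolvRepo == repo` from the journal. In real C++ the use of a freed object is undefined behaviour, so a failed assertion is just one of the ways it can surface.

```python
class SolvRepo:
    """Stand-in for the solver-side repo; appdata points at its wrapper."""

    def __init__(self, name):
        self.name = name
        self.appdata = None


class RepoWrapper:
    """Stand-in for the library-side repo object with a manual refcount."""

    def __init__(self, name):
        self.name = name
        self.refcount = 1
        self.solv_repo = None
        self.freed = False

    def attach(self, solv_repo):
        # Link both directions, so each object can find the other.
        self.solv_repo = solv_repo
        solv_repo.appdata = self

    def unref(self):
        self.refcount -= 1
        if self.refcount == 0:
            # "Deallocate": the wrapper drops its link, but the solver-side
            # repo still holds a (now stale) pointer to the wrapper.
            self.solv_repo = None
            self.freed = True


def internalize_trigger(solv_repo):
    """Check that both links agree, like the assertion in the journal."""
    wrapper = solv_repo.appdata
    if wrapper is None:
        return "no wrapper attached"
    if wrapper.solv_repo is not solv_repo:
        raise AssertionError("repoImpl->libsolvRepo == repo")
    return "ok"
```

In this model, one extra `unref()` on the wrapper is enough to make a later `internalize_trigger()` call fail, which matches the symptom (though not necessarily the root cause) reported here.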

Comment 11 Daniel Mach 2019-07-12 08:00:59 UTC
Adam,
after a couple of days of reviewing the code and the Repo implementation,
we have come to the conclusion that you were absolutely right about the place where it breaks.
Jaroslav Rohel is working on a fix.

The code is unnecessarily complicated: it re-initializes the underlying libsolvRepo several times,
and its reference handling is far from ideal.
Unfortunately the code probably cannot be simplified without breaking the current C API
-> we'll do that in the next major libdnf version.

Comment 12 Jaroslav Rohel 2019-07-12 12:23:49 UTC
I opened PR https://github.com/rpm-software-management/libdnf/pull/761 . However, I found another problem during CI tests, so the PR is blocked until I fix it.

Comment 13 František Zatloukal 2019-07-16 07:42:28 UTC
Discussed during the 2019-07-15 blocker review meeting: [1]

The decision to classify this bug as an AcceptedBlocker was made:

"AFAWCS, these two crashes have the same basic cause and break GNOME Software and Cockpit package installation. They do not seem to happen 100% of the time but on current information we think they're significant enough to violate 'The installed system must be able appropriately to install software with the default tool for the relevant software type in all release-blocking desktops'"

[1] https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2019-07-15/f31-blocker-review.2019-07-15-16.05.log.txt

Comment 14 Pavla Kratochvilova 2019-07-18 07:39:45 UTC
Hello,

I made scratch builds with the newest patch.
rawhide: https://koji.fedoraproject.org/koji/taskinfo?taskID=36317514
Fedora 30: https://koji.fedoraproject.org/koji/taskinfo?taskID=36317358

Can you please confirm the issue is fixed there? Thank you.

Comment 15 Adam Williamson 2019-07-18 15:54:23 UTC
OK, https://openqa.stg.fedoraproject.org/tests/overview?distri=fedora&version=30&build=Kojitask-36317358-NOREPORT&groupid=2 is testing that build; if the realmd_join_cockpit and desktop_update_graphical tests pass, that would indicate the bug is fixed.

Comment 16 Adam Williamson 2019-07-18 20:49:57 UTC
All tests passed, so from the looks of that, the fix does work. Thanks.

Comment 17 Fedora Update System 2019-07-23 07:17:02 UTC
FEDORA-2019-672a74d688 has been submitted as an update to Fedora 30. https://bodhi.fedoraproject.org/updates/FEDORA-2019-672a74d688

Comment 18 Adam Williamson 2019-07-23 15:16:53 UTC
This bug was filed against Rawhide, so we can just close it at this point. (We could probably close the other one too, as the bad update never made it out of updates-testing, but meh).

Comment 19 Fedora Update System 2019-07-24 01:42:00 UTC
dnf-4.2.7-2.fc30, libdnf-0.35.1-2.fc30 has been pushed to the Fedora 30 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-672a74d688

Comment 20 Fedora Update System 2019-07-30 01:14:37 UTC
dnf-4.2.7-2.fc30, libdnf-0.35.1-2.fc30 has been pushed to the Fedora 30 stable repository. If problems still persist, please make note of it in this bug report.