We seem to be hitting multiple different (but probably related?) crashes in libdnf 0.35.1, when PackageKit uses libdnf. This is one of them.
This seems to be the crash hit by both a test which involves enrolling to a FreeIPA domain via Cockpit (which in turn uses PackageKit to install some necessary client packages), and a test which just tests installing updates for a GNOME system with GNOME Software (when refreshing available updates). The journal logs this:
Jul 05 18:08:07 client002.domain.local packagekitd: packagekitd: /builddir/build/BUILD/libdnf-0.35.1/libdnf/repo/Repo.cpp:1826: void repo_internalize_trigger(Repo*): Assertion `repoImpl->libsolvRepo == repo' failed.
And the backtrace I've got looks like this:
#0 0x00007fee774ee615 in raise () at /lib64/libc.so.6
#1 0x00007fee774d78d9 in abort () at /lib64/libc.so.6
#2 0x00007fee774d77a9 in _nl_load_domain.cold () at /lib64/libc.so.6
#3 0x00007fee774e6a56 in annobin_assert.c_end () at /lib64/libc.so.6
#4 0x00007fee68679a50 in repo_internalize_trigger(s_Repo*) (repo=0x7fee50493510) at /usr/src/debug/libdnf-0.35.1-1.fc31.x86_64/libdnf/repo/Repo.cpp:1821
#5 0x00007fee68679a50 in repo_internalize_trigger(s_Repo*) (repo=0x7fee50493510) at /usr/src/debug/libdnf-0.35.1-1.fc31.x86_64/libdnf/repo/Repo.cpp:1821
#6 0x00007fee685d186f in dnf_package_get_location(DnfPackage*) (pkg=<optimized out>) at /usr/src/debug/libdnf-0.35.1-1.fc31.x86_64/libdnf/hy-package.cpp:320
s = 0x7fee38049e18
#7 0x00007fee685e2682 in dnf_package_get_filename(DnfPackage*) (pkg=0x7fee43b87f90) at /usr/src/debug/libdnf-0.35.1-1.fc31.x86_64/libdnf/dnf-package.cpp:128
basename = 0x0
priv = 0x7fee434542c0
#8 0x00007fee685e2fb0 in dnf_package_check_filename(DnfPackage*, gboolean*, GError**) (pkg=pkg@entry=0x7fee43b87f90, valid=valid@entry=0x7fee4fffec04, error=error@entry=0x7fee4fffed10)
checksum_type_lr = <optimized out>
checksum_valid = 0x0
path = <optimized out>
checksum = <optimized out>
ret = 1
checksum_type_hy = 32750
fd = <optimized out>
#9 0x00007fee685eceef in dnf_transaction_depsolve(DnfTransaction*, HyGoal, DnfState*, GError**) (transaction=0x564cfd36dab0, goal=<optimized out>, state=<optimized out>, error=error@entry=0x7fee4fffed10)
pkg = 0x7fee43b87f90
i = 0
priv = <optimized out>
valid = 32750
packages = 0x7fee43bfc020
#10 0x00007fee68787ef2 in pk_backend_transaction_run (job=0x564cfd600ac0, state=0x7fee50001b40, error=0x7fee4fffed10) at pk-backend-dnf.c:2534
state_local = <optimized out>
job_data = 0x564cfd5fd570
ret = <optimized out>
flags = <optimized out>
#11 0x00007fee68789b74 in pk_backend_install_packages_thread (job=0x564cfd600ac0, params=<optimized out>, user_data=<optimized out>) at pk-backend-dnf.c:3121
state_local = <optimized out>
pkg = <optimized out>
job_data = 0x564cfd5fd570
filters = 8
ret = <optimized out>
i = <optimized out>
relations = 0x7fee43694f90
package_ids = 0x7fee5c00b2f0
sack = 0x7fee50004910
error = 0x0
hash = 0x7fee384fbb00
__func__ = "pk_backend_install_packages_thread"
#12 0x0000564cfc4026ae in pk_backend_job_thread_setup (thread_data=0x564cfd6425e0) at pk-backend-job.c:726
helper = 0x564cfd6425e0
#13 0x00007fee777e0962 in g_thread_proxy () at /lib64/libglib-2.0.so.0
#14 0x00007fee776824e2 in start_thread () at /lib64/libpthread.so.0
#15 0x00007fee775b2623 in clone () at /lib64/libc.so.6
This is happening on both Fedora 30 (with the libdnf 0.35.1 update in updates-testing) and Rawhide. This is an obvious Beta blocker for Fedora 31, as it violates e.g. "The installed system must be able appropriately to install, remove, and update software with the default tool for the relevant software type in all release-blocking desktops (e.g. default graphical package manager)" - https://fedoraproject.org/wiki/Fedora_31_Beta_Release_Criteria#Installing.2C_removing_and_updating_software
See also the different but likely related crash Matt Fagnani reported at https://bugzilla.redhat.com/show_bug.cgi?id=1727343 .
Thank you very much for the report. Could you provide a simple reproducer? I just tried "pkcon install acpi" and it worked as expected. I would prefer a reproducer without Cockpit, or at least one with a description suitable for a person with no previous Cockpit experience. We are taking the issue seriously.
Using the Cockpit reproducer would be tedious, as you need to set up a whole FreeIPA server (although if there's any other Cockpit operation which causes it to install packages, those might trigger it too). Running GNOME Software and trying to refresh available updates may do it. I do have the same crash here on my own Rawhide desktop, presumably from a background update refresh attempt.
I just fiddled about with it a bit here, and I was able to reproduce it by killing all running gnome-software processes, restarting packagekit, and running gnome-software. Just did it again and it crashed again, so that's 2 for 2. Try that?
The issue is very difficult to resolve without a reproducer. I created a patch, https://github.com/rpm-software-management/libdnf/pull/759, that could theoretically help.
Could you please:
1. Try the patch and check whether it resolves the issue?
2. Try to reproduce the issue with libdnf-0.33 and libdnf-0.31?
Thanks a lot
A scratch build with the patch:
Jaroslav: as noted on IRC, I put a reproducer that worked for me in my comment.
I'm pretty sure the bug didn't happen with 0.31, as that was the version previously in Rawhide and we weren't hitting this crash till 0.35 landed. I don't know about 0.33, as that never made it to a Rawhide compose. I will check that, and the scratch build.
No, the scratch build doesn't help. My reproducer (killall gnome-software, systemctl restart packagekit, gnome-software) still crashes packagekitd, with the same assertion error.
I'll test with 0.33.
0.33 does not seem to have the bug. I'll triage more tomorrow. If I had to guess a suspect...maybe ce7d1f25681c42079c348328bdfae26eb23d3051 ?
OK, I was one commit out :). Bisected down to this commit:
which, now I look at it, has this rather smoking gun-looking line:
That line should be ok, the null pointer is an initial value that is overridden later on.
The problem occurs when there's a repo with enabled=0, enabled_metadata=1; typically fedora-cisco-openh264 on Fedora.
The reason for the crash is that the refcount of the repo object is decreased, the repo gets deallocated, and it is then used afterwards, which triggers the crash.
We're still unable to identify the root cause: we have no idea why the refcount gets decreased for repos with enabled=0, enabled_metadata=1.
I managed to fix that in PackageKit, simply by postponing the deallocation:
But we're still trying to discover the root cause in libdnf and understand what's going on.
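The failure mode described above, and the workaround of postponing deallocation, can be modelled in a few lines. This is only a toy sketch under stated assumptions: `ToySolvRepo`, `ToyRepoImpl`, `attach` and the two `check...` functions are hypothetical names, not libdnf, libsolv or PackageKit code. The real code works with raw pointers, where using the freed object is undefined behaviour; the sketch swaps in `std::shared_ptr` and `std::weak_ptr` so that the stale state can be detected safely.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for illustration; not the real libsolv/libdnf types.
struct ToySolvRepo;

struct ToyRepoImpl {
    ToySolvRepo *libsolvRepo = nullptr;  // back-pointer to the solver-side repo
};

struct ToySolvRepo {
    std::weak_ptr<ToyRepoImpl> appdata;  // non-owning link to the owner-side object
};

// Same shape as the failed assertion: the owner-side object must still exist
// and must point back at this solver-side repo.
bool backPointerIsValid(const ToySolvRepo &repo) {
    std::shared_ptr<ToyRepoImpl> impl = repo.appdata.lock();
    return impl && impl->libsolvRepo == &repo;
}

// Link a new owner-side object to the solver-side repo and return the
// single owning reference to it.
std::shared_ptr<ToyRepoImpl> attach(ToySolvRepo &repo) {
    auto impl = std::make_shared<ToyRepoImpl>();
    impl->libsolvRepo = &repo;
    repo.appdata = impl;
    assert(backPointerIsValid(repo));  // holds right after attaching
    return impl;
}

// Case 1: the last reference is dropped before the repo is used again.
bool checkAfterEarlyRelease() {
    ToySolvRepo repo;
    std::shared_ptr<ToyRepoImpl> owner = attach(repo);
    owner.reset();                    // refcount reaches zero, impl is destroyed
    return backPointerIsValid(repo);  // the invariant no longer holds
}

// Case 2: the caller keeps an extra reference until after the last use.
bool checkWithPostponedRelease() {
    ToySolvRepo repo;
    std::shared_ptr<ToyRepoImpl> owner = attach(repo);
    std::shared_ptr<ToyRepoImpl> keepAlive = owner;  // extra reference
    owner.reset();                    // refcount drops to one, impl survives
    bool ok = backPointerIsValid(repo);
    keepAlive.reset();                // deallocation postponed to this point
    return ok;
}
```

In this model `checkAfterEarlyRelease()` returns false and `checkWithPostponedRelease()` returns true. It says nothing about why the refcount drops for repos with enabled=0, enabled_metadata=1, which is the open question here.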
After a couple of days of reviewing the code and the Repo implementation,
we have come to the conclusion that you were absolutely right about the place where it breaks.
Jaroslav Rohel is working on a fix.
The code is unnecessarily complicated and re-initializes the underlying libsolvRepo several times,
and the reference handling is far from ideal.
Unfortunately, the code probably cannot be simplified without breaking the current C API
-> we'll do that in the next major libdnf version.
I added PR https://github.com/rpm-software-management/libdnf/pull/761 . But I found another problem during CI tests, so the PR is blocked until I fix it.
Discussed during the 2019-07-15 blocker review meeting.
The decision to classify this bug as an AcceptedBlocker was made:
"AFAWCS, these two crashes have the same basic cause and break GNOME Software and Cockpit package installation. They do not seem to happen 100% of the time but on current information we think they're significant enough to violate "The installed system must be able appropriately to install software with the default tool for the relevant software type in all release-blocking desktops""
I made scratch builds with the newest patch.
Fedora 30: https://koji.fedoraproject.org/koji/taskinfo?taskID=36317358
Can you please confirm the issue is fixed there? Thank you.
OK, https://openqa.stg.fedoraproject.org/tests/overview?distri=fedora&version=30&build=Kojitask-36317358-NOREPORT&groupid=2 is testing that build; if the realmd_join_cockpit and desktop_update_graphical tests pass, that would indicate the bug is fixed.
All tests passed, so from the looks of that, the fix does work. Thanks.
FEDORA-2019-672a74d688 has been submitted as an update to Fedora 30. https://bodhi.fedoraproject.org/updates/FEDORA-2019-672a74d688
This bug was filed against Rawhide, so we can just close it at this point. (We could probably close the other one too as the bad update never made it out of u-t, but meh).
dnf-4.2.7-2.fc30, libdnf-0.35.1-2.fc30 has been pushed to the Fedora 30 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-672a74d688
dnf-4.2.7-2.fc30, libdnf-0.35.1-2.fc30 has been pushed to the Fedora 30 stable repository. If problems still persist, please make note of it in this bug report.