Created attachment 1682113
Description of problem:
Cockpit tests found that libvirtd often dumps core.
I don't have a specific reproducer, just a description of what is happening and a backtrace, as I am not able to reproduce this on the command line (yet our tests fail on it all the time).
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Have one running VM
2. Connect to the API (through the socket-activated libvirtd.service); this is done through Cockpit
3. systemctl stop libvirtd-ro.socket libvirtd.socket libvirtd-admin.socket
4. systemctl disable libvirtd.service
5. systemctl stop libvirtd.service
6. systemctl enable --now libvirtd.socket
7. Connect to the API (through the socket-activated libvirtd.service)
8. systemctl stop libvirtd.service
The last step seems to cause the problems when I stop the service too quickly. If I wait between steps 7 and 8 for a few seconds, I cannot get this coredump.
I am attaching the output of `coredumpctl debug` followed by `thread apply all bt full` as an attachment because it is too long to paste inline. I hope it is enough to get an idea of what is happening.
`libvirtd` does not coredump.
When I tried to install better debuginfo packages, I installed what gdb suggested:
dnf debuginfo-install audit-libs-3.0-0.19.20191104git1c2f876.fc32.x86_64 augeas-libs-1.12.0-3.fc32.x86_64 cyrus-sasl-lib-2.1.27-4.fc32.x86_64 dbus-libs-1.12.16-4.fc32.x86_64 device-mapper-libs-1.02.171-1.fc32.x86_64 glib2-2.64.1-1.fc32.x86_64 glibc-2.31-2.fc32.x86_64 glusterfs-api-7.4-1.fc32.x86_64 glusterfs-libs-7.4-1.fc32.x86_64 gmp-6.1.2-13.fc32.x86_64 gnutls-3.6.13-1.fc32.x86_64 keyutils-libs-1.6-4.fc32.x86_64 krb5-libs-1.18-1.fc32.x86_64 libacl-2.2.53-5.fc32.x86_64 libattr-2.4.48-8.fc32.x86_64 libblkid-2.35.1-7.fc32.x86_64 libcap-ng-0.7.10-2.fc32.x86_64 libcom_err-1.45.5-3.fc32.x86_64 libcurl-7.69.1-1.fc32.x86_64 libffi-3.1-24.fc32.x86_64 libgcc-10.0.1-0.11.fc32.x86_64 libgcrypt-1.8.5-3.fc32.x86_64 libibverbs-28.0-1.fc32.x86_64 libiscsi-1.18.0-9.fc32.x86_64 libmount-2.35.1-7.fc32.x86_64 libnghttp2-1.40.0-2.fc32.x86_64 libnl3-3.5.0-2.fc32.x86_64 libpciaccess-0.16-2.fc32.x86_64 libpsl-0.21.0-4.fc32.x86_64 libselinux-3.0-3.fc32.x86_64 libssh-0.9.3-2.fc32.x86_64 libssh2-1.9.0-5.fc32.x86_64 libstdc++-10.0.1-0.11.fc32.x86_64 libtasn1-4.16.0-1.fc32.x86_64 libtirpc-1.2.5-1.rc2.fc32.x86_64 libunistring-0.9.10-7.fc32.x86_64 libuuid-2.35.1-7.fc32.x86_64 libwsman1-2.6.8-12.fc32.x86_64 libxslt-1.1.34-1.fc32.x86_64 lttng-ust-2.11.0-4.fc32.x86_64 lz4-libs-1.9.1-2.fc32.x86_64 netcf-libs-0.2.8-15.fc32.x86_64 nettle-3.5.1-5.fc32.x86_64 nspr-4.25.0-1.fc32.x86_64 nss-util-3.51.0-1.fc32.x86_64 numactl-libs-2.0.12-4.fc32.x86_64 openldap-2.4.47-4.fc32.x86_64 openssl-libs-1.1.1d-7.fc32.x86_64 p11-kit-0.23.20-1.fc32.x86_64 pcre-8.44-1.fc32.x86_64 pcre2-10.34-9.fc32.x86_64 systemd-libs-245.4-1.fc32.x86_64 userspace-rcu-0.11.1-3.fc32.x86_64 yajl-2.1.0-14.fc32.x86_64
After that I was not able to reproduce this issue anymore.
The stack trace shows that a daemon worker thread is processing an API call in the interface driver. Meanwhile the main thread is freeing the virNetDaemon object. For the main thread to be doing this, it must have already de-initialized the drivers, which is bad, because the worker threads are still active.
Essentially we're lacking correct synchronization in the cleanup path. We must stop processing API calls and wait for the worker threads to become idle before we clean up anything. If this isn't possible, we must exit without performing the manual cleanup.
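To illustrate the ordering described above, here is a minimal Python sketch. It is not libvirt code: the `Daemon` class, its `drivers` dictionary and the `submit`/`shutdown` methods are hypothetical stand-ins chosen only to show the three steps (refuse new calls, wait for workers to go idle, then clean up).

```python
import queue
import threading


class Daemon:
    """Toy daemon: worker threads run jobs that read shared driver state."""

    def __init__(self, workers=2):
        self.jobs = queue.Queue()
        self.accepting = True
        self.lock = threading.Lock()
        # Hypothetical stand-in for the state owned by the drivers.
        self.drivers = {"interface": "ready"}
        self.threads = [
            threading.Thread(target=self._worker) for _ in range(workers)
        ]
        for t in self.threads:
            t.start()

    def _worker(self):
        while True:
            job = self.jobs.get()
            if job is None:  # sentinel: no more work for this thread
                return
            job(self.drivers)

    def submit(self, job):
        """Queue a job; return False once shutdown has begun."""
        with self.lock:
            if not self.accepting:
                return False
            self.jobs.put(job)
            return True

    def shutdown(self):
        # 1. Stop accepting new API calls.
        with self.lock:
            self.accepting = False
        # 2. Queue one sentinel per worker behind any pending jobs,
        #    then wait until every worker has finished and exited.
        for _ in self.threads:
            self.jobs.put(None)
        for t in self.threads:
            t.join()
        # 3. Only now tear down the state the workers were using.
        self.drivers.clear()
```

The crash in this report corresponds to running step 3 before step 2 has completed: a job still in flight would then touch state that has already been released.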
I saw the same issue in Fedora 31 with updates-testing enabled: https://github.com/cockpit-project/bots/pull/787
*** Bug 1832801 has been marked as a duplicate of this bug. ***
I just filed bug 1832801 when investigating this, with a detailed symbolic stack trace as well. It looks a bit different in the augeas details, but it's still in the virConnectListAllInterfaces() → augeas code path. Perhaps the trace is useful. I can reproduce this at will; it happens 100% of the time on my system with our test.
FYI, we believe we understand the problem & required changes at a high level: https://www.redhat.com/archives/libvir-list/2020-April/msg01328.html
BTW to set expectations - this is not going to be an easy fix, and it isn't certain we'll be backporting any fix to stable Fedora. So if this is impacting CI testing, I'd recommend attempting to work around it by not shutting down libvirtd immediately after starting - give it 5 seconds of running before shutting it down.
Thanks Daniel! Is there a more specific thing that we can wait for rather than just a static "sleep 5"? That's not going to be very reliable on our busy CI machines. I.e., is there some meaningful waiting loop with virsh or busctl on libvirt-dbus perhaps?
You can use "virsh uri" as a liveness test to validate that it has finished starting up. There is still a race in shutdown wrt currently running API calls that is shown in the stack traces.
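A waiting loop built on that suggestion could look like the following Python sketch. The function name `wait_until_ready` and its `timeout`, `interval` and `attempt_timeout` parameters are assumptions made for illustration; only the `virsh uri` command comes from this thread. It covers startup only and does nothing about the shutdown race with in-flight API calls.

```python
import subprocess
import time


def wait_until_ready(cmd=("virsh", "uri"), timeout=30.0, interval=0.5,
                     attempt_timeout=10.0):
    """Poll `cmd` until it exits with status 0.

    Returns True as soon as one attempt succeeds, or False once
    `timeout` seconds have passed without a successful attempt.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=attempt_timeout,
            )
            if result.returncode == 0:
                return True
        except (OSError, subprocess.TimeoutExpired):
            # Command missing or hung: treat it as "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A CI test could call `wait_until_ready()` after connecting and only then proceed to `systemctl stop libvirtd.service`, which replaces a fixed sleep with a bounded poll.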
The trace shows it is still executing an API call at the point where you are telling it to shut down. The stack trace shows:
#14 0x00007f4f6d8ce38f virConnectListAllInterfaces (libvirt.so.0 + 0x32838f)
This is an API call that is in progress by one of the clients you have connected. Meanwhile something else has told libvirtd to shut down.
So ideally you would fix the CI to not try to shut down libvirtd while you are still waiting for an API call to finish.
I am experimenting with calling "virsh domifaddr 1" before stopping, to wait until that code path has finished. I have no real idea what I am doing of course, but at least the test succeeded 7 times in a row locally. I have thrown it against the CI wall now: https://github.com/cockpit-project/cockpit/pull/14043
If that works, it would be a robust enough workaround for this. Thanks Daniel for your fast reply!
This message is a reminder that Fedora 32 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '32'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.
Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 32 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
Fedora 32 changed to end-of-life (EOL) status on 2021-05-25. Fedora 32 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.
Thank you for reporting this bug and we are sorry it could not be fixed.