Bug 1828207 - Libvirtd dumps core when stopped too quickly
Summary: Libvirtd dumps core when stopped too quickly
Alias: None
Product: Fedora
Classification: Fedora
Component: libvirt
Version: 32
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Libvirt Maintainers
QA Contact: Fedora Extras Quality Assurance
Duplicates: 1832801
Depends On:
Reported: 2020-04-27 10:19 UTC by Matej Marušák
Modified: 2021-05-25 16:00 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Last Closed: 2021-05-25 16:00:35 UTC
Type: Bug

Attachments
Backtrace (42.38 KB, text/plain)
2020-04-27 10:19 UTC, Matej Marušák

Description Matej Marušák 2020-04-27 10:19:24 UTC
Created attachment 1682113

Description of problem:
Cockpit tests found that libvirtd often dumps core.
I don't have a specific reproducer, just a description of what is happening and a backtrace, as I am not able to reproduce this on the command line (yet our tests fail on it all the time).

Version-Release number of selected component (if applicable):

How reproducible:
1. Have one running VM
2. Connect to the API (through socket activated libvirtd.service) - this is done through Cockpit
3. systemctl stop libvirtd-ro.socket libvirtd.socket libvirtd-admin.socket
4. systemctl disable libvirtd.service
5. systemctl stop libvirtd.service
6. systemctl enable --now libvirtd.socket
7. Connect to the API (through socket activated libvirtd.service)
8. systemctl stop libvirtd.service

The last step seems to cause the problems when I stop the service too quickly. If I wait between steps 7 and 8 for a few seconds, I cannot get this coredump.
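
The steps above can be sketched as a shell function for anyone trying to reproduce this outside Cockpit. This is only an approximation under stated assumptions: the original API connections were made by Cockpit, so `virsh -c qemu:///system iface-list --all` is a stand-in chosen because it should reach virConnectListAllInterfaces, the call visible in the backtrace. The names `run`, `connect_api`, `reproduce` and the `DRY_RUN` variable are hypothetical helpers, not part of libvirt or Cockpit. Since the crash was not reproducible on the command line, this sequence may well not trigger it.

```shell
#!/bin/sh
# Sketch of the reproduction steps. DRY_RUN and all function names are
# illustrative helpers, not part of libvirt or Cockpit.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Assumption: stand-in for the API connection that Cockpit makes in steps 2 and 7.
connect_api() {
    run virsh -c qemu:///system iface-list --all
}

reproduce() {
    # Step 1 (one running VM) is a precondition and is not automated here.
    connect_api                                   # step 2
    run systemctl stop libvirtd-ro.socket libvirtd.socket libvirtd-admin.socket  # step 3
    run systemctl disable libvirtd.service        # step 4
    run systemctl stop libvirtd.service           # step 5
    run systemctl enable --now libvirtd.socket    # step 6
    # Step 7 runs in the background so the API call may still be in flight
    # when step 8 stops the service (an assumption about the tight timing).
    connect_api &
    run systemctl stop libvirtd.service           # step 8, no delay
    wait                                          # reap the background call
}
```

Running `DRY_RUN=1 reproduce` only prints the commands; calling `reproduce` as root executes them.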

I am attaching the output of `coredumpctl debug` followed by `thread apply all bt full` as an attachment because it is too long to paste here. I hope it is enough to get an idea of what is happening.

Expected results:
`libvirtd` does not dump core.


When I tried to install better debuginfos, I installed what gdb suggested:
dnf debuginfo install audit-libs-3.0-0.19.20191104git1c2f876.fc32.x86_64 augeas-libs-1.12.0-3.fc32.x86_64 cyrus-sasl-lib-2.1.27-4.fc32.x86_64 dbus-libs-1.12.16-4.fc32.x86_64 device-mapper-libs-1.02.171-1.fc32.x86_64 glib2-2.64.1-1.fc32.x86_64 glibc-2.31-2.fc32.x86_64 glusterfs-api-7.4-1.fc32.x86_64 glusterfs-libs-7.4-1.fc32.x86_64 gmp-6.1.2-13.fc32.x86_64 gnutls-3.6.13-1.fc32.x86_64 keyutils-libs-1.6-4.fc32.x86_64 krb5-libs-1.18-1.fc32.x86_64 libacl-2.2.53-5.fc32.x86_64 libattr-2.4.48-8.fc32.x86_64 libblkid-2.35.1-7.fc32.x86_64 libcap-ng-0.7.10-2.fc32.x86_64 libcom_err-1.45.5-3.fc32.x86_64 libcurl-7.69.1-1.fc32.x86_64 libffi-3.1-24.fc32.x86_64 libgcc-10.0.1-0.11.fc32.x86_64 libgcrypt-1.8.5-3.fc32.x86_64 libibverbs-28.0-1.fc32.x86_64 libiscsi-1.18.0-9.fc32.x86_64 libmount-2.35.1-7.fc32.x86_64 libnghttp2-1.40.0-2.fc32.x86_64 libnl3-3.5.0-2.fc32.x86_64 libpciaccess-0.16-2.fc32.x86_64 libpsl-0.21.0-4.fc32.x86_64 libselinux-3.0-3.fc32.x86_64 libssh-0.9.3-2.fc32.x86_64 libssh2-1.9.0-5.fc32.x86_64 libstdc++-10.0.1-0.11.fc32.x86_64 libtasn1-4.16.0-1.fc32.x86_64 libtirpc-1.2.5-1.rc2.fc32.x86_64 libunistring-0.9.10-7.fc32.x86_64 libuuid-2.35.1-7.fc32.x86_64 libwsman1-2.6.8-12.fc32.x86_64 libxslt-1.1.34-1.fc32.x86_64 lttng-ust-2.11.0-4.fc32.x86_64 lz4-libs-1.9.1-2.fc32.x86_64 netcf-libs-0.2.8-15.fc32.x86_64 nettle-3.5.1-5.fc32.x86_64 nspr-4.25.0-1.fc32.x86_64 nss-util-3.51.0-1.fc32.x86_64 numactl-libs-2.0.12-4.fc32.x86_64 openldap-2.4.47-4.fc32.x86_64 openssl-libs-1.1.1d-7.fc32.x86_64 p11-kit-0.23.20-1.fc32.x86_64 pcre-8.44-1.fc32.x86_64 pcre2-10.34-9.fc32.x86_64 systemd-libs-245.4-1.fc32.x86_64 userspace-rcu-0.11.1-3.fc32.x86_64 yajl-2.1.0-14.fc32.x86_64
After that I was not able to reproduce this issue anymore.

Comment 1 Daniel Berrangé 2020-04-27 10:31:15 UTC
The stack trace shows that the daemon worker thread is processing an API call in the interface driver. Meanwhile the main thread is freeing the virNetDaemon object. For the main thread to be doing this, it must have already de-initialized the drivers, which is bad, because the worker threads are still active.

Essentially we're lacking correct synchronization in the cleanup path. We must stop processing API calls and wait for the worker threads to become idle before we clean up anything. If this isn't possible, we must exit without performing the manual cleanup.

Comment 2 Martin Pitt 2020-04-27 18:20:13 UTC
I saw the same issue in Fedora 31 with updates-testing enabled: https://github.com/cockpit-project/bots/pull/787

Comment 3 Martin Pitt 2020-05-07 10:15:03 UTC
*** Bug 1832801 has been marked as a duplicate of this bug. ***

Comment 4 Martin Pitt 2020-05-07 10:17:06 UTC
I just filed bug 1832801 when investigating this, with a detailed symbolic stack trace as well. It looks a bit different in the augeas details, but it's still in the virConnectListAllInterfaces() → augeas code path. Perhaps the trace is useful. I can reproduce this at will; it happens 100% of the time on my system with our test.

Comment 5 Daniel Berrangé 2020-05-07 10:19:55 UTC
FYI, we believe we understand the problem & required changes at a high level:  https://www.redhat.com/archives/libvir-list/2020-April/msg01328.html

Comment 6 Daniel Berrangé 2020-05-07 10:24:21 UTC
BTW to set expectations - this is not going to be an easy fix, and it isn't certain we'll be backporting any fix to stable Fedora. So if this is impacting CI testing, I'd recommend attempting to work around it by not shutting down libvirtd immediately after starting - give it 5 seconds of running before shutting it down.

Comment 7 Martin Pitt 2020-05-07 10:31:21 UTC
Thanks Daniel! Is there a more specific thing that we can wait for rather than just a static "sleep 5"? That's not going to be very reliable on our busy CI machines. I.e. is there some meaningful waiting loop with virsh or busctl on libvirt-dbus perhaps?

Comment 8 Daniel Berrangé 2020-05-07 10:37:53 UTC
You can use "virsh uri" as a liveness test to validate that it has finished starting up. There is still a race in shutdown wrt currently running API calls, which is what the stack traces show.

The trace shows it is still executing an API call at the point where you are telling it to shut down. The stack trace shows:

                         #14 0x00007f4f6d8ce38f virConnectListAllInterfaces (libvirt.so.0 + 0x32838f)

This is an API call that is in progress for one of the clients you have connected. Meanwhile something else has told libvirtd to shut down.

So ideally you would fix the CI to not try to shut down libvirtd while you are still waiting for an API call to finish.
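
As a rough illustration of the "virsh uri" liveness test replacing a static sleep, a polling loop could look like the sketch below. The helper name `wait_for` and its one-second retry interval are assumptions made for this example. It only covers the startup half of the problem: it does nothing about API calls from other clients that are still in flight when the stop is issued.

```shell
#!/bin/sh
# Hypothetical helper: retry a probe command once per second until it
# succeeds or the timeout (in seconds) is reached.
# Usage: wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
wait_for() {
    timeout=$1
    shift
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        elapsed=$((elapsed + 1))
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1    # probe never succeeded within the timeout
        fi
        sleep 1
    done
    return 0
}
```

A CI script might then use `wait_for 30 virsh uri && systemctl stop libvirtd.service`, where the 30 second limit is an arbitrary choice.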

Comment 9 Martin Pitt 2020-05-07 11:06:17 UTC
I am experimenting with calling "virsh domifaddr 1" before stopping, to wait until that code path finished. I have no real idea what I am doing of course, but at least the test succeeded 7 times in a row locally. I have thrown it against the CI wall now: https://github.com/cockpit-project/cockpit/pull/14043

If that works, it would be a robust enough workaround for this. Thanks Daniel for your fast reply!
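
The experiment described in this comment amounts to something like the following sketch. The function name and the `VIRSH` / `SYSTEMCTL` override variables are hypothetical and exist only so the sequence can be exercised with stand-in commands. Whether "virsh domifaddr" is actually a sufficient barrier is not established in this report.

```shell
#!/bin/sh
# Hypothetical wrapper: issue one synchronous API call as a barrier, then
# stop the service. VIRSH and SYSTEMCTL can be overridden for dry runs.
VIRSH=${VIRSH:-virsh}
SYSTEMCTL=${SYSTEMCTL:-systemctl}

stop_libvirtd_after_barrier() {
    domain=${1:-1}    # domain ID or name; "1" matches the comment above
    # The result does not matter here; the point is only that the call returns.
    $VIRSH domifaddr "$domain" >/dev/null 2>&1 || true
    $SYSTEMCTL stop libvirtd.service
}
```

For example, `VIRSH=true SYSTEMCTL=echo stop_libvirtd_after_barrier` prints the stop command instead of running it.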

Comment 10 Fedora Program Management 2021-04-29 16:19:36 UTC
This message is a reminder that Fedora 32 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 32 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 11 Ben Cotton 2021-05-25 16:00:35 UTC
Fedora 32 changed to end-of-life (EOL) status on 2021-05-25. Fedora 32 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.
