Libvirt client commands such as `virsh` and `virt-manager` stop working after a while and hang indefinitely. I think I have traced the problem to the way systemd determines whether a service is active: it appears to check whether at least one process survives in the spawned cgroup, so the main process dying is apparently not enough. In this context that leads to the following bug: systemd starts virtnetworkd with "--timeout 120", so after two minutes virtnetworkd shuts down. However, virtnetworkd in most cases starts dnsmasq processes, which survive because they need to keep serving their virtual networks. Due to these dnsmasq processes systemd considers virtnetworkd.service still active, so when clients (virsh, virt-manager) then connect to the unix sockets, systemd won't restart virtnetworkd and the clients hang indefinitely. Commands that solely access the other libvirt services keep working in this situation, e.g. `virsh list` works.

Reproducible: Always

Steps to Reproduce:
1. Install a fresh Fedora 38 Everything Netinst into a VM with "Minimal System" and the latest updates as of late 08.06.2023
2. Run `dnf install bash-completion @virtualization` (bash-completion is for convenience)
3. Reboot so that systemd properly picks up the libvirt service files
4. Type `virsh net-list` to start virtnetworkd.service - this command should work
5. Quickly check the systemd service status: `systemctl status virtnetworkd.service` - the unit should be active and the virtnetworkd process should still be alive
6. Wait 2 minutes or kill virtnetworkd
7. Check the systemd service status again: `systemctl status virtnetworkd.service` - note that the service is "active", but virtnetworkd is dead; only the dnsmasq process(es) keep the service in the active state
8. Run `virsh net-list` - it will hang (you may want to run it in a separate terminal so you can watch the remedy in the next step live)
9. Kill all dnsmasq processes that keep the service in the active state => the service becomes inactive => clients can now trigger systemd into starting virtnetworkd again by connecting to the unix sockets; all hanging clients should come alive and successfully finish their tasks

Actual Results:
After two minutes of idle time virtnetworkd.service becomes unresponsive and all libvirt clients hang when trying to do anything related to virtnetworkd.

Expected Results:
Continuous availability of virtnetworkd.service.
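The state machine described above can be sketched as a toy shell function. This is my own simplification for illustration, NOT systemd's actual implementation: the point is only that a unit counts as active while ANY process remains in its cgroup, even after the main process has exited.

```shell
# Toy model of the cgroup-based liveness check described above
# (my simplification, not systemd's real code).
# Arguments: the PIDs still present in the unit's cgroup.
unit_state() {
  if [ "$#" -gt 0 ]; then
    echo "active"      # any survivor (e.g. dnsmasq) keeps the unit active
  else
    echo "inactive"    # empty cgroup: socket activation can restart the daemon
  fi
}

unit_state 4242 4243   # virtnetworkd is gone, two dnsmasq PIDs remain -> active
unit_state             # after killing all dnsmasq processes -> inactive
```

In this model, killing the surviving dnsmasq PIDs is exactly step 9 above: the cgroup empties, the unit finally goes inactive, and socket activation can start virtnetworkd again.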
The relevant package versions are:
- systemd-253.5.1.fc38
- libvirt-daemon-driver-network-9.0.0.3.fc38
https://github.com/systemd/systemd/issues/27953
Restarting virtnetworkd via `sudo systemctl restart virtnetworkd` seems to get things unstuck, supporting this theory.
Workaround is presumably to override the timeout in /usr/lib/systemd/system/virtnetworkd.service by adding the file:

# cat /etc/sysconfig/virtnetworkd
VIRTNETWORKD_ARGS=
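An equivalent way to apply this override, for those who prefer a systemd drop-in over the sysconfig file, might look like the following sketch (the drop-in file name `no-timeout.conf` is my own choice, and this assumes the unit reads VIRTNETWORKD_ARGS via Environment=/EnvironmentFile=):

```ini
# /etc/systemd/system/virtnetworkd.service.d/no-timeout.conf
# (can be created with: systemctl edit virtnetworkd.service)
[Service]
# Clear the args so the daemon is started without "--timeout 120"
Environment=VIRTNETWORKD_ARGS=
```

Followed by `systemctl daemon-reload` and `systemctl restart virtnetworkd.service`. With no idle timeout, the daemon never exits, so systemd's broken state tracking is never exercised.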
(In reply to cagney from comment #3)
> restarting virtnetworkd vis `sudo systemctl restart virtnetworkd` seems to
> get things unstuck; supporting this theory.

As far as systemd is concerned, virtnetworkd.service is still active after the daemon itself has timed out and terminated. This is because the dnsmasq programs are still running, and that keeps the unit active. If you kill the dnsmasq programs, the unit will terminate and be restarted; restarting virtnetworkd.service has the same effect. I don't know what has changed, because dnsmasq has always been kept running after the libvirt* or virt* daemons have timed out and terminated, and that has never been a problem before.
*** Bug 2213584 has been marked as a duplicate of this bug. ***
(In reply to cagney from comment #3)
> restarting virtnetworkd vis `sudo systemctl restart virtnetworkd` seems to
> get things unstuck; supporting this theory.

This only works for about 2 minutes, after which everything is stuck again.
That's not been my experience. I can restart that service once virsh has gotten stuck, or beforehand, and it eliminates the problem. This is true even hours later, when I will issue "virsh shutdown <domain>" rather than bother to interact with the UI. The one thing that could be different in my case is that my laptop is typically on a wired network, and I run a bridge in my NM config so that I can have the VM on its own network connection (not NAT). This messes up the WiFi use case, but then I'd switch the VM to use NAT instead. YMMV, but I wanted to offer a bit of network context in case being on WiFi and the network state changing could retrigger the issue. That's not something I'd typically have happen in my situation.
I hit this with vagrant-libvirt on F38; I think it's the same issue as described by others here. It uses rubygem-ruby-libvirt, which calls the C API when using a `qemu:///system` connection and trying to list all networks. Then it hangs. Unfortunately vagrant-libvirt uses that call quite a lot, so vagrant with libvirt is next to unusable without babysitting it and restarting the systemd service manually when needed. This issue is not reproducible when making the equivalent calls against `qemu:///session` (though that returns an empty list, even in the CLI). The `virsh` equivalent to the API calls:
```
$ virsh -c qemu:///system net-list --all
```
This happens on rawhide as well.
redhat wrote:
> > restarting virtnetworkd vis `sudo systemctl restart virtnetworkd` seems to
> > get things unstuck; supporting this theory.
> This only works for about 2 minutes after which everything is stuck again.

That isn't surprising; it would work for 60 seconds, but that's enough when there's only one VM. A confirmed workaround is:
https://bugzilla.redhat.com/show_bug.cgi?id=2213660#c4
which, to give fair credit, I got from Daniel Berrangé here:
https://bugzilla.redhat.com/show_bug.cgi?id=2075736#c12
(In reply to Cagney from comment #11)
> That isn't surprising, it would work for 60 seconds, but that's enough when
> there's only one VM. A confirmed workaround is:
> https://bugzilla.redhat.com/show_bug.cgi?id=2213660#c4
> which, to give fair credit, I got it from Daniel Berrangé here
> https://bugzilla.redhat.com/show_bug.cgi?id=2075736#c12

Yup, I applied this fix after posting my comment. Works fine for now. Let's hope the systemd patch (https://github.com/systemd/systemd/issues/27953) will quickly be merged and built.
Hmm... so this is potentially a dup of 2075736 as well? I'll try the workaround from there (already applied, but I will be moving to the office soon, so I'll get a natural restart) and see if it clears up. The notes from 2075736 suggest not everyone saw it as a magic fix though, and without tracing through them all, it seems to have spawned at least one or two other reports in other modules.
(In reply to Bill Taroli from comment #13)
> Hmm... so this is potentially a dup of 2075736 as well?

No, but the workaround is the same. Bug 2075736 is about a catatonic (re)start delay when there were lots of networks; here the daemon never restarts.
*** Bug 2215823 has been marked as a duplicate of this bug. ***
(In reply to Toolybird from comment #2)
> https://github.com/systemd/systemd/issues/27953

It looks like upstream systemd will fix this problem.
Fix committed here: https://github.com/systemd/systemd/pull/28000
That chain mentions that the causal commit never hit a stable release, so does that mean Fedora can pick up the commit directly? Also, it seems this bug is still assigned to libvirt; shouldn't it be moved to systemd? See also https://github.com/systemd/systemd-stable/issues/302.
yup, moving to systemd ...
https://koji.fedoraproject.org/koji/taskinfo?taskID=102415123 is a systemd scratch build with the fix/revert applied, seems to fix the problem for me
For me, the solution was to turn off the --timeout parameter in libvirtd:

echo LIBVIRTD_ARGS= > /etc/sysconfig/libvirtd && systemctl restart libvirtd
The scratch build has been working here overnight, and no problems so far. Seems to have fixed it for me also.
The scratch build seems to work here as well.
*** Bug 2213257 has been marked as a duplicate of this bug. ***
Can the f38 package be patched for this until systemd-253.6 is released?
(In reply to Jens Petersen from comment #24)
> Can the f38 package be patched for this until systemd-253.6 is released?

I second this, I was about to open a bug to request this :)
(In reply to Christophe Fergeau from comment #25)
> (In reply to Jens Petersen from comment #24)
> > Can the f38 package be patched for this until systemd-253.6 is released?
>
> I second this, I was about to open a bug to request this :)

Sure, unless there's a simple libvirt workaround, or systemd backports the upstream revert? Some podman use case is affected too: https://github.com/containers/podman/issues/18862
Yes I meant systemd
Same here, I was also hoping for a backport of the upstream systemd revert. From the libvirt side, users can work around this with `systemctl enable libvirtd.service` (no idea about the equivalent in modular libvirt), which is not very nice but does the job.
(In reply to Christophe Fergeau from comment #28)
> Same here, I was also hoping for a backport of the upstream systemd revert.
> From the libvirt side, users can workaround this with `systemctl enable
> libvirtd.service` (no idea about the equivalent in modular libvirt), which
> is not very nice but does the job.

You either need to:
-- disable the timeout as defined in Environment=LIBVIRTD_ARGS="--timeout 120", to make sure the main process doesn't time out after 120 seconds, or
-- change KillMode=process, to make sure dnsmasq is terminated when the main process is terminated after the timeout.
(In reply to Villy Kruse from comment #29)
> You either need to
> -- disable timeout as defined in Environment=LIBVIRTD_ARGS="--timeout 120"
> to make sure the main process doesn't time out after 120 seconds.
> -- change the KillMode=process to make sure dnsmasq is terminated when the
> main process is terminated after timeout.

Libvirt already has KillMode=process set in its unit files; this systemd bug broke its KillMode=process logic. Disabling the timeout is the only viable libvirt-side workaround AFAIK.
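For context, the combination of settings being discussed looks roughly like the following sketch of virtnetworkd.service. This is a paraphrase from memory, not a verbatim copy of the Fedora unit file, so exact paths and values may differ:

```ini
[Service]
Type=notify
# Daemon exits after two idle minutes; socket activation should restart it on demand.
Environment=VIRTNETWORKD_ARGS=--timeout 120
EnvironmentFile=-/etc/sysconfig/virtnetworkd
ExecStart=/usr/sbin/virtnetworkd $VIRTNETWORKD_ARGS
# Only the main process is killed on stop; dnsmasq is deliberately left running.
# The systemd regression broke the unit-state tracking around this mode.
KillMode=process
```

The intended design is sound: the daemon times out, the unit goes inactive, and the next client connection to the socket respawns it. The regression made the surviving dnsmasq processes keep the unit "active", which defeats the respawn.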
FEDORA-2023-d5c6ec6551 has been submitted as an update to Fedora 39. https://bodhi.fedoraproject.org/updates/FEDORA-2023-d5c6ec6551
FEDORA-2023-eb0fed38ad has been submitted as an update to Fedora 39. https://bodhi.fedoraproject.org/updates/FEDORA-2023-eb0fed38ad
FEDORA-2023-eb0fed38ad has been pushed to the Fedora 39 stable repository. If problem still persists, please make note of it in this bug report.
Can we have 253.6 for F38?
FEDORA-2023-b07a6a9665 has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2023-b07a6a9665
253.7-1.fc38 has resolved the problem (for me) ... Thank you !
FEDORA-2023-b07a6a9665 has been pushed to the Fedora 38 stable repository. If problem still persists, please make note of it in this bug report.