Bug 2213660 - libvirt clients hang because virtnetworkd.service misses when virtnetworkd is dead
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 38
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
Duplicates: 2213257 2213584 2215823
 
Reported: 2023-06-08 22:17 UTC by Michael Riss
Modified: 2023-08-16 13:15 UTC
CC: 57 users

Fixed In Version: systemd-254~rc2-1.fc39 systemd-253.7-1.fc38
Last Closed: 2023-07-19 03:14:18 UTC




Links:
- GitHub systemd/systemd pull 28000 (Merged): Revert "core/service: when resetting PID also reset known flag", last updated 2023-06-26 12:16:30 UTC

Description Michael Riss 2023-06-08 22:17:58 UTC
Libvirt client commands such as `virsh` and `virt-manager` stop working after a while and hang indefinitely.


I think I have traced the problem to the way systemd determines whether a service is active. It seems to check whether any process survives in the spawned cgroup; the main process dying by itself is apparently not enough.

In this context that leads to the following bug:
Systemd starts virtnetworkd with `--timeout 120`, so after two minutes of inactivity virtnetworkd shuts down. However, virtnetworkd in most cases starts dnsmasq processes, which survive because they need to keep serving their virtual networks. Due to these dnsmasq processes, systemd considers virtnetworkd.service still active. So when clients (virsh, virt-manager) then connect to the unix sockets, systemd won't restart virtnetworkd and the clients hang indefinitely.

Commands that only access the other libvirt services keep working in this situation; e.g. `virsh list` still works.
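The failure mode described above can be sketched as a tiny shell model of the state check (my simplification of the observed behaviour, not systemd's actual code; `unit_state` is a hypothetical helper):

```shell
# Simplified model: a unit counts as "active" while ANY process is left
# in its cgroup, even after the main process has already exited.
unit_state() {
  # Arguments: PIDs still present in the unit's cgroup.
  if [ "$#" -gt 0 ]; then
    echo "active"     # surviving dnsmasq processes keep the unit active
  else
    echo "inactive"   # empty cgroup: socket activation can restart the daemon
  fi
}

unit_state 4242 4243   # virtnetworkd gone, two dnsmasq PIDs remain -> active
unit_state             # all processes gone -> inactive
```

With surviving dnsmasq PIDs the unit never leaves the "active" state, so connecting to the socket never triggers a restart of the daemon.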

Reproducible: Always

Steps to Reproduce:
1. Install a fresh Fedora 38 Everything Netinst into a VM with "Minimal System" and the latest updates as of 2023-06-08
2. Run `dnf install bash-completion @virtualization` (bash-completion is for convenience)
3. Reboot so that systemd properly picks up the libvirt service files
4. type `virsh net-list` to start up virtnetworkd.service - this command should work
5. quickly check systemd service status: `systemctl status virtnetworkd.service` - should be active and the virtnetworkd process should still be alive
6. wait 2 minutes or kill virtnetworkd
7. check systemd service status again: `systemctl status virtnetworkd.service`
   Note that the service is "active", but virtnetworkd is dead, only the dnsmasq process(es) maintain the service in the active state
8. run `virsh net-list` - it will hang (you may want to run it in a separate terminal to watch the remedy in the next step live)
9. kill all dnsmasq processes that keep the service in the active state
   => service becomes inactive
   => clients can now trigger systemd into starting the virtnetworkd again by connecting to the unix sockets, all hanging clients should come alive and successfully finish their tasks

Actual Results:  
After two minutes of idle time the virtnetworkd.service becomes unresponsive and all libvirt clients hang when trying to do anything related to the virtnetworkd.

Expected Results:  
Continuous availability of the virtnetworkd.service.

Comment 1 Michael Riss 2023-06-08 22:48:50 UTC
The relevant package versions are:
- systemd-253.5-1.fc38
- libvirt-daemon-driver-network-9.0.0-3.fc38

Comment 3 Cagney 2023-06-09 13:38:17 UTC
Restarting virtnetworkd via `sudo systemctl restart virtnetworkd` seems to get things unstuck, supporting this theory.

Comment 4 Cagney 2023-06-09 14:02:02 UTC
Workaround is presumably to override the timeout in /usr/lib/systemd/system/virtnetworkd.service by adding the file:

  # cat /etc/sysconfig/virtnetworkd
  VIRTNETWORKD_ARGS=

Comment 5 Villy Kruse 2023-06-09 20:45:37 UTC
(In reply to cagney from comment #3)
> restarting virtnetworkd vis `sudo systemctl restart virtnetworkd` seems to
> get things unstuck; supporting this theory.

As far as systemd is concerned, virtnetworkd.service is still active after the daemon itself has timed out and terminated. This is because the dnsmasq programs are still running, and that keeps the unit active.

If you kill the dnsmasq programs, the unit becomes inactive and can be restarted. Restarting virtnetworkd.service has the same effect.

I don't know what has changed, because dnsmasq has always been kept running after the libvirt* or virt* daemons timed out and terminated, and that has never been a problem before.

Comment 6 Bill Taroli 2023-06-12 18:20:17 UTC
*** Bug 2213584 has been marked as a duplicate of this bug. ***

Comment 7 redhat 2023-06-15 06:49:18 UTC
(In reply to cagney from comment #3)
> restarting virtnetworkd vis `sudo systemctl restart virtnetworkd` seems to
> get things unstuck; supporting this theory.

This only works for about 2 minutes after which everything is stuck again.

Comment 8 Bill Taroli 2023-06-15 07:06:54 UTC
That's not been my experience. I can restart the service once virsh has gotten stuck, or beforehand, and it eliminates the problem. This is true even hours later, when I issue `virsh shutdown <domain>` rather than bother to interact with the UI.

The one thing that could be different in my case is that my laptop is typically on a wired network, and I run a bridge in my NetworkManager config so that the VM can have its own network connection (not NAT). This breaks the WiFi use case, but then I'd switch the VM to use NAT instead.

YMMV, but I wanted to offer a bit of network context in case being on WiFi and the network state changing could retrigger the issue. That's not something that would typically happen in my situation.

Comment 9 Jarek Prokop 2023-06-15 07:45:25 UTC
I hit this with vagrant-libvirt on F38; I think it's the same issue as described by others here.

It uses rubygem-ruby-libvirt, which calls the C API when using a `qemu:///system` connection and trying to list all networks. Then it hangs.

Unfortunately vagrant-libvirt uses that call quite a lot, so vagrant with libvirt is next to unusable without babysitting and restarting the systemd service manually when needed.

This issue is not reproducible when making the equivalent calls against `qemu:///session` (though that returns an empty list, even in the CLI).

`virsh` equivalent to the API calls:
```
$ virsh -c qemu:///system net-list --all
```

This also happens on Rawhide.

Comment 11 Cagney 2023-06-15 10:59:27 UTC
redhat wrote:
> > restarting virtnetworkd vis `sudo systemctl restart virtnetworkd` seems to
> > get things unstuck; supporting this theory.

> This only works for about 2 minutes after which everything is stuck again.

That isn't surprising; it would work for 60 seconds, but that's enough when there's only one VM. A confirmed workaround is:
  https://bugzilla.redhat.com/show_bug.cgi?id=2213660#c4
which, to give fair credit, I got from Daniel Berrangé here: https://bugzilla.redhat.com/show_bug.cgi?id=2075736#c12

Comment 12 redhat 2023-06-15 12:40:21 UTC
(In reply to Cagney from comment #11)
> redhat wrote:
> > > restarting virtnetworkd vis `sudo systemctl restart virtnetworkd` seems to
> > > get things unstuck; supporting this theory.
> 
> > This only works for about 2 minutes after which everything is stuck again.
> 
> That isn't surprising, it would work for 60 seconds, but that's enough when
> there's only one VM.  A confirmed workaround is:
>   https://bugzilla.redhat.com/show_bug.cgi?id=2213660#c4
> which, to give fair credit, I got it from Daniel Berrangé here
> https://bugzilla.redhat.com/show_bug.cgi?id=2075736#c12

Yup, I applied this fix after posting my comment. Works fine for now.

Let's hope the systemd patch (https://github.com/systemd/systemd/issues/27953) will quickly be merged and built.

Comment 13 Bill Taroli 2023-06-15 13:42:17 UTC
Hmm... so this is potentially a dup of 2075736 as well?

I'll try the workaround from there (already applied, but I will be moving to the office soon, so I'll get a natural restart) and see if it clears up. The notes in 2075736 suggest not everyone saw it as a magic fix, though, and without tracing through them all it seems to have spawned at least one or two other reports in other modules.

Comment 14 Cagney 2023-06-15 14:00:26 UTC
(In reply to Bill Taroli from comment #13)
> Hmm... so this is potentially a dup of 2075736 as well?

No.  But the workaround is the same.

Bug 2075736 is about a catatonic (re)start delay when there are lots of networks; here the daemon never restarts.

Comment 15 alex 2023-06-20 03:18:38 UTC
*** Bug 2215823 has been marked as a duplicate of this bug. ***

Comment 16 Villy Kruse 2023-06-20 10:20:19 UTC
(In reply to Toolybird from comment #2)
> https://github.com/systemd/systemd/issues/27953

It looks like upstream systemd will fix this problem.

Comment 17 Brandon 2023-06-21 02:13:27 UTC
Fix committed here:
https://github.com/systemd/systemd/pull/28000

That chain mentions that the causal commit never hit a stable release, so does that mean Fedora can pick up the commit directly?

Also, it seems this bug is still assigned to libvirt, shouldn't that be moved to systemd?

See also https://github.com/systemd/systemd-stable/issues/302.

Comment 18 Dan Horák 2023-06-21 07:56:28 UTC
yup, moving to systemd ...

Comment 19 Dan Horák 2023-06-21 08:40:05 UTC
https://koji.fedoraproject.org/koji/taskinfo?taskID=102415123 is a systemd scratch build with the fix/revert applied, seems to fix the problem for me

Comment 20 Oleg Kochkin 2023-06-21 20:01:04 UTC
For me, the solution was to turn off the --timeout parameter in libvirtd:
echo LIBVIRTD_ARGS= > /etc/sysconfig/libvirtd && systemctl restart libvirtd

Comment 21 Ian Laurie 2023-06-21 21:29:34 UTC
The scratch build has been working here overnight, and no problems so far.  Seems to have fixed it for me also.

Comment 22 Nils Philippsen 2023-06-27 10:51:56 UTC
The scratch build seems to work here as well.

Comment 23 Dan Horák 2023-06-27 15:58:00 UTC
*** Bug 2213257 has been marked as a duplicate of this bug. ***

Comment 24 Jens Petersen 2023-07-03 11:11:28 UTC
Can the f38 package be patched for this until systemd-253.6 is released?

Comment 25 Christophe Fergeau 2023-07-04 08:12:32 UTC
(In reply to Jens Petersen from comment #24)
> Can the f38 package be patched for this until systemd-253.6 is released?

I second this, I was about to open a bug to request this :)

Comment 26 Cole Robinson 2023-07-05 18:07:41 UTC
(In reply to Christophe Fergeau from comment #25)
> (In reply to Jens Petersen from comment #24)
> > Can the f38 package be patched for this until systemd-253.6 is released?
> 
> I second this, I was about to open a bug to request this :)

Sure, if there's a simple libvirt workaround.

Or systemd backports the upstream revert? Some podman use case is affected too: https://github.com/containers/podman/issues/18862

Comment 27 Jens Petersen 2023-07-06 04:39:16 UTC
Yes I meant systemd

Comment 28 Christophe Fergeau 2023-07-06 07:53:44 UTC
Same here, I was also hoping for a backport of the upstream systemd revert.
From the libvirt side, users can work around this with `systemctl enable libvirtd.service` (no idea about the equivalent in modular libvirt), which is not very nice but does the job.

Comment 29 Villy Kruse 2023-07-06 08:04:37 UTC
(In reply to Christophe Fergeau from comment #28)
> Same here, I was also hoping for a backport of the upstream systemd revert.
> From the libvirt side, users can workaround this with `systemctl enable
> libvirtd.service` (no idea about the equivalent in modular libvirt), which
> is not very nice but does the job.

You need to either:
 -- disable the timeout defined in Environment=LIBVIRTD_ARGS="--timeout 120", so that the main process doesn't time out after 120 seconds, or
 -- change KillMode=process, so that dnsmasq is terminated when the main process is terminated after the timeout.
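The first option can be written as a systemd drop-in (a sketch; the drop-in path and file name are my choice, and the variable follows the Environment= line quoted above):

```ini
# /etc/systemd/system/libvirtd.service.d/no-timeout.conf  (hypothetical path)
[Service]
# Override the packaged Environment=LIBVIRTD_ARGS="--timeout 120"
Environment=LIBVIRTD_ARGS=
```

After creating the file, run `systemctl daemon-reload` and `systemctl restart libvirtd.service` to apply it.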

Comment 30 Daniel Berrangé 2023-07-06 08:50:05 UTC
(In reply to Villy Kruse from comment #29)

> You either need to
>  -- disable timeout as defined in Environment=LIBVIRTD_ARGS="--timeout 120"
> to make sure the main process doesn't time out after 120 seconds.
>  -- change the KillMode=process to make sure dnsmasq is terminated when the
> main process is terminated after timeout.

Libvirt already has KillMode=process set in its unit files; this systemd bug broke its KillMode=process logic.

Disabling the timeout is the only viable libvirt-side workaround, AFAIK.

Comment 31 Fedora Update System 2023-07-13 10:29:38 UTC
FEDORA-2023-d5c6ec6551 has been submitted as an update to Fedora 39. https://bodhi.fedoraproject.org/updates/FEDORA-2023-d5c6ec6551

Comment 32 Fedora Update System 2023-07-15 14:55:13 UTC
FEDORA-2023-eb0fed38ad has been submitted as an update to Fedora 39. https://bodhi.fedoraproject.org/updates/FEDORA-2023-eb0fed38ad

Comment 33 Fedora Update System 2023-07-15 18:39:55 UTC
FEDORA-2023-eb0fed38ad has been pushed to the Fedora 39 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 34 Jens Petersen 2023-07-17 15:21:36 UTC
Can we have 253.6 for F38?

Comment 35 Fedora Update System 2023-07-18 07:12:49 UTC
FEDORA-2023-b07a6a9665 has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2023-b07a6a9665

Comment 36 Christian Labisch 2023-07-18 10:40:33 UTC
253.7-1.fc38 has resolved the problem (for me) ... Thank you !

Comment 37 Fedora Update System 2023-07-19 03:14:18 UTC
FEDORA-2023-b07a6a9665 has been pushed to the Fedora 38 stable repository.
If problem still persists, please make note of it in this bug report.

