Bug 2075736 - error: Disconnected from qemu:///system due to keepalive timeout
Summary: error: Disconnected from qemu:///system due to keepalive timeout
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: libvirt
Version: 37
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Libvirt Maintainers
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-04-15 03:41 UTC by Cagney
Modified: 2023-12-05 21:06 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-05 21:06:11 UTC
Type: Bug
Embargoed:


Attachments
Backtrace of /usr/sbin/virtqemud --timeout 120 (28.77 KB, text/plain)
2022-07-05 15:03 UTC, Cagney
virtnetworkd backtrace when stuck (10.88 KB, text/plain)
2022-07-11 12:35 UTC, Cagney
backtrace of virtqemud --timeout 120 when 2*<<virsh ... start ...>> blocked (12.65 KB, text/plain)
2022-07-21 17:23 UTC, Cagney
virtqemud blocking again (28.07 KB, text/plain)
2022-07-21 22:01 UTC, Cagney
virtqemud when virt-install blocked (21.93 KB, text/plain)
2022-07-21 23:57 UTC, Cagney

Description Cagney 2022-04-15 03:41:50 UTC
Description of problem:

A local machine creating and deleting VMs using virsh intermittently gets the error:

  error: Disconnected from qemu:///system due to keepalive timeout

(which is funny as there's no network).

Version-Release number of selected component (if applicable):

5.16.18-200.fc35.x86_64 #1 SMP PREEMPT Mon Mar 28 14:10:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
qemu-6.1.0-14.fc35.x86_64
libvirt-client-7.6.0-5.fc35.x86_64

How reproducible:

Intermittent.  Seemingly the more load the more likely.

Steps to Reproduce:

1. build a VM tree (15 VMs, let's say, with a qcow2 disk living in /tmp, pointing back to a common qcow2 image on SSD)
2. do the same 2 more times (so 45 domains?)
3. use virsh to boot and shutdown some subset of these domains in parallel


Additional info:



Simple commands like <<virsh console>> can also hang.

Comment 1 Daniel Berrangé 2022-04-19 11:16:33 UTC
This suggests that the libvirt daemon is not able to satisfy the keepalive requests from the libvirt clients.

I'm unclear exactly how many operations you have going on in parallel here, but it is possible you're exceeding the default maximum workers limit (which is '20') for long enough to trigger the keepalive timeout if VM startup is slow.

Try reconfiguring 'max_workers = 100' in /etc/libvirt/libvirtd.conf (or /etc/libvirt/virtqemud.conf if your install is using virtqemud instead of libvirtd).
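For example, a minimal sketch assuming the modular virtqemud daemon (for a monolithic install, edit /etc/libvirt/libvirtd.conf and restart libvirtd instead):

  # echo 'max_workers = 100' >> /etc/libvirt/virtqemud.conf
  # systemctl restart virtqemud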

Comment 2 Cagney 2022-06-24 17:26:21 UTC
Sorry, I missed your reply.

At times I would easily exceed 20 things happening at once, so I'll try a major increase in max_workers.

I also gather that virtqemud is now preferred over libvirtd.  I'll switch to that.

(Without these changes, on libvirt-client-8.1.0-2.fc36.x86_64, everything seems to take a second; I'll see if the above helps.)

Comment 3 Cagney 2022-07-05 13:48:46 UTC
Unfortunately, no luck.

After a fresh boot things are OK, but after a few days of machine use I'm back to seeing:

time sudo virsh --connect=qemu:///system destroy 'n1west'
real	0m49.720s
user	0m0.016s
sys	0m0.013s

where n1west is the only VM running.

Comment 4 Cagney 2022-07-05 14:30:02 UTC
Thread 2 (Thread 0x7f71fe7ff640 (LWP 2348627) "vshEventLoop"):
#0  0x00007f720d505bef in __GI___poll (fds=0x7f71f8004fd0, nfds=1, timeout=5000) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f720d7e423d in g_main_context_poll (priority=<optimized out>, n_fds=1, fds=0x7f71f8004fd0, timeout=<optimized out>, context=0x7f71f8001cb0) at ../glib/gmain.c:4516
#2  g_main_context_iterate.constprop.0 (context=context@entry=0x7f71f8001cb0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../glib/gmain.c:4206
#3  0x00007f720d78c940 in g_main_context_iteration (context=0x7f71f8001cb0, context@entry=0x0, may_block=may_block@entry=1) at ../glib/gmain.c:4276
#4  0x00007f720daf9bf4 in virEventGLibRunOnce () at ../src/util/vireventglib.c:512
#5  0x000055dec201f665 in vshEventLoop (opaque=0x7ffe7b734040) at ../tools/vsh.c:2029
#6  0x00007f720db5778b in virThreadHelper (data=<optimized out>) at ../src/util/virthread.c:256
#7  0x00007f720d48ce2d in start_thread (arg=<optimized out>) at pthread_create.c:442
#8  0x00007f720d512620 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7f720bee7ac0 (LWP 2348626) "virsh"):
#0  0x00007f720d505bef in __GI___poll (fds=0x55dec2198fe0, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f720d7e423d in g_main_context_poll (priority=<optimized out>, n_fds=2, fds=0x55dec2198fe0, timeout=<optimized out>, context=0x55dec219e1f0) at ../glib/gmain.c:4516
#2  g_main_context_iterate.constprop.0 (context=0x55dec219e1f0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../glib/gmain.c:4206
#3  0x00007f720d78e6cf in g_main_loop_run (loop=0x55dec219e2e0) at ../glib/gmain.c:4411
#4  0x00007f720dc15710 in virNetClientIOEventLoop (client=0x55dec219f010, thiscall=0x55dec218ed40) at ../src/rpc/virnetclient.c:1724
#5  0x00007f720dc170b1 in virNetClientIO (thiscall=0x55dec218ed40, client=0x55dec219f010) at ../src/rpc/virnetclient.c:1992
#6  virNetClientSendInternal (client=client@entry=0x55dec219f010, msg=msg@entry=0x55dec21a0b30, expectReply=expectReply@entry=true, nonBlock=nonBlock@entry=false) at ../src/rpc/virnetclient.c:2162
#7  0x00007f720dc173e7 in virNetClientSendWithReply (client=0x55dec219f010, msg=0x55dec21a0b30) at ../src/rpc/virnetclient.c:2190
#8  0x00007f720dc1194a in virNetClientProgramCall (prog=<optimized out>, client=0x55dec219f010, serial=7, proc=12, noutfds=<optimized out>, outfds=<optimized out>, ninfds=0x0, infds=0x0, args_filter=0x7f720dc4cc30 <xdr_remote_domain_destroy_args>, args=0x7ffe7b733ea0, ret_filter=0x7f720ce55220 <xdr_void>, ret=0x0) at ../src/rpc/virnetclientprogram.c:322
#9  0x00007f720dd62f62 in callFull (conn=<optimized out>, ret=0x0, ret_filter=0x7f720ce55220 <xdr_void>, args=0x7ffe7b733ea0 "\020\240\031\302\336U", args_filter=0x7f720dc4cc30 <xdr_remote_domain_destroy_args>, proc_nr=12, fdoutlen=0x0, fdout=0x0, fdinlen=0, fdin=0x0, flags=0, priv=0x55dec2198100) at ../src/remote/remote_driver.c:6554
#10 call.isra.0 (priv=priv@entry=0x55dec2198100, flags=flags@entry=0, proc_nr=proc_nr@entry=12, args_filter=0x7f720dc4cc30 <xdr_remote_domain_destroy_args>, args=args@entry=0x7ffe7b733ea0 "\020\240\031\302\336U", ret_filter=0x7f720ce55220 <xdr_void>, ret=0x0, conn=<optimized out>) at ../src/remote/remote_driver.c:6576
#11 0x00007f720dc6dcc2 in remoteDomainDestroy (dom=0x55dec219b500) at src/remote/remote_client_bodies.h:2515
#12 0x00007f720dd18996 in virDomainDestroy (domain=domain@entry=0x55dec219b500) at ../src/libvirt-domain.c:492
#13 0x000055dec1fed5e0 in cmdDestroy (ctl=0x7ffe7b734040, cmd=0x55dec2190690) at ../tools/virsh-domain.c:8263
#14 0x000055dec201e20c in vshCommandRun (ctl=0x7ffe7b734040, cmd=0x55dec2190690) at ../tools/vsh.c:1308
#15 0x000055dec1fd03a0 in main (argc=<optimized out>, argv=<optimized out>) at ../tools/virsh.c:904

Comment 5 Daniel Berrangé 2022-07-05 14:36:11 UTC
It is rarely the client side that's the problem here. If you can get a stack trace on the virtqemud daemon, across all threads, that might be more useful.
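For example (a sketch; this assumes gdb and the matching debuginfo packages are installed):

  # gdb -p $(pidof virtqemud) -batch -ex 'thread apply all bt' > virtqemud-backtrace.txt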

Comment 6 Cagney 2022-07-05 14:49:49 UTC
Which I was just doing :-)  Here's the thread in virtqemud that appears to be handling the request:

Thread 2 (Thread 0x7f05c53ff640 (LWP 2359130) "rpc-virtqemud"):
#0  0x00007f05d3f05bef in __GI___poll (fds=0x7f05bc005800, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f05d436e23d in g_main_context_poll (priority=<optimized out>, n_fds=2, fds=0x7f05bc005800, timeout=<optimized out>, context=0x7f05bc052320) at ../glib/gmain.c:4516
#2  g_main_context_iterate.constprop.0 (context=0x7f05bc052320, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../glib/gmain.c:4206
#3  0x00007f05d43186cf in g_main_loop_run (loop=0x7f05bc055030) at ../glib/gmain.c:4411
#4  0x00007f05d4615710 in virNetClientIOEventLoop (client=0x7f05bc05b400, thiscall=0x7f05bc05dc40) at ../src/rpc/virnetclient.c:1724
#5  0x00007f05d46170b1 in virNetClientIO (thiscall=0x7f05bc05dc40, client=0x7f05bc05b400) at ../src/rpc/virnetclient.c:1992
#6  virNetClientSendInternal (client=client@entry=0x7f05bc05b400, msg=msg@entry=0x7f05bc058610, expectReply=expectReply@entry=true, nonBlock=nonBlock@entry=false) at ../src/rpc/virnetclient.c:2162
#7  0x00007f05d46173e7 in virNetClientSendWithReply (client=0x7f05bc05b400, msg=0x7f05bc058610) at ../src/rpc/virnetclient.c:2190
#8  0x00007f05d461194a in virNetClientProgramCall (prog=<optimized out>, client=0x7f05bc05b400, serial=0, proc=66, noutfds=<optimized out>, outfds=<optimized out>, ninfds=0x0, infds=0x0, args_filter=0x7f05d496f220 <xdr_void>, args=0x0, ret_filter=0x7f05d465b5a0 <xdr_remote_auth_list_ret>, ret=0x7f05c53fe490) at ../src/rpc/virnetclientprogram.c:322
#9  0x00007f05d4762f62 in callFull (conn=<optimized out>, ret=0x7f05c53fe490 "", ret_filter=0x7f05d465b5a0 <xdr_remote_auth_list_ret>, args=0x0, args_filter=0x7f05d496f220 <xdr_void>, proc_nr=66, fdoutlen=0x0, fdout=0x0, fdinlen=0, fdin=0x0, flags=<optimized out>, priv=0x7f05bc01df20) at ../src/remote/remote_driver.c:6554
#10 call.isra.0 (priv=0x7f05bc01df20, flags=<optimized out>, proc_nr=66, args_filter=0x7f05d496f220 <xdr_void>, args=0x0, ret_filter=0x7f05d465b5a0 <xdr_remote_auth_list_ret>, ret=0x7f05c53fe490 "", conn=<optimized out>) at ../src/remote/remote_driver.c:6576
#11 0x00007f05d466440d in remoteAuthenticate (authtype=0x0, auth=<optimized out>, priv=0x7f05bc01df20, conn=0x7f05b4009350) at ../src/remote/remote_driver.c:3577
#12 doRemoteOpen (conn=0x7f05b4009350, priv=0x7f05bc01df20, driver_str=<optimized out>, transport=<optimized out>, auth=<optimized out>, conf=<optimized out>, flags=<optimized out>) at ../src/remote/remote_driver.c:1129
#13 0x00007f05d4666fb0 in remoteConnectOpen (conn=<optimized out>, auth=0x0, conf=0x7f05bc0178b0, flags=<optimized out>) at ../src/remote/remote_driver.c:1273
#14 0x00007f05d4716814 in virConnectOpenInternal (name=<optimized out>, auth=0x0, flags=0) at ../src/libvirt.c:1132
#15 0x00007f05d4716f94 in virConnectOpen (name=name@entry=0x7f05bc04fd10 "network:///system") at ../src/libvirt.c:1208
#16 0x00007f05d471728d in virGetConnectGeneric (threadPtr=<optimized out>, name=0x7f05d4783be6 "network") at ../src/driver.c:155
#17 0x00007f05c1350235 in qemuProcessStop (driver=0x7f05780226a0, vm=0x7f05788b3910, reason=VIR_DOMAIN_SHUTOFF_DESTROYED, asyncJob=QEMU_ASYNC_JOB_NONE, flags=<optimized out>) at ../src/qemu/qemu_process.c:8230
#18 0x00007f05c12c6bf2 in qemuDomainDestroyFlags (dom=<optimized out>, flags=<optimized out>) at ../src/qemu/qemu_driver.c:2117
#19 0x00007f05d4718996 in virDomainDestroy (domain=domain@entry=0x7f0590001b30) at ../src/libvirt-domain.c:492
#20 0x0000555872bbba8e in remoteDispatchDomainDestroy (server=0x555873eca880, msg=0x555873ec3420, args=<optimized out>, rerr=0x7f05c53fe970, client=0x555873ed7520) at src/remote/remote_daemon_dispatch_stubs.h:5326
#21 remoteDispatchDomainDestroyHelper (server=0x555873eca880, client=0x555873ed7520, msg=0x555873ec3420, rerr=0x7f05c53fe970, args=<optimized out>, ret=0x0) at src/remote/remote_daemon_dispatch_stubs.h:5305
#22 0x00007f05d461ae1f in virNetServerProgramDispatchCall (msg=0x555873ec3420, client=0x555873ed7520, server=0x555873eca880, prog=0x555873ecbc10) at ../src/rpc/virnetserverprogram.c:428
#23 virNetServerProgramDispatch (prog=0x555873ecbc10, server=0x555873eca880, client=0x555873ed7520, msg=0x555873ec3420) at ../src/rpc/virnetserverprogram.c:302
#24 0x00007f05d461b332 in virNetServerProcessMsg (srv=srv@entry=0x555873eca880, client=0x555873ed7520, prog=<optimized out>, msg=0x555873ec3420) at ../src/rpc/virnetserver.c:140
#25 0x00007f05d461b441 in virNetServerHandleJob (jobOpaque=0x555873edffe0, opaque=0x555873eca880) at ../src/rpc/virnetserver.c:160
#26 0x00007f05d455827d in virThreadPoolWorker (opaque=<optimized out>) at ../src/util/virthreadpool.c:164
#27 0x00007f05d455778b in virThreadHelper (data=<optimized out>) at ../src/util/virthread.c:256
#28 0x00007f05d3e8ce2d in start_thread (arg=<optimized out>) at pthread_create.c:442
#29 0x00007f05d3f12620 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

and ...

Comment 7 Cagney 2022-07-05 15:03:10 UTC
Created attachment 1894706 [details]
Backtrace of /usr/sbin/virtqemud --timeout 120

Full backtrace of virtqemud trying to destroy a domain (note this is from a separate run).

Comment 8 Daniel Berrangé 2022-07-05 16:59:57 UTC
This is the key line:

#15 0x00007f05d4716f94 in virConnectOpen (name=name@entry=0x7f05bc04fd10 "network:///system") at ../src/libvirt.c:1208

It is trying to connect to virtnetworkd and for some reason getting stuck waiting to get a list of auth methods.

This is peculiar because when virtqemud is running as root, the default authentication by virtnetworkd should short-circuit and report that no auth is even needed.

Does 'virsh -c network:///system net-list' work, or does it also get stuck? I guess if we open a connection to virtnetworkd on every VM shutdown, you might need to increase 'max_workers' in virtnetworkd.conf too, to allow greater concurrency.
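A sketch of the virtnetworkd side (assuming the stock config file location):

  # echo 'max_workers = 100' >> /etc/libvirt/virtnetworkd.conf
  # systemctl restart virtnetworkd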

Comment 9 Cagney 2022-07-05 18:29:13 UTC
(I checked, yes virtnetworkd is running)

I rebooted the machine.  It will be a day or so before the symptoms return and I can work through your suggestions.

Two things immediately worth noting:
- there are a lot of networks:
  $ time sudo virsh -c network:///system net-list | wc -l
  371
  real	0m0.164s
  user	0m0.106s
  sys	0m0.047s
  (remember this is after a fresh reboot so things are working)
- while the machine has lots of VMs defined, the hang was happening with only one running

Comment 10 Daniel Berrangé 2022-07-06 07:52:14 UTC
Wow, that's an impressively large number of virtual networks. It might be worth grabbing a stack trace of virtnetworkd next time this happens too.

Comment 11 Cagney 2022-07-11 12:35:07 UTC
Created attachment 1896053 [details]
virtnetworkd backtrace when stuck

This one seems different; I guess next time I get to backtrace whatever is running:

Thread 17 (Thread 0x7f66055f7640 (LWP 1213975) "daemon-init"):
#0  0x00007f6620305bef in poll () at /lib64/libc.so.6
#1  0x00007f66208fadbb in virCommandProcessIO () at /lib64/libvirt.so.0
#2  0x00007f66208fee32 in virCommandRun () at /lib64/libvirt.so.0
#3  0x00007f662090f58e in virFirewallApplyRule () at /lib64/libvirt.so.0
#4  0x00007f662090f9e1 in virFirewallApply () at /lib64/libvirt.so.0
#5  0x00007f661ed00f1e in networkRemoveFirewallRules () at /usr/lib64/libvirt/connection-driver/libvirt_driver_network.so
#6  0x00007f661ed00fd8 in networkReloadFirewallRulesHelper () at /usr/lib64/libvirt/connection-driver/libvirt_driver_network.so
#7  0x00007f66209d94ee in virNetworkObjListForEachHelper () at /lib64/libvirt.so.0
#8  0x00007f6620910e9d in virHashForEachSafe () at /lib64/libvirt.so.0
#9  0x00007f66209e4285 in virNetworkObjListForEach () at /lib64/libvirt.so.0
#10 0x00007f661ecff904 in networkReloadFirewallRules () at /usr/lib64/libvirt/connection-driver/libvirt_driver_network.so
#11 0x00007f661ecffd2e in networkStateInitialize () at /usr/lib64/libvirt/connection-driver/libvirt_driver_network.so
#12 0x00007f6620b15f81 in virStateInitialize () at /lib64/libvirt.so.0
#13 0x0000562aaa4181c7 in daemonRunStateInit ()
#14 0x00007f662095778b in virThreadHelper () at /lib64/libvirt.so.0
#15 0x00007f662028ce2d in start_thread () at /lib64/libc.so.6
#16 0x00007f6620312620 in clone3 () at /lib64/libc.so.6

Comment 12 Daniel Berrangé 2022-07-13 09:43:01 UTC
Hmmm, interesting trace - this is during startup of virtnetworkd, when it re-creates all the firewall rules for active virtual networks. With 371 virtual networks I imagine this will take some time. I don't quite understand how it would trigger the keepalive yet, but this feels relevant.

Also because we switched the daemons to be socket activated, and auto-shutdown after idle, the virtnetworkd could be repeatedly starting up and shutting down, meaning you hit any startup delay repeatedly.

Two possible workarounds:

 * Create /etc/sysconfig/virtnetworkd and set VIRTNETWORKD_ARGS="". This ought to stop virtnetworkd getting the '--timeout' arg added, so it stays running.
 * Edit /etc/libvirt/virtqemud.conf and set keepalive_interval=-1 (or /etc/libvirt/virtnetworkd.conf, if it turns out the keepalive errors are just being propagated back from virtnetworkd, though I don't think that's the case).
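Concretely, the two workarounds might look like this (a sketch, using the stock Fedora paths):

  # cat /etc/sysconfig/virtnetworkd
  VIRTNETWORKD_ARGS=""
  # systemctl daemon-reload
  # systemctl restart virtnetworkd

  # echo 'keepalive_interval = -1' >> /etc/libvirt/virtqemud.conf
  # systemctl restart virtqemud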

As a real fix, I think we might want to disable auto-shutdown of virtnetworkd when any virtual networks are active, as we really don't want that startup overhead every time someone connects to it.

Comment 13 Cagney 2022-07-18 14:39:20 UTC
(In reply to Daniel Berrangé from comment #12)

> Also because we switched the daemons to be socket activated, and
> auto-shutdown after idle, the virtnetworkd could be repeatedly starting up
> and shutting down, meaning you hit any startup delay repeatedly.

This fits the behaviour I'm seeing - the first command in the morning is slow.  I'll try the (second) workaround (first).

Comment 14 Cagney 2022-07-18 18:53:22 UTC
FYI, adding keepalive=-1 to /etc/libvirt/virtqemud.conf doesn't seem to have helped.  virt-install is still really slow (2 minutes to create/boot a VM).  Trying alternatives.

To re-cap the issues:

- taking a long time to do the first thing - the working theory is virtnetworkd restarting and spending a lot of time re-creating firewall rules.
  Could I disable the re-creation of firewall rules?  Partially as a workaround and partially to confirm the theory.

- taking ~1s to do things like delete domains (which is normally instant)

Comment 15 Daniel Berrangé 2022-07-19 08:24:09 UTC
Oops, I should have been clearer - setting keepalive_interval=-1 wouldn't speed up the commands, but it should at least have got rid of the "connection closed due to keepalive" errors. IOW, made it slow, but reliable.

To make it not slow, I think removing the '--timeout' arg from virtnetworkd will be the only immediate workaround.

Comment 16 Cagney 2022-07-21 15:56:30 UTC
FYI, even with:

# cat /etc/sysconfig/virtnetworkd 
VIRTNETWORKD_ARGS=""

I'm still seeing commands like:

sudo virt-install \
	--connect=qemu:///system --check=path_in_use=off --graphics=none --virt-type=kvm --noreboot --console=pty,target_type=serial --vcpus=8 --memory=9216  --cpu=host-passthrough --network=network:swandefault,model=virtio --rng=type=random,device=/dev/random --security=type=static,model=dac,label='1000:107',relabel=yes --filesystem=target=pool,type=mount,accessmode=squash,source=/home/pool \
	--filesystem=target=source,type=mount,accessmode=squash,source=/home/libreswan/wip-nss \
	--filesystem=target=testing,type=mount,accessmode=squash,source=/home/libreswan/wip-nss/testing \
	--name=n.fedora \
	--os-variant=fedora36 \
	--disk=cache=writeback,path=/home/pool/n.fedora.qcow2 \
	--import \
	--noautoconsole

block (or be really, really slow; I'm not sure, as it unblocked while I was backtracing everything).

Comment 17 Daniel Berrangé 2022-07-21 16:22:42 UTC
Can you double check that 'virtnetworkd' really has been started /without/ the --timeout flag?

I wonder if a 'systemctl daemon-reload' was needed to pick up the config file change.

I've managed to reproduce the problem myself now, and it is definitely caused by the huge number of virtual networks, and removing the --timeout flag should mean the problem only hits you /once/ the first time connecting at host boot up.
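One way to double check (a sketch):

  # ps -o args= -C virtnetworkd

which should show /usr/sbin/virtnetworkd with no '--timeout' argument; if '--timeout' is still there, a 'systemctl daemon-reload' followed by 'systemctl restart virtnetworkd' should pick up the sysconfig change.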

Comment 18 Cagney 2022-07-21 17:17:02 UTC
Yes, the --timeout flag was absent, and the machine is rebooted after fiddling with anything systemd related.

I've also since removed all the timeout parameters as in:

# grep ARGS /etc/sysconfig/virt*
/etc/sysconfig/virtnetworkd:VIRTNETWORKD_ARGS=""
/etc/sysconfig/virtqemud:VIRTQEMUD_ARGS=""
/etc/sysconfig/virtstoraged:VIRTSTORAGED_ARGS=""

# ps auxwwww | grep sbin/virt
root        1224  0.5  0.0 1580424 22628 ?       Ssl  11:57   0:23 /usr/sbin/virtnetworkd
root        3452 17.3  0.1 3525256 90024 ?       Ssl  11:57  12:42 /usr/sbin/virtqemud
root        3453  0.0  0.0 1758124 43920 ?       Ssl  11:57   0:00 /usr/sbin/virtstoraged
root       20802  0.0  0.0 252540 16760 ?        Ss   12:02   0:00 /usr/sbin/virtlogd

as I'm wondering if virtstoraged has similar issues.

Comment 19 Cagney 2022-07-21 17:23:00 UTC
Created attachment 1898540 [details]
backtrace of virtqemud --timeout 120 when 2*<<virsh ... start ...>> blocked

Two parallel <<virsh ... start ...>> commands hung.
Pulled the backtrace for virtqemud and it seemed to unblock.

Comment 20 Daniel Berrangé 2022-07-21 18:38:50 UTC
That seems to be stuck in glibc doing username lookups. Is there anything unusual about your setup in this regard? Any non-standard things in /etc/nsswitch.conf?

Comment 21 Cagney 2022-07-21 22:01:05 UTC
Created attachment 1898612 [details]
virqemu blocking again

Same as the last attachment, except this time it was left for an hour or so.  Which means virtqemud is somehow stuck, and pulling a stack trace (which can trigger EAGAINs et al.) unstuck it.  Once unstuck, things seem to run OK.

Comment 22 Cagney 2022-07-21 22:09:13 UTC
There should be nothing unusual in nsswitch.conf (content below).
Only unusual thing is invoking virsh using sudo (which has been working for years).
Could there be yet another daemon lurking?  I believe I followed this https://libvirt.org/daemons.html#switching-to-modular-daemons

$ cat /etc/nsswitch.conf
# Generated by authselect on Sat Jun 11 09:42:36 2022
# Do not modify this file manually, use authselect instead. Any user changes will be overwritten.
# You can stop authselect from managing your configuration by calling 'authselect opt-out'.
# See authselect(8) for more details.

# In order of likelihood of use to accelerate lookup.
passwd:     files sss systemd
shadow:     files
group:      files sss systemd
hosts:      files myhostname mdns4_minimal [NOTFOUND=return] resolve [!UNAVAIL=return] dns
services:   files sss
netgroup:   files sss
automount:  files sss

aliases:    files
ethers:     files
gshadow:    files
networks:   files dns
protocols:  files
publickey:  files
rpc:        files

Comment 23 Cagney 2022-07-21 23:57:03 UTC
Created attachment 1898643 [details]
virtqemud when virt-install blocked

Like before, virt-install hung, but with all of the following in place:

# grep ARGS /etc/sysconfig/virt*
/etc/sysconfig/virtnetworkd:VIRTNETWORKD_ARGS=""
/etc/sysconfig/virtqemud:VIRTQEMUD_ARGS=""
/etc/sysconfig/virtstoraged:VIRTSTORAGED_ARGS=""

gstack virtqemud unblocked it.

(the machine was also running lots of other VMs but I don't believe they were blocked)

I guess workaround part two is to occasionally pull virtqemud's stack.
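(e.g. something like

  # gstack $(pidof virtqemud) > /tmp/virtqemud.stack

assuming gstack from the gdb package, as used above)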

Comment 24 Cagney 2022-07-27 14:38:49 UTC
I created a separate bug for the qemud deadlock: #2111582.
This bug can track the network startup issue.

Comment 25 Cagney 2022-07-30 02:08:30 UTC
Too much #github.  Make that bug 2111582, but it looks like it is also bug 2109451.

This leaves this bug with just the slow-start problem.

Comment 26 Cagney 2022-09-01 12:28:03 UTC
With bug 2111582 hopefully fixed, this bug is down to:
- slow-start problems; worked around
- after 24hrs, operations such as destroying a VM start taking 1s (this is with the modular libvirt)
Hints on how to pull a backtrace for the second would be welcome (there's only 1s to play with).

Comment 27 Cagney 2022-09-01 12:41:51 UTC
Running 'sudo systemctl restart virtqemud' gets rid of the 1s delay.

Comment 28 Cagney 2022-09-02 17:59:28 UTC
I created bug 2123828 for the slowdown problem.

Comment 29 Ben Cotton 2022-11-29 18:45:28 UTC
This message is a reminder that Fedora Linux 35 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 35 on 2022-12-13.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '35'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 35 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 30 Ben Cotton 2022-12-13 17:43:08 UTC
Fedora Linux 35 entered end-of-life (EOL) status on 2022-12-13.

Fedora Linux 35 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 31 Cagney 2023-06-15 14:16:07 UTC
FYI, a clean install of f38 with no workaround seems to work much better.  I'd assume (although I can't point to specific changes) it is the result of upstream libvirt work.

Comment 32 Aoife Moloney 2023-11-23 00:11:07 UTC
This message is a reminder that Fedora Linux 37 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 37 on 2023-12-05.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '37'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 37 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 33 Aoife Moloney 2023-12-05 21:06:11 UTC
Fedora Linux 37 entered end-of-life (EOL) status on 2023-12-05.

Fedora Linux 37 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

