Description of problem:

[tried on ppc64le]

'virsh start RHEL8.3' gets stuck when run in user mode.

$ virsh dumpxml RHEL8.3
<domain type='kvm'>
  <name>RHEL8.3</name>
  <uuid>8a29b16c-0652-476e-9a91-694e1ad0b0a5</uuid>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='ppc64le' machine='pseries-rhel8.2.0'>hvm</type>
    <bootmenu enable='yes'/>
  </os>
  <features>
    <ccf-assist state='off'/>
    <cfpc value='broken'/>
  </features>
  <cpu mode='host-model' check='none'/>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/RHEL-8.3.0-20200909.1-ppc64le-dvd1.iso'/>
      <backingStore/>
      <target dev='sda' bus='scsi'/>
      <readonly/>
      <boot order='1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/RHEL8.3.qcow2'/>
      <backingStore/>
      <target dev='sdb' bus='scsi'/>
      <boot order='2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <controller type='usb' index='0' model='pci-ohci'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <model name='spapr-pci-host-bridge'/>
      <target index='0'/>
    </controller>
    <controller type='scsi' index='0' model='ibmvscsi'>
      <address type='spapr-vio' reg='0x00002000'/>
    </controller>
    <serial type='pty'>
      <target type='spapr-vio-serial' port='0'>
        <model name='spapr-vty'/>
      </target>
      <address type='spapr-vio' reg='0x30000000'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
      <address type='spapr-vio' reg='0x30000000'/>
    </console>
    <input type='keyboard' bus='usb'>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='usb'>
      <address type='usb' bus='0' port='2'/>
    </input>
    <graphics type='vnc' port='-1' autoport='yes' listen='127.0.0.1'>
      <listen type='address' address='127.0.0.1'/>
    </graphics>
    <video>
      <model type='vga' vram='16384' heads='1' primary='yes'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </memballoon>
    <panic model='pseries'/>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'/>
</domain>

Version-Release number of selected component (if applicable):

$ virsh version
Compiled against library: libvirt 6.6.0
Using library: libvirt 6.6.0
Using API: QEMU 6.6.0
Running hypervisor: QEMU 5.1.0

Steps to Reproduce:
1. virsh define RHEL8.3.xml
2. virsh start RHEL8.3
3.

Actual results:
virsh is now stuck. No QEMU process is running.

Expected results:
QEMU should be running.

Additional info:

stefanb  4868  1.0  0.0   68096  25728 ?  S   13:47  0:00 /usr/sbin/virtlogd --timeout=120
stefanb  4871  0.0  0.0 1605696  28544 ?  Ss  13:47  0:00 /usr/sbin/libvirtd --timeout=120
stefanb  4872  0.0  0.0       0      0 ?  Z   13:47  0:00 [rpc-worker] <defunct>

I need to kill 4871 to be able to get the console prompt back.
*** Bug 1882084 has been marked as a duplicate of this bug. ***
I'll ask Daniel Barboza to take a look at this one. In the meantime, Stefan, it would help if you can tell us if this is a regression or not.
I also noticed that in system mode, 'virsh list' works fine when run the first time (after machine start), but 'virsh start' then takes a very long time (~35s), and after that 'virsh list --all' will also take a very long time.
The user-mode core dump reveals the following backtrace:

[stefanb@boston-1 ~]$ gdb /tmp/core.6227
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-12.el8
[...]
(gdb) thread apply all bt

Thread 1 (Thread 0x7fff9bbeec80 (LWP 6227)):
#0  0x00007fff9f756ec0 in read () at /lib64/power9/libpthread.so.0
#1  0x00007fffa08cf998 in read (__nbytes=1, __buf=0x7fff9bbed470, __fd=3) at /usr/include/bits/unistd.h:44
#2  0x00007fffa08cf998 in saferead (fd=<optimized out>, buf=0x7fff9bbed470, count=1) at ../../src/util/virfile.c:1075
#3  0x00007fffa08b5e28 in virExec (cmd=0x7fff7402f4d0) at ../../src/util/vircommand.c:780
#4  0x00007fffa08b5e28 in virCommandRunAsync (cmd=0x7fff7402f4d0, pid=<optimized out>) at ../../src/util/vircommand.c:2619
#5  0x00007fffa08b6960 in virCommandRun (cmd=0x7fff7402f4d0, exitstatus=0x0) at ../../src/util/vircommand.c:2445
#6  0x00007fff9109e3e4 in qemuProcessLaunch (conn=0x7fff7c00b650, driver=0x7fff4411a670, vm=0x1535e1230, asyncJob=<optimized out>, incoming=0x0, snapshot=0x0, vmop=<optimized out>, flags=<optimized out>) at ../../src/qemu/qemu_process.c:6835
#7  0x00007fff910a3874 in qemuProcessStart (conn=0x7fff7c00b650, driver=0x7fff4411a670, vm=0x1535e1230, updatedCPU=0x0, asyncJob=<optimized out>, migrateFrom=<optimized out>, migrateFd=<optimized out>, migratePath=0x0, snapshot=0x0, vmop=VIR_NETDEV_VPORT_PROFILE_OP_CREATE, flags=17) at ../../src/qemu/qemu_process.c:7181
#8  0x00007fff91129ff0 in qemuDomainObjStart (conn=0x7fff7c00b650, driver=driver@entry=0x7fff4411a670, vm=0x1535e1230, flags=flags@entry=0, asyncJob=QEMU_ASYNC_JOB_START) at ../../src/qemu/qemu_driver.c:7531
#9  0x00007fff9112a778 in qemuDomainCreateWithFlags (dom=0x7fff8000b8b0, flags=<optimized out>) at ../../src/qemu/qemu_driver.c:7582
#10 0x00007fffa0bd7b60 in virDomainCreate (domain=0x7fff8000b8b0) at ../../src/libvirt-domain.c:6531
#11 0x000000011c3bc40c in remoteDispatchDomainCreate (server=0x1535dd8d0, msg=0x15363bf30, args=0x7fff7400ca60, rerr=0x7fff9bbee0c0, client=<optimized out>) at ./remote/remote_daemon_dispatch_stubs.h:4894
#12 0x000000011c3bc40c in remoteDispatchDomainCreateHelper (server=0x1535dd8d0, client=<optimized out>, msg=0x15363bf30, rerr=0x7fff9bbee0c0, args=0x7fff7400ca60, ret=0x0) at ./remote/remote_daemon_dispatch_stubs.h:4873
#13 0x00007fffa0a9acd0 in virNetServerProgramDispatchCall (msg=0x15363bf30, client=0x15365d1f0, server=0x1535dd8d0, prog=0x15363a010) at ../../src/rpc/virnetserverprogram.c:430
#14 0x00007fffa0a9acd0 in virNetServerProgramDispatch (prog=0x15363a010, server=0x1535dd8d0, client=0x15365d1f0, msg=0x15363bf30) at ../../src/rpc/virnetserverprogram.c:302
#15 0x00007fffa0aa2b60 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x1535dd8d0) at ../../src/rpc/virnetserver.c:137
#16 0x00007fffa0aa2b60 in virNetServerHandleJob (jobOpaque=0x15365c9e0, opaque=0x1535dd8d0) at ../../src/rpc/virnetserver.c:154
#17 0x00007fffa0951d2c in virThreadPoolWorker (opaque=<optimized out>) at ../../src/util/virthreadpool.c:163
#18 0x00007fffa0950d28 in virThreadHelper (data=<optimized out>) at ../../src/util/virthread.c:233
#19 0x00007fff9f7487c8 in start_thread () at /lib64/power9/libpthread.so.0
#20 0x00007fff9f5f0508 in clone () at /lib64/power9/libc.so.6
(gdb)
I am able to reproduce this problem with RHEL-AV 8.3 Libvirt and also with upstream Libvirt.

The problem is that we're trying to read the pidfile of the child process before the child writes it, causing the parent process to hang indefinitely. This causes not only 'virsh start' to hang, but every other user-mode 'virsh' command as well, because the previously spawned libvirtd process is still there, stuck in a loop.

I just posted a patch with a proposed fix:

https://www.redhat.com/archives/libvir-list/2020-October/msg00026.html
Great!

I have now tried to rebuild the RHAV 8.3 rpm with this one patch to test it (with all the other applied patches). I am running into the following build error using 'rpmbuild -ba ~/rpmbuild/SPECS/libvirt.spec':

[...]
Traceback (most recent call last):
  File "/usr/bin/dtrace", line 445, in <module>
    sys.exit(main())
  File "/usr/bin/dtrace", line 432, in main
    os.remove(fname)
FileNotFoundError: [Errno 2] No such file or directory: '../src/libvirt_probes.o.dtrace-temp.c'
make[3]: *** [Makefile:14920: ../src/libvirt_probes.o] Error 1
make[3]: *** Waiting for unfinished jobs....
[...]

Am I missing a dependency?
(In reply to Stefan Berger from comment #7)
> Great!
>
> I have now tried to rebuild the RHAV 8.3 rpm with this one patch to test it
> (with all the other applied patches). I am running into the following build
> error using 'rpmbuild -ba ~/rpmbuild/SPECS/libvirt.spec':
>
> [...]
> Traceback (most recent call last):
>   File "/usr/bin/dtrace", line 445, in <module>
>     sys.exit(main())
>   File "/usr/bin/dtrace", line 432, in main
>     os.remove(fname)
> FileNotFoundError: [Errno 2] No such file or directory:
> '../src/libvirt_probes.o.dtrace-temp.c'
> make[3]: *** [Makefile:14920: ../src/libvirt_probes.o] Error 1
> make[3]: *** Waiting for unfinished jobs....
> [...]
>
> Am I missing a dependency?

I think so. I just managed to create a RHAV-8.3 Libvirt RPM by doing the following, on a Power9 RHEL 8.3 box:

- installed the srpm package
- installed all the build dependencies using 'dnf build-dep'
- added the patch in the rpmbuild/SOURCES dir
- changed rpmbuild/SPECS/libvirt.spec to add the patch to the build:

(...)
Patch55: 0001-vircommand.c-write-child-pidfile-before-process-tuni.patch

- created the RPM with "rpmbuild -ba libvirt.spec" inside the rpmbuild/SPECS dir
Thanks for building the RPMs. The patch solves the user-mode issue.
Stefan,

(In reply to Stefan Berger from comment #3)
> I also now noticed that in system mode 'virsh list' when run the first time
> (after machine start) works well but 'virsh start' then takes a very long
> time (~35s) and after that 'virsh list --all' also will take a very long
> time.

This one I wasn't able to reproduce, even without the patch that fixed the user-mode problem. What I did:

- rebooted the host
- logged in as root and tried a few 'virsh' commands:

[root@ltc-boston115 ~]# virsh list
 Id   Name   State
--------------------

[root@ltc-boston115 ~]# virsh list --all
 Id   Name      State
--------------------------
 -    RHEL8.3   shut off
 -    vm1       shut off

[root@ltc-boston115 ~]# virsh start RHEL8.3
Domain RHEL8.3 started

[root@ltc-boston115 ~]# virsh list --all
 Id   Name      State
--------------------------
 1    RHEL8.3   running
 -    vm1       shut off

[root@ltc-boston115 ~]# virsh destroy RHEL8.3
Domain RHEL8.3 destroyed

[root@ltc-boston115 ~]# virsh list --all
 Id   Name      State
--------------------------
 -    RHEL8.3   shut off
 -    vm1       shut off

All of them worked as expected. No huge delays/timeouts.
I also tried the same test using a non-root user accessing qemu:///system (which is considered system mode as well), and haven't seen any problems either:

[danielhb@ltc-boston115 ~]$ virsh --c "qemu:///system" list
 Id   Name   State
--------------------

[danielhb@ltc-boston115 ~]$ virsh --c "qemu:///system" list --all
 Id   Name      State
--------------------------
 -    RHEL8.3   shut off
 -    vm1       shut off

[danielhb@ltc-boston115 ~]$ virsh --c "qemu:///system" start RHEL8.3
Domain RHEL8.3 started

[danielhb@ltc-boston115 ~]$ virsh --c "qemu:///system" list --all
 Id   Name      State
--------------------------
 1    RHEL8.3   running
 -    vm1       shut off

[danielhb@ltc-boston115 ~]$ virsh --c "qemu:///system" destroy RHEL8.3
Domain RHEL8.3 destroyed

[danielhb@ltc-boston115 ~]$ virsh --c "qemu:///system" list --all
 Id   Name      State
--------------------------
 -    RHEL8.3   shut off
 -    vm1       shut off

I guess I'll need more info to see what I can do in this case. Can you grab a 'thread apply all bt' when that happens?
I hopped into Stefan's environment and was able to reproduce the behavior he described in comment #3 there. When doing a 'virsh start', the code hangs in an open() call while reading the qcow2 disk. Same thing with 'virsh list' and 'virsh destroy' afterwards. The thing is that the open() call eventually returns and, if 'virsh' didn't time out, everything works fine. This is not the same as the original bug, where a specific condition caused a deadlock in Libvirt.

Running 'strace' I confirmed that the open() call takes between 15 and 20 seconds to complete, but no error is thrown. Given that I couldn't reproduce this behavior anywhere else and I don't see a Libvirt/QEMU problem that might be causing it, I'll call it a hardware-related problem (disk/storage, perhaps) that is making the open() calls take too long to complete.

Stefan, feel free to open a new bug if you manage to reproduce the problem in comment #3 in another environment.
The fix for the original problem was accepted upstream:

commit 0bb796bda31103ebf54eefc11c471586c87b95fd (user_mode_memlock_v1)
Author: Daniel Henrique Barboza <danielhb413>
Date:   Thu Oct 1 10:34:24 2020 -0300

    vircommand.c: write child pidfile before process tuning in virExec()

v6.8.0-82-g0bb796bda3

David/Red Hat: this one might be worth a RHEL-AV 8.3 backport due to how easy it is to reproduce and how annoying it is to deal with.
Moving to ASSIGNED as the patch needs to be backported first.
RHEL-AV 8.3.0 backport of upstream commit

commit 0bb796bda31103ebf54eefc11c471586c87b95fd
Author: Daniel Henrique Barboza <danielhb413>
Date:   Thu Oct 1 10:34:24 2020 -0300

    vircommand.c: write child pidfile before process tuning in virExec()

was posted downstream.
Setting upstream keyword per comment 12.
Reproduced with libvirt-6.6.0-6.module+el8.3.0+8125+aefcf088.ppc64le:

# su - test
Last login: Wed Nov 11 02:15:56 EST 2020 on pts/0

Use the vm xml in the description to define the guest.

$ virsh define test.xml
Domain RHEL8.3 defined from test.xml

$ virsh list --all
 Id   Name      State
--------------------------
 -    RHEL8.3   shut off

$ virsh start RHEL8.3
<hang>

After ctrl-c, '$ virsh list --all' also does not return.

***********************************************************

Verified with:
libvirt-6.6.0-8.module+el8.3.1+8648+130818f2.ppc64le
qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.ppc64le
kernel-4.18.0-240.el8.ppc64le

$ virsh start RHEL8.3
error: Failed to start domain RHEL8.3
error: internal error: Process exited prior to exec: libvirt: error : cannot limit locked memory to 79691776: Operation not permitted

The hang no longer occurs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0639