Bug 1882093 - virsh getting stuck when trying to start VM in user mode (libvirt 6.6)
Summary: virsh getting stuck when trying to start VM in user mode (libvirt 6.6)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.3
Hardware: ppc64le
OS: Linux
medium
medium
Target Milestone: rc
: 8.4
Assignee: Daniel Henrique Barboza (IBM)
QA Contact: Dan Zheng
URL:
Whiteboard:
: 1882084 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-09-23 18:49 UTC by Stefan Berger
Modified: 2021-02-22 15:40 UTC (History)
9 users (show)

Fixed In Version: libvirt-6.6.0-8.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-22 15:39:38 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Stefan Berger 2020-09-23 18:49:10 UTC
Description of problem:

[tried on ppc64le]

'virsh start RHEL8.3' gets stuck when run in user mode.

$ virsh dumpxml RHEL8.3
<domain type='kvm'>
  <name>RHEL8.3</name>
  <uuid>8a29b16c-0652-476e-9a91-694e1ad0b0a5</uuid>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='ppc64le' machine='pseries-rhel8.2.0'>hvm</type>
    <bootmenu enable='yes'/>
  </os>
  <features>
    <ccf-assist state='off'/>
    <cfpc value='broken'/>
  </features>
  <cpu mode='host-model' check='none'/>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/RHEL-8.3.0-20200909.1-ppc64le-dvd1.iso'/>
      <backingStore/>
      <target dev='sda' bus='scsi'/>
      <readonly/>
      <boot order='1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/RHEL8.3.qcow2'/>
      <backingStore/>
      <target dev='sdb' bus='scsi'/>
      <boot order='2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <controller type='usb' index='0' model='pci-ohci'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <model name='spapr-pci-host-bridge'/>
      <target index='0'/>
    </controller>
    <controller type='scsi' index='0' model='ibmvscsi'>
      <address type='spapr-vio' reg='0x00002000'/>
    </controller>
    <serial type='pty'>
      <target type='spapr-vio-serial' port='0'>
        <model name='spapr-vty'/>
      </target>
      <address type='spapr-vio' reg='0x30000000'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
      <address type='spapr-vio' reg='0x30000000'/>
    </console>
    <input type='keyboard' bus='usb'>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='usb'>
      <address type='usb' bus='0' port='2'/>
    </input>
    <graphics type='vnc' port='-1' autoport='yes' listen='127.0.0.1'>
      <listen type='address' address='127.0.0.1'/>
    </graphics>
    <video>
      <model type='vga' vram='16384' heads='1' primary='yes'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </memballoon>
    <panic model='pseries'/>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'/>
</domain>

Version-Release number of selected component (if applicable):

$ virsh version
Compiled against library: libvirt 6.6.0
Using library: libvirt 6.6.0
Using API: QEMU 6.6.0
Running hypervisor: QEMU 5.1.0


Steps to Reproduce:
1. virsh define RHEL8.3.xml
2. virsh start RHEL8.3
3.

Actual results:

virsh is now stuck. No QEMU process is running.

Expected results:

QEMU should be running.

Additional info:

stefanb     4868  1.0  0.0  68096 25728 ?        S    13:47   0:00 /usr/sbin/virtlogd --timeout=120
stefanb     4871  0.0  0.0 1605696 28544 ?       Ss   13:47   0:00 /usr/sbin/libvirtd --timeout=120
stefanb     4872  0.0  0.0      0     0 ?        Z    13:47   0:00 [rpc-worker] <defunct>

I need to kill  4871 to be able to get the console prompt back.

Comment 1 Stefan Berger 2020-09-23 19:59:55 UTC
*** Bug 1882084 has been marked as a duplicate of this bug. ***

Comment 2 David Gibson 2020-09-28 05:06:11 UTC
I'll ask Daniel Barboza to take a look at this one.

In the meantime, Stefan, it would help if you can tell us if this is a regression or not.

Comment 3 Stefan Berger 2020-09-29 20:25:37 UTC
I also now noticed that in system mode 'virsh list' when run the first time (after machine start) works well but 'virsh start' then takes a very long time (~35s) and after that 'virsh list --all' also will take a very long time.

Comment 4 Stefan Berger 2020-09-29 20:52:45 UTC
The user mode core dump reveals the following backtrace:

[stefanb@boston-1 ~]$ gdb /tmp/core.6227
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-12.el8
[...]
(gdb) thread apply all bt

Thread 1 (Thread 0x7fff9bbeec80 (LWP 6227)):
#0  0x00007fff9f756ec0 in read () at /lib64/power9/libpthread.so.0
#1  0x00007fffa08cf998 in read (__nbytes=1, __buf=0x7fff9bbed470, __fd=3) at /usr/include/bits/unistd.h:44
#2  0x00007fffa08cf998 in saferead (fd=<optimized out>, buf=0x7fff9bbed470, count=1) at ../../src/util/virfile.c:1075
#3  0x00007fffa08b5e28 in virExec (cmd=0x7fff7402f4d0) at ../../src/util/vircommand.c:780
#4  0x00007fffa08b5e28 in virCommandRunAsync (cmd=0x7fff7402f4d0, pid=<optimized out>) at ../../src/util/vircommand.c:2619
#5  0x00007fffa08b6960 in virCommandRun (cmd=0x7fff7402f4d0, exitstatus=0x0) at ../../src/util/vircommand.c:2445
#6  0x00007fff9109e3e4 in qemuProcessLaunch
    (conn=0x7fff7c00b650, driver=0x7fff4411a670, vm=0x1535e1230, asyncJob=<optimized out>, incoming=0x0, snapshot=0x0, vmop=<optimized out>, flags=<optimized out>) at ../../src/qemu/qemu_process.c:6835
#7  0x00007fff910a3874 in qemuProcessStart
    (conn=0x7fff7c00b650, driver=0x7fff4411a670, vm=0x1535e1230, updatedCPU=0x0, asyncJob=<optimized out>, migrateFrom=<optimized out>, migrateFd=<optimized out>, migratePath=0x0, snapshot=0x0, vmop=VIR_NETDEV_VPORT_PROFILE_OP_CREATE, flags=17) at ../../src/qemu/qemu_process.c:7181
#8  0x00007fff91129ff0 in qemuDomainObjStart
    (conn=0x7fff7c00b650, driver=driver@entry=0x7fff4411a670, vm=0x1535e1230, flags=flags@entry=0, asyncJob=QEMU_ASYNC_JOB_START)
    at ../../src/qemu/qemu_driver.c:7531
#9  0x00007fff9112a778 in qemuDomainCreateWithFlags (dom=0x7fff8000b8b0, flags=<optimized out>) at ../../src/qemu/qemu_driver.c:7582
#10 0x00007fffa0bd7b60 in virDomainCreate (domain=0x7fff8000b8b0) at ../../src/libvirt-domain.c:6531
#11 0x000000011c3bc40c in remoteDispatchDomainCreate (server=0x1535dd8d0, msg=0x15363bf30, args=0x7fff7400ca60, rerr=0x7fff9bbee0c0, client=<optimized out>)
    at ./remote/remote_daemon_dispatch_stubs.h:4894
#12 0x000000011c3bc40c in remoteDispatchDomainCreateHelper
    (server=0x1535dd8d0, client=<optimized out>, msg=0x15363bf30, rerr=0x7fff9bbee0c0, args=0x7fff7400ca60, ret=0x0)
    at ./remote/remote_daemon_dispatch_stubs.h:4873
#13 0x00007fffa0a9acd0 in virNetServerProgramDispatchCall (msg=0x15363bf30, client=0x15365d1f0, server=0x1535dd8d0, prog=0x15363a010)
    at ../../src/rpc/virnetserverprogram.c:430
#14 0x00007fffa0a9acd0 in virNetServerProgramDispatch (prog=0x15363a010, server=0x1535dd8d0, client=0x15365d1f0, msg=0x15363bf30)
    at ../../src/rpc/virnetserverprogram.c:302
#15 0x00007fffa0aa2b60 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x1535dd8d0)
    at ../../src/rpc/virnetserver.c:137
#16 0x00007fffa0aa2b60 in virNetServerHandleJob (jobOpaque=0x15365c9e0, opaque=0x1535dd8d0) at ../../src/rpc/virnetserver.c:154
#17 0x00007fffa0951d2c in virThreadPoolWorker (opaque=<optimized out>) at ../../src/util/virthreadpool.c:163
#18 0x00007fffa0950d28 in virThreadHelper (data=<optimized out>) at ../../src/util/virthread.c:233
#19 0x00007fff9f7487c8 in start_thread () at /lib64/power9/libpthread.so.0
#20 0x00007fff9f5f0508 in clone () at /lib64/power9/libc.so.6
(gdb)

Comment 5 Daniel Henrique Barboza (IBM) 2020-10-01 14:38:09 UTC
I am able to reproduce this problem with RHEL-AV 8.3 Libvirt and
also upstream Libvirt.

The problem is that we're trying to read the pidfile of the child
process before the child writes it, causing the parent process
to hang indefinitely. This results not only in 'virsh start' to
hang, but also every other user-mode 'virsh' command to hang because
the previous libvirtd spawned process is still there stuck in a
loop.

I just posted a patch with a proposed fix:

https://www.redhat.com/archives/libvir-list/2020-October/msg00026.html

Comment 7 Stefan Berger 2020-10-02 14:26:55 UTC
Great!

I have now tried to rebuild the RHAV 8.3 rpm with this one patch to test it (with all the other applied patches). I am running into the following build error using 'rpmbuild -ba ~/rpmbuild/SPECS/libvirt.spec':

[...]
Traceback (most recent call last):
  File "/usr/bin/dtrace", line 445, in <module>
    sys.exit(main())
  File "/usr/bin/dtrace", line 432, in main
    os.remove(fname)
FileNotFoundError: [Errno 2] No such file or directory: '../src/libvirt_probes.o.dtrace-temp.c'
make[3]: *** [Makefile:14920: ../src/libvirt_probes.o] Error 1
make[3]: *** Waiting for unfinished jobs....
[...]

Am I missing a dependency?

Comment 8 Daniel Henrique Barboza (IBM) 2020-10-02 20:23:33 UTC
(In reply to Stefan Berger from comment #7)
> Great!
> 
> I have now tried to rebuild the RHAV 8.3 rpm with this one patch to test it
> (with all the other applied patches). I am running into the following build
> error using 'rpmbuild -ba ~/rpmbuild/SPECS/libvirt.spec':
> 
> [...]
> Traceback (most recent call last):
>   File "/usr/bin/dtrace", line 445, in <module>
>     sys.exit(main())
>   File "/usr/bin/dtrace", line 432, in main
>     os.remove(fname)
> FileNotFoundError: [Errno 2] No such file or directory:
> '../src/libvirt_probes.o.dtrace-temp.c'
> make[3]: *** [Makefile:14920: ../src/libvirt_probes.o] Error 1
> make[3]: *** Waiting for unfinished jobs....
> [...]
> 
> Am I missing a dependency?

I think so. I just managed to create a RHAV-8.3 Libvirt RPM by doing the following,
in a Power9 RHEL 8.3 box:

- installed the srpm package

- installed all the build dependencies using dnf build-dep

- added the patch in the rpmbuild/SOURCES dir

- changed rpmbuild/SPECS/libvirt.spec to add the patch in the build:

  (...)
  Patch55: 0001-vircommand.c-write-child-pidfile-before-process-tuni.patch

- created the RPM with "rpmbuild -ba libvirt.spec" inside the rpmbuild/SPECS dir

Comment 9 Stefan Berger 2020-10-02 21:50:21 UTC
Thanks for building the RPMs. The patch solves the user-mode issue.

Comment 10 Daniel Henrique Barboza (IBM) 2020-10-03 12:44:38 UTC
Stefan,

(In reply to Stefan Berger from comment #3)
> I also now noticed that in system mode 'virsh list' when run the first time
> (after machine start) works well but 'virsh start' then takes a very long
> time (~35s) and after that 'virsh list --all' also will take a very long
> time.

This one I wasn't able to reproduce, even without the patch that fixed
the user-mode problem. What I did:

- rebooted the host

- logged as root and tried a few 'virsh' commands:

[root@ltc-boston115 ~]# 
[root@ltc-boston115 ~]# virsh list
 Id   Name   State
--------------------

[root@ltc-boston115 ~]# virsh list --all
 Id   Name      State
--------------------------
 -    RHEL8.3   shut off
 -    vm1       shut off

[root@ltc-boston115 ~]# virsh start RHEL8.3 
Domain RHEL8.3 started

[root@ltc-boston115 ~]# 
[root@ltc-boston115 ~]# virsh list --all
 Id   Name      State
--------------------------
 1    RHEL8.3   running
 -    vm1       shut off

[root@ltc-boston115 ~]# 
[root@ltc-boston115 ~]# 
[root@ltc-boston115 ~]# 
[root@ltc-boston115 ~]# virsh destroy RHEL8.3 
Domain RHEL8.3 destroyed

[root@ltc-boston115 ~]# virsh list --all
 Id   Name      State
--------------------------
 -    RHEL8.3   shut off
 -    vm1       shut off

[root@ltc-boston115 ~]# 


All of them worked as expected. No huge delays/timeout. I also tried
the same test using a non-root root accessing qemu:///system
(which is considered system mode as well), and haven't seen any problems
either:

[danielhb@ltc-boston115 ~]$ virsh --c "qemu:///system" list 
 Id   Name   State
--------------------

[danielhb@ltc-boston115 ~]$ virsh --c "qemu:///system" list --all
 Id   Name      State
--------------------------
 -    RHEL8.3   shut off
 -    vm1       shut off

[danielhb@ltc-boston115 ~]$ virsh --c "qemu:///system" start RHEL8.3
Domain RHEL8.3 started

[danielhb@ltc-boston115 ~]$ virsh --c "qemu:///system" list --all
 Id   Name      State
--------------------------
 1    RHEL8.3   running
 -    vm1       shut off

[danielhb@ltc-boston115 ~]$ virsh --c "qemu:///system" destroy RHEL8.3
Domain RHEL8.3 destroyed

[danielhb@ltc-boston115 ~]$ virsh --c "qemu:///system" list --all
 Id   Name      State
--------------------------
 -    RHEL8.3   shut off
 -    vm1       shut off

[danielhb@ltc-boston115 ~]$ 



I guess I'll need more info to see what I can do in this case. Can you
snag a 'thread apply all bt' when that happens?

Comment 11 Daniel Henrique Barboza (IBM) 2020-10-07 19:36:02 UTC
I hopped in Stefan's environment and I could reproduce the behavior
he described in comment #3 there. When doing a 'virsh start' the code
hangs around in an open() call when reading the qcow2 disk. Same thing
with 'virsh list' and 'virsh destroy' afterwards.

Problem is that the open() call eventually returns and, if 'virsh' didn't
timed out, everything works fine. This is not the same case as the
original bug where there was a specific condition that caused a deadlock
in Libvirt. Running 'strace' I confirmed that the 'open()' call takes
between 15 and 20 seconds to complete, but there is no error being
thrown or anything.

Given that I couldn't reproduce this behavior anywhere else and I don't
see a Libvirt/QEMU problem that might be causing, I'll be calling it
a hardware related problem (disk/storage perhaps) that is causing the
open() calls to take too long to complete.

Stefan, feel free to open a new bug if you manage to reproduce the
problem in comment #3 in another environment.

Comment 12 Daniel Henrique Barboza (IBM) 2020-10-07 19:42:06 UTC
The fix for the original problem was accepted upstream:


commit 0bb796bda31103ebf54eefc11c471586c87b95fd (user_mode_memlock_v1)
Author: Daniel Henrique Barboza <danielhb413>
Date:   Thu Oct 1 10:34:24 2020 -0300

    vircommand.c: write child pidfile before process tuning in virExec()

v6.8.0-82-g0bb796bda3


David/Red Hat: this one might be worth a RHEL-AV 8.3 backport due to how
easy it is to reproduce and how annoying it is to deal with.

Comment 19 Jiri Denemark 2020-10-27 10:14:54 UTC
Moving to ASSIGNED as the patch needs to be backported first.

Comment 20 Daniel Henrique Barboza (IBM) 2020-10-27 20:37:10 UTC
RHEL-AV 8.3.0 backport of upstream


commit 0bb796bda31103ebf54eefc11c471586c87b95fd
Author: Daniel Henrique Barboza <danielhb413>
Date:   Thu Oct 1 10:34:24 2020 -0300

    vircommand.c: write child pidfile before process tuning in virExec()


was posted downstream.

Comment 22 Michal Privoznik 2020-10-28 10:05:59 UTC
Setting upstream keyword per comment 12.

Comment 25 Dan Zheng 2020-11-11 08:21:25 UTC
Reproduce with
libvirt-6.6.0-6.module+el8.3.0+8125+aefcf088.ppc64le


# su - test
Last login: Wed Nov 11 02:15:56 EST 2020 on pts/0

Use the vm xml in description to define the guest.
$ virsh define test.xml
Domain RHEL8.3 defined from test.xml

$ virsh list --all
 Id   Name      State
--------------------------
 -    RHEL8.3   shut off

$ virsh start RHEL8.3

<hang>

After ctr-c, $ virsh list --all also no return.
***********************************************************
Verified with
libvirt-6.6.0-8.module+el8.3.1+8648+130818f2.ppc64le
qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.ppc64le
kernel-4.18.0-240.el8.ppc64le


$ virsh start RHEL8.3
error: Failed to start domain RHEL8.3
error: internal error: Process exited prior to exec: libvirt:  error : cannot limit locked memory to 79691776: Operation not permitted

No hang happens again.

Comment 27 errata-xmlrpc 2021-02-22 15:39:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0639


Note You need to log in before you can comment on or make changes to this bug.