Bug 1571546

Summary: Fail to start VM sometimes with - libvirtError: internal error: qemu unexpectedly closed the monitor
Product: [oVirt] vdsm
Reporter: Michael Burman <mburman>
Component: Core
Assignee: Dan Kenigsberg <danken>
Status: CLOSED WORKSFORME
QA Contact: meital avital <mavital>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.20.23
CC: bugs, mburman, michal.skrivanek
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-26 11:03:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: logs

Description Michael Burman 2018-04-25 06:11:25 UTC
Created attachment 1426417 [details]
logs

Description of problem:
The VM sometimes fails to start with - libvirtError: internal error: qemu unexpectedly closed the monitor: 2018-04-25T06:04:11.614135Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future

2018-04-25 09:04:12,723+0300 ERROR (vm/2fe659da) [virt.vm] (vmId='2fe659da-d77f-484a-92f7-124261295d3c') The vm start process failed (vm:943)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2872, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: process exited while connecting to monitor: 2018-04-25T06:04:11.614135Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2018-04-25T06:04:11.643328Z qemu-kvm: -device scsi-hd,bus=ua-272d0335-fe27-43d8-957d-176978525641.0,channel=0,scsi-id=0,lun=0,drive=drive-ua-6cec99af-a00f-409c-8817-f007c90ca301,id=ua-6cec99af-a00f-409c-8817-f007c90ca301,bootindex=1: Failed to get "write" lock
Is another process using the image?
2018-04-25 09:04:12,723+0300 INFO  (vm/2fe659da) [virt.vm] (vmId='2fe659da-d77f-484a-92f7-124261295d3c') Changed state to Down: internal error: process exited while connecting to monitor: 2018-04-25T06:04:11.614135Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2018-04-25T06:04:11.643328Z qemu-kvm: -device scsi-hd,bus=ua-272d0335-fe27-43d8-957d-176978525641.0,channel=0,scsi-id=0,lun=0,drive=drive-ua-6cec99af-a00f-409c-8817-f007c90ca301,id=ua-6cec99af-a00f-409c-8817-f007c90ca301,bootindex=1: Failed to get "write" lock
Is another process using the image? (code=1) (vm:1683)

I have a VM that is suddenly failing to start with the error above. I'm running all of the latest versions.
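
For reference, the failing call in the traceback is libvirt's virDomainCreateWithFlags(). A minimal standalone sketch of that code path (outside vdsm), assuming a hypothetical domain XML file and the default flags, would look roughly like this:

    import libvirt

    # Hypothetical reproduction of the code path from the traceback above;
    # vdsm builds the domain XML itself and uses its own connection wrapper.
    conn = libvirt.open('qemu:///system')
    with open('/tmp/vm.xml') as f:     # hypothetical path to the domain XML
        dom_xml = f.read()

    dom = conn.defineXML(dom_xml)
    try:
        dom.createWithFlags(0)         # the call that raises libvirtError here;
                                       # qemu's stderr (e.g. 'Failed to get "write" lock')
                                       # ends up embedded in the exception text
    except libvirt.libvirtError as err:
        print(err)
    finally:
        conn.close()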

Version-Release number of selected component (if applicable):
4.2.3.2-0.1.el7
vdsm-4.20.26-1.el7ev.x86_64
libvirt-daemon-3.9.0-14.el7_5.3.x86_64
libvirt-client-3.9.0-14.el7_5.3.x86_64

How reproducible:
Sometimes; I have no idea what causes this.

Steps to Reproduce:
1. Try to start VM

Actual results:
Sometimes fails with: qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future

Comment 1 Michal Skrivanek 2018-04-26 05:33:02 UTC
Which storage do you use for that disk?
Is it just this VM?
Does it happen just once, and the next attempt to run it succeeds?
When was the previous run of that VM - was it on a different host, or was there some other storage action performed?
Do you have other VMs working fine with disks on the same storage?

Comment 2 Michael Burman 2018-04-26 08:18:59 UTC
(In reply to Michal Skrivanek from comment #1)
> Which storage do you use for that disk?
> Is it just this VM?
> Does it happen just once, and the next attempt to run it succeeds?
> When was the previous run of that VM - was it on a different host, or was
> there some other storage action performed?
> Do you have other VMs working fine with disks on the same storage?

I use NFS storage.
No, subsequent attempts fail with the same error, and it is only this VM.
This VM was running on the host whose vdsm died after the upgrade, see BZ 1570562.
And after that, this VM can't run on any host in the cluster.
This VM used to run on all hosts in the cluster without any issues.
Yes, I have other VMs running on this storage.

Comment 3 Michal Skrivanek 2018-04-26 10:20:28 UTC
Hi Michael,
was the VM running *while* vdsm was upgraded? Was it since the upgrade mentioned in bz 1570562 - to 4.20.26, but from which version? Was there any OS package update too, or just vdsm?
That host probably still holds the lock to that image - does it run any VMs right now? Can you list the locks it still holds by running "lslocks" there?

Thanks,
michal

Comment 4 Michael Burman 2018-04-26 10:39:50 UTC
(In reply to Michal Skrivanek from comment #3)
> Hi Michael,
> was the VM running *while* vdsm was upgraded? Was it since the upgrade
> mentioned in bz 1570562 - to 4.20.26, but from which version? Was there
> any OS package update too, or just vdsm?
> That host probably still holds the lock to that image - does it run any VMs
> right now? Can you list the locks it still holds by running "lslocks"
> there?
> 
> Thanks,
> michal

Yes, in this specific upgrade the VM was running.
From 4.20.25 -> 4.20.26.
Yes, there was a libvirt update which came from the 4.2.3 repo as well.
Yes, it does run other VMs; I started a new VM on it as you asked.
The buggy VM can't run on any other host in the cluster either, not only on this specific host.

lslocks from the host on which vdsm died after the upgrade -

COMMAND           PID   TYPE SIZE MODE  M START        END PATH
multipathd        521  POSIX   3B WRITE 0     0          0 /run/multipathd/multipathd.pid
ovsdb-server      882  POSIX   4B WRITE 0     0          0 /run/openvswitch/ovsdb-server.pid
ovsdb-server      882  POSIX   0B WRITE 0     0          0 /etc/openvswitch/.conf.db.~lock~
ovn-controller   1064  POSIX   5B WRITE 0     0          0 /run/openvswitch/ovn-controller.pid
python           1802  FLOCK   4B WRITE 0     0          0 /run/goferd.pid
crond            1910  FLOCK   5B WRITE 0     0          0 /run/crond.pid
atd              1913  POSIX   5B WRITE 0     0          0 /run/atd.pid
(unknown)          -1 OFDLCK   0B READ  0   100        101 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_RHV__NFS_mburman_HE__Over__bond2
abrtd             659  POSIX   4B WRITE 0     0          0 /run/abrt/abrtd.pid
sanlock           669  POSIX   4B WRITE 0     0          0 /run/sanlock/sanlock.pid
wdmd              704  POSIX   4B WRITE 0     0          0 /run/wdmd/wdmd.pid
ovs-vswitchd      993  POSIX   4B WRITE 0     0          0 /run/openvswitch/ovs-vswitchd.pid
rhsmcertd        1800  FLOCK   0B WRITE 0     0          0 /run/lock/subsys/rhsmcertd
iscsid           1817  POSIX   5B WRITE 0     0          0 /run/iscsid.pid
supervdsmd       1805  FLOCK   0B WRITE 0     0          0 /run/vdsm/supervdsmd.lock
libvirtd         1907  POSIX   4B WRITE 0     0          0 /run/libvirtd.pid
master           2448  FLOCK  33B WRITE 0     0          0 /var/spool/postfix/pid/master.pid
master           2448  FLOCK  33B WRITE 0     0          0 /var/lib/postfix/master.lock
vdsmd           28252  FLOCK   0B WRITE 0     0          0 /run/vdsm/vdsmd.lock
virtlogd         2656  POSIX   4B WRITE 0     0          0 /run/virtlogd.pid
(unknown)          -1 OFDLCK   0B READ  0   201        201 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_RHV__NFS_mburman_HE__Over__bond2

lslocks from the second host in the cluster

COMMAND           PID   TYPE SIZE MODE  M START        END PATH
multipathd      14408  POSIX   5B WRITE 0     0          0 /run/multipathd/multipathd.pid
vdsmd           27943  FLOCK   0B WRITE 0     0          0 /run/vdsm/vdsmd.lock
(unknown)          -1 OFDLCK   0B READ  0   100        101 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_RHV__NFS_mburman_HE__Over__bond2
(unknown)          -1 OFDLCK   0B READ  0   201        201 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_RHV__NFS_mburman_HE__Over__bond2
(unknown)          -1 OFDLCK   0B READ  0   100        101 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_RHV__NFS_mburman_HE__Over__bond2
(unknown)          -1 OFDLCK   0B READ  0   201        201 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_RHV__NFS_mburman_HE__Over__bond2
libvirtd        14291  POSIX   5B WRITE 0     0          0 /run/libvirtd.pid
ovsdb-server    15660  POSIX   6B WRITE 0     0          0 /run/openvswitch/ovsdb-server.pid
ovsdb-server    15660  POSIX   0B WRITE 0     0          0 /etc/openvswitch/.conf.db.~lock~
ovs-vswitchd    15711  POSIX   6B WRITE 0     0          0 /run/openvswitch/ovs-vswitchd.pid
master           2714  FLOCK  33B WRITE 0     0          0 /var/spool/postfix/pid/master.pid
master           2714  FLOCK  33B WRITE 0     0          0 /var/lib/postfix/master.lock
ovn-controller  15785  POSIX   6B WRITE 0     0          0 /run/openvswitch/ovn-controller.pid
rhsmcertd        1159  FLOCK   0B WRITE 0     0          0 /run/lock/subsys/rhsmcertd
sanlock         14380  POSIX   6B WRITE 0     0          0 /run/sanlock/sanlock.pid
virtlogd        28857  POSIX   5B WRITE 0     0          0 /run/virtlogd.pid
iscsid          12427  POSIX   6B WRITE 0     0          0 /run/iscsid.pid
python          13686  FLOCK   5B WRITE 0     0          0 /run/goferd.pid
(unknown)          -1 OFDLCK   0B READ  0   201        201 /run
(unknown)          -1 OFDLCK   0B READ  0   203        203 /run
supervdsmd      18461  FLOCK   0B WRITE 0     0          0 /run/vdsm/supervdsmd.lock
(unknown)          -1 OFDLCK   0B READ  0   100        100 /run
crond             797  FLOCK   4B WRITE 0     0          0 /run/crond.pid
wdmd            14404  POSIX   6B WRITE 0     0          0 /run/wdmd/wdmd.pid
atd               798  POSIX   4B WRITE 0     0          0 /run/atd.pid
abrtd             758  POSIX   4B WRITE 0     0          0 /run/abrt/abrtd.pid

Comment 5 Michael Burman 2018-04-26 10:43:54 UTC
Sorry, ignore the lslocks output above, it's from another env :(
I confused it with BZ 1571796, which happened on my second env.

Here is the lslocks from the correct host

[root@red-vds2 ~]# lslocks 
COMMAND           PID   TYPE SIZE MODE  M START        END PATH
sanlock           724  POSIX   4B WRITE 0     0          0 /run/sanlock/sanlock.pid
wdmd              742  POSIX   4B WRITE 0     0          0 /run/wdmd/wdmd.pid
abrtd             719  POSIX   4B WRITE 0     0          0 /run/abrt/abrtd.pid
ovs-vswitchd      994  POSIX   4B WRITE 0     0          0 /run/openvswitch/ovs-vswitchd.pid
iscsid           1806  POSIX   5B WRITE 0     0          0 /run/iscsid.pid
crond            1863  FLOCK   5B WRITE 0     0          0 /run/crond.pid
master           2244  FLOCK  33B WRITE 0     0          0 /var/spool/postfix/pid/master.pid
master           2244  FLOCK  33B WRITE 0     0          0 /var/lib/postfix/master.lock
virtlogd        14223  POSIX   5B WRITE 0     0          0 /run/virtlogd.pid
libvirtd        18248  POSIX   5B WRITE 0     0          0 /run/libvirtd.pid
supervdsmd      19285  FLOCK   0B WRITE 0     0          0 /run/vdsm/supervdsmd.lock
vdsmd           19444  FLOCK   0B WRITE 0     0          0 /run/vdsm/vdsmd.lock
multipathd        545  POSIX   3B WRITE 0     0          0 /run/multipathd/multipathd.pid
ovsdb-server      905  POSIX   4B WRITE 0     0          0 /run/openvswitch/ovsdb-server.pid
ovsdb-server      905  POSIX   0B WRITE 0     0          0 /etc/openvswitch/.conf.db.~lock~
ovn-controller   1068  POSIX   5B WRITE 0     0          0 /run/openvswitch/ovn-controller.pid
rhsmcertd        1799  FLOCK   0B WRITE 0     0          0 /run/lock/subsys/rhsmcertd
atd              1862  POSIX   5B WRITE 0     0          0 /run/atd.pid
python           1809  FLOCK   4B WRITE 0     0          0 /run/goferd.pid
(unknown)          -1 OFDLCK   0B READ  0   100        101 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_RHV__NFS_mburman_HE__Over__bond3

But now the VM can run, after I re-installed vdsm.

Comment 6 Michal Skrivanek 2018-04-26 11:03:49 UTC
It likely happened due to the crash. In RHEL 7.5, QEMU implemented its own file locking mechanism, so whenever you see the same problem you can check which process holds the lock - and on NFS shared storage no other host would be able to start that VM, because they all can "see" the lock as taken.

If you ever see "Failed to get "write" lock", please take a look at which disk it is (the device name matches the VM XML a few lines above that error in vdsm.log); the file path should be locked by someone (check with lslocks). It can be either vdsm, when it crashes during some storage manipulation operation, or a stuck QEMU process on the same or any other host.
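
As a concrete illustration of that check, a minimal Python sketch (the paths below are placeholders - take the real image path, or the storage domain mount point, from the -device/-drive line near the error in vdsm.log) that filters lslocks output for the image could look like this:

    import subprocess

    # Placeholder paths - substitute the real ones; for OFD locks lslocks may
    # only resolve the mount point rather than the full image file path.
    PATHS_OF_INTEREST = (
        '/rhev/data-center/mnt/<storage_mount>',
        '/rhev/data-center/mnt/<storage_mount>/<sd_uuid>/images/<img_uuid>',
    )

    out = subprocess.check_output(
        ['lslocks', '--noheadings', '--notruncate',
         '--output', 'COMMAND,PID,TYPE,MODE,PATH'])
    for line in out.decode().splitlines():
        if any(p in line for p in PATHS_OF_INTEREST):
            # Any match means some process (or a leftover OFD lock, shown with
            # PID -1) still holds a lock there, which is what makes QEMU fail
            # with 'Failed to get "write" lock'.
            print(line)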

Since you say it was fixed when you reinstalled vdsm, it seems it was the vdsm process, but it's not possible to confirm that anymore.
Please reopen when it happens again.

Also, it's not supported to run VMs on a host while upgrading it; hosts are supposed to be in maintenance - we're trying to enforce this at least for major upgrades, in bug 1570349. But that won't prevent a manual error in any other flow.

Comment 7 Michael Burman 2018-04-26 11:40:31 UTC
(In reply to Michal Skrivanek from comment #6)
> It likely happened due to the crash. In RHEL 7.5, QEMU implemented its own
> file locking mechanism, so whenever you see the same problem you can check
> which process holds the lock - and on NFS shared storage no other host would
> be able to start that VM, because they all can "see" the lock as taken.
> 
> If you ever see "Failed to get "write" lock", please take a look at which
> disk it is (the device name matches the VM XML a few lines above that error
> in vdsm.log); the file path should be locked by someone (check with lslocks).
> It can be either vdsm, when it crashes during some storage manipulation
> operation, or a stuck QEMU process on the same or any other host.
> 
> Since you say it was fixed when you reinstalled vdsm, it seems it was the
> vdsm process, but it's not possible to confirm that anymore.
> Please reopen when it happens again.
> 
> Also, it's not supported to run VMs on a host while upgrading it; hosts are
> supposed to be in maintenance - we're trying to enforce this at least for
> major upgrades, in bug 1570349. But that won't prevent a manual error in
> any other flow.

OK Michal, thanks.
I know it's not supported, but this flow is part of the QE tier 4 manual testing, which I now execute often.