Bug 1571546
Summary: Fail to start VM sometimes with - libvirtError: internal error: qemu unexpectedly closed the monitor

Product:           [oVirt] vdsm
Component:         Core
Version:           4.20.23
Hardware:          x86_64
OS:                Linux
Status:            CLOSED WORKSFORME
Severity:          high
Priority:          unspecified
Reporter:          Michael Burman <mburman>
Assignee:          Dan Kenigsberg <danken>
QA Contact:        meital avital <mavital>
Docs Contact:
CC:                bugs, mburman, michal.skrivanek
Target Milestone:  ---
Target Release:    ---
Whiteboard:
Fixed In Version:
Doc Type:          If docs needed, set a value
Doc Text:
Story Points:      ---
Clone Of:
Environment:
Last Closed:       2018-04-26 11:03:49 UTC
Type:              Bug
Regression:        ---
Mount Type:        ---
Documentation:     ---
CRM:
Verified Versions:
Category:          ---
oVirt Team:        Virt
Cloudforms Team:   ---
Target Upstream Version:
Embargoed:
Attachments:       logs (attachment 1426417)
Which storage do you use for that disk?
Is it just this VM?
Does it happen just once, and does the next attempt to run it succeed?
When was the previous run of that VM, was it on a different host or was there some other storage action performed?
Do you have other VMs working fine with disks on the same storage?

(In reply to Michal Skrivanek from comment #1)
> Which storage do you use for that disk?
> Is it just this VM?
> Does it happen just once, and does the next attempt to run it succeed?
> When was the previous run of that VM, was it on a different host or was
> there some other storage action performed?
> Do you have other VMs working fine with disks on the same storage?

I use NFS storage.
No, the next attempts fail with the same error, and only this VM is affected.
This VM was running on a host whose vdsm died after an upgrade, see BZ 1570562, and since then this VM can't run on any host in the cluster. It used to run on all hosts in the cluster without any issues.
Yes, I have other VMs running on this storage.

Hi Michael,
the VM was running *while* vdsm was upgraded? Was it since the upgrade mentioned in bz 1570562 - to 4.20.26, but from which version? Was there any OS package update too, or just vdsm?
That host probably still holds the lock on that image - does it run any VMs right now? Can you list the locks it still holds by running "lslocks" there?

Thanks,
michal

(In reply to Michal Skrivanek from comment #3)
> Hi Michael,
> the VM was running *while* vdsm was upgraded? Was it since the upgrade
> mentioned in bz 1570562 - to 4.20.26, but from which version? Was there
> any OS package update too, or just vdsm?
> That host probably still holds the lock on that image - does it run any
> VMs right now? Can you list the locks it still holds by running "lslocks"
> there?
>
> Thanks,
> michal

Yes, in this specific upgrade the VM was running; the upgrade was from 4.20.25 to 4.20.26.
Yes, there was also a libvirt update, which came from the 4.2.3 repo.
Yes, it does run other VMs; I started a new VM on it as you asked.
The buggy VM can't run on any other host in the cluster either, not only on this specific host.
lslocks from the host on which vdsm died after the upgrade:

COMMAND PID TYPE SIZE MODE M START END PATH
multipathd 521 POSIX 3B WRITE 0 0 0 /run/multipathd/multipathd.pid
ovsdb-server 882 POSIX 4B WRITE 0 0 0 /run/openvswitch/ovsdb-server.pid
ovsdb-server 882 POSIX 0B WRITE 0 0 0 /etc/openvswitch/.conf.db.~lock~
ovn-controller 1064 POSIX 5B WRITE 0 0 0 /run/openvswitch/ovn-controller.pid
python 1802 FLOCK 4B WRITE 0 0 0 /run/goferd.pid
crond 1910 FLOCK 5B WRITE 0 0 0 /run/crond.pid
atd 1913 POSIX 5B WRITE 0 0 0 /run/atd.pid
(unknown) -1 OFDLCK 0B READ 0 100 101 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_RHV__NFS_mburman_HE__Over__bond2
abrtd 659 POSIX 4B WRITE 0 0 0 /run/abrt/abrtd.pid
sanlock 669 POSIX 4B WRITE 0 0 0 /run/sanlock/sanlock.pid
wdmd 704 POSIX 4B WRITE 0 0 0 /run/wdmd/wdmd.pid
ovs-vswitchd 993 POSIX 4B WRITE 0 0 0 /run/openvswitch/ovs-vswitchd.pid
rhsmcertd 1800 FLOCK 0B WRITE 0 0 0 /run/lock/subsys/rhsmcertd
iscsid 1817 POSIX 5B WRITE 0 0 0 /run/iscsid.pid
supervdsmd 1805 FLOCK 0B WRITE 0 0 0 /run/vdsm/supervdsmd.lock
libvirtd 1907 POSIX 4B WRITE 0 0 0 /run/libvirtd.pid
master 2448 FLOCK 33B WRITE 0 0 0 /var/spool/postfix/pid/master.pid
master 2448 FLOCK 33B WRITE 0 0 0 /var/lib/postfix/master.lock
vdsmd 28252 FLOCK 0B WRITE 0 0 0 /run/vdsm/vdsmd.lock
virtlogd 2656 POSIX 4B WRITE 0 0 0 /run/virtlogd.pid
(unknown) -1 OFDLCK 0B READ 0 201 201 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_RHV__NFS_mburman_HE__Over__bond2

lslocks from the second host in the cluster:

COMMAND PID TYPE SIZE MODE M START END PATH
multipathd 14408 POSIX 5B WRITE 0 0 0 /run/multipathd/multipathd.pid
vdsmd 27943 FLOCK 0B WRITE 0 0 0 /run/vdsm/vdsmd.lock
(unknown) -1 OFDLCK 0B READ 0 100 101 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_RHV__NFS_mburman_HE__Over__bond2
(unknown) -1 OFDLCK 0B READ 0 201 201 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_RHV__NFS_mburman_HE__Over__bond2
(unknown) -1 OFDLCK 0B READ 0 100 101 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_RHV__NFS_mburman_HE__Over__bond2
(unknown) -1 OFDLCK 0B READ 0 201 201 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_RHV__NFS_mburman_HE__Over__bond2
libvirtd 14291 POSIX 5B WRITE 0 0 0 /run/libvirtd.pid
ovsdb-server 15660 POSIX 6B WRITE 0 0 0 /run/openvswitch/ovsdb-server.pid
ovsdb-server 15660 POSIX 0B WRITE 0 0 0 /etc/openvswitch/.conf.db.~lock~
ovs-vswitchd 15711 POSIX 6B WRITE 0 0 0 /run/openvswitch/ovs-vswitchd.pid
master 2714 FLOCK 33B WRITE 0 0 0 /var/spool/postfix/pid/master.pid
master 2714 FLOCK 33B WRITE 0 0 0 /var/lib/postfix/master.lock
ovn-controller 15785 POSIX 6B WRITE 0 0 0 /run/openvswitch/ovn-controller.pid
rhsmcertd 1159 FLOCK 0B WRITE 0 0 0 /run/lock/subsys/rhsmcertd
sanlock 14380 POSIX 6B WRITE 0 0 0 /run/sanlock/sanlock.pid
virtlogd 28857 POSIX 5B WRITE 0 0 0 /run/virtlogd.pid
iscsid 12427 POSIX 6B WRITE 0 0 0 /run/iscsid.pid
python 13686 FLOCK 5B WRITE 0 0 0 /run/goferd.pid
(unknown) -1 OFDLCK 0B READ 0 201 201 /run
(unknown) -1 OFDLCK 0B READ 0 203 203 /run
supervdsmd 18461 FLOCK 0B WRITE 0 0 0 /run/vdsm/supervdsmd.lock
(unknown) -1 OFDLCK 0B READ 0 100 100 /run
crond 797 FLOCK 4B WRITE 0 0 0 /run/crond.pid
wdmd 14404 POSIX 6B WRITE 0 0 0 /run/wdmd/wdmd.pid
atd 798 POSIX 4B WRITE 0 0 0 /run/atd.pid
abrtd 758 POSIX 4B WRITE 0 0 0 /run/abrt/abrtd.pid

Sorry, ignore the lslocks output above, it's from another env :( I confused it with BZ 1571796, which happened on my second env.
Here is the lslocks from the correct host:

[root@red-vds2 ~]# lslocks
COMMAND PID TYPE SIZE MODE M START END PATH
sanlock 724 POSIX 4B WRITE 0 0 0 /run/sanlock/sanlock.pid
wdmd 742 POSIX 4B WRITE 0 0 0 /run/wdmd/wdmd.pid
abrtd 719 POSIX 4B WRITE 0 0 0 /run/abrt/abrtd.pid
ovs-vswitchd 994 POSIX 4B WRITE 0 0 0 /run/openvswitch/ovs-vswitchd.pid
iscsid 1806 POSIX 5B WRITE 0 0 0 /run/iscsid.pid
crond 1863 FLOCK 5B WRITE 0 0 0 /run/crond.pid
master 2244 FLOCK 33B WRITE 0 0 0 /var/spool/postfix/pid/master.pid
master 2244 FLOCK 33B WRITE 0 0 0 /var/lib/postfix/master.lock
virtlogd 14223 POSIX 5B WRITE 0 0 0 /run/virtlogd.pid
libvirtd 18248 POSIX 5B WRITE 0 0 0 /run/libvirtd.pid
supervdsmd 19285 FLOCK 0B WRITE 0 0 0 /run/vdsm/supervdsmd.lock
vdsmd 19444 FLOCK 0B WRITE 0 0 0 /run/vdsm/vdsmd.lock
multipathd 545 POSIX 3B WRITE 0 0 0 /run/multipathd/multipathd.pid
ovsdb-server 905 POSIX 4B WRITE 0 0 0 /run/openvswitch/ovsdb-server.pid
ovsdb-server 905 POSIX 0B WRITE 0 0 0 /etc/openvswitch/.conf.db.~lock~
ovn-controller 1068 POSIX 5B WRITE 0 0 0 /run/openvswitch/ovn-controller.pid
rhsmcertd 1799 FLOCK 0B WRITE 0 0 0 /run/lock/subsys/rhsmcertd
atd 1862 POSIX 5B WRITE 0 0 0 /run/atd.pid
python 1809 FLOCK 4B WRITE 0 0 0 /run/goferd.pid
(unknown) -1 OFDLCK 0B READ 0 100 101 /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_RHV__NFS_mburman_HE__Over__bond3

But now the VM can run, after I reinstalled vdsm.

It likely happened due to the crash. In RHEL 7.5 QEMU implemented its own file locking mechanism, so whenever you see the same problem you can check which process holds the lock - and on NFS shared storage no other host would be able to start that VM, because they all can "see" the lock as taken.

If you ever see "Failed to get "write" lock", please take a look which disk it is (the device name matches the VM xml a few lines above that error in vdsm.log); the file path should be locked by someone (check with lslocks). It can be either vdsm, when it crashes during some storage manipulation operation, or a stuck QEMU process on the same or any other host.

Since you say it was fixed when you reinstalled vdsm, it seems it was the vdsm process, but it's not possible to confirm that anymore. Please reopen when it happens again.

Also, it's not supported to run VMs on the host during an upgrade; hosts are supposed to be in maintenance - we're trying to enforce this at least for major upgrades, in bug 1570349. But that won't solve a manual error in any other flow.

(In reply to Michal Skrivanek from comment #6)
> It likely happened due to the crash. In RHEL 7.5 QEMU implemented its own
> file locking mechanism, so whenever you see the same problem you can check
> which process holds the lock - and on NFS shared storage no other host
> would be able to start that VM, because they all can "see" the lock as
> taken.
>
> If you ever see "Failed to get "write" lock", please take a look which
> disk it is (the device name matches the VM xml a few lines above that
> error in vdsm.log); the file path should be locked by someone (check with
> lslocks). It can be either vdsm, when it crashes during some storage
> manipulation operation, or a stuck QEMU process on the same or any other
> host.
>
> Since you say it was fixed when you reinstalled vdsm, it seems it was the
> vdsm process, but it's not possible to confirm that anymore. Please reopen
> when it happens again.
>
> Also, it's not supported to run VMs on the host during an upgrade; hosts
> are supposed to be in maintenance - we're trying to enforce this at least
> for major upgrades, in bug 1570349. But that won't solve a manual error in
> any other flow.

Ok Michal, thanks. I know it's not supported, but this flow is part of the QE tier4 manual testing which I now execute often.
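To make the triage steps from comment #6 concrete, here is a minimal shell sketch. The image path below is hypothetical (take the real one from the VM xml a few lines above the error in vdsm.log), and the qemu-img probe assumes the QEMU image-locking behaviour introduced in RHEL 7.5:

  # Hypothetical disk image path, copied from the VM xml in vdsm.log
  IMAGE=/rhev/data-center/mnt/server:_export/sd-uuid/images/img-uuid/vol-uuid

  # List locks on the image; QEMU takes OFD locks, and over NFS the holder
  # may be reported only as "(unknown)" with PID -1, as in the output above
  lslocks | grep -F "$IMAGE"

  # qemu-img respects the same lock: a plain probe fails while the image is
  # write-locked, and -U (--force-share) skips the lock acquisition
  qemu-img info "$IMAGE"
  qemu-img info -U "$IMAGE"

  # Look for a stale qemu-kvm (or vdsm) process on every host that mounts
  # the same export
  pgrep -af qemu-kvm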
Created attachment 1426417 [details]
logs

Description of problem:
Fail to start VM sometimes with - libvirtError: internal error: qemu unexpectedly closed the monitor:

2018-04-25T06:04:11.614135Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2018-04-25 09:04:12,723+0300 ERROR (vm/2fe659da) [virt.vm] (vmId='2fe659da-d77f-484a-92f7-124261295d3c') The vm start process failed (vm:943)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2872, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: process exited while connecting to monitor: 2018-04-25T06:04:11.614135Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2018-04-25T06:04:11.643328Z qemu-kvm: -device scsi-hd,bus=ua-272d0335-fe27-43d8-957d-176978525641.0,channel=0,scsi-id=0,lun=0,drive=drive-ua-6cec99af-a00f-409c-8817-f007c90ca301,id=ua-6cec99af-a00f-409c-8817-f007c90ca301,bootindex=1: Failed to get "write" lock
Is another process using the image?

2018-04-25 09:04:12,723+0300 INFO (vm/2fe659da) [virt.vm] (vmId='2fe659da-d77f-484a-92f7-124261295d3c') Changed state to Down: internal error: process exited while connecting to monitor: 2018-04-25T06:04:11.614135Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2018-04-25T06:04:11.643328Z qemu-kvm: -device scsi-hd,bus=ua-272d0335-fe27-43d8-957d-176978525641.0,channel=0,scsi-id=0,lun=0,drive=drive-ua-6cec99af-a00f-409c-8817-f007c90ca301,id=ua-6cec99af-a00f-409c-8817-f007c90ca301,bootindex=1: Failed to get "write" lock
Is another process using the image? (code=1) (vm:1683)

I have a VM that is suddenly failing to start with the error above. I'm running all the latest versions.

Version-Release number of selected component (if applicable):
4.2.3.2-0.1.el7
vdsm-4.20.26-1.el7ev.x86_64
libvirt-daemon-3.9.0-14.el7_5.3.x86_64
libvirt-client-3.9.0-14.el7_5.3.x86_64

How reproducible:
Sometimes, have no idea what causes this

Steps to Reproduce:
1. Try to start VM

Actual results:
Sometimes fails with:
qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
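For anyone reproducing this later, a short sketch of where the error surfaces on the host (these are the stock RHEL/oVirt log locations; the per-domain log file name varies with the VM name):

  # vdsm.log carries the traceback and the QEMU stderr quoted above
  grep -n 'Failed to get "write" lock' /var/log/vdsm/vdsm.log

  # libvirt keeps the same QEMU stderr in the per-domain log
  grep -n 'Failed to get "write" lock' /var/log/libvirt/qemu/*.log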