Bug 1576306

Summary: Wake up from hibernation failed: internal error: unable to execute QEMU command 'cont': Failed to get "write" lock.

Product: [oVirt] ovirt-engine
Component: BLL.Storage
Version: 4.2.3.2
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Reporter: Israel Pinto <ipinto>
Assignee: Tal Nisan <tnisan>
QA Contact: Elad <ebenahar>
Docs Contact:
CC: bugs, famz, ipinto, kwolf, tnisan
Keywords: Automation
Target Milestone: ---
Target Release: ---
Flags: ipinto: planning_ack?
       ipinto: devel_ack?
       ipinto: testing_ack?
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-05-14 14:46:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  host-var/run logs (flags: none)
  host-var/log logs (flags: none)
  engine log (flags: none)
  last engine log (flags: none)
  logs_13_5_18 (flags: none)

Description Israel Pinto 2018-05-09 08:09:04 UTC
Description of problem:
Failed to start VM with hot-plug memory after suspension. 

Version-Release number of selected component (if applicable):
Engine Version: 4.2.3.5-0.1.el7
Host:
  OS Version: RHEL - 7.5 - 8.el7
  Kernel Version: 3.10.0 - 862.el7.x86_64
  KVM Version: 2.10.0 - 21.el7_5.2
  LIBVIRT Version: libvirt-3.9.0-14.el7_5.3
  VDSM Version: vdsm-4.20.27.1-1.el7ev

Steps to Reproduce:
1. Start VM and hotplug memory to VM
2. Suspend VM 
3. Resume VM
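
For reference, the reproduction flow can be scripted with the oVirt Python SDK (ovirtsdk4). The sketch below is only illustrative: the engine URL, credentials, polling interval, and the 2 GiB target memory are assumptions; only the VM name (memory_hotplug_test) is taken from this report.

    # Hedged sketch of the reproduction flow via ovirtsdk4.
    # Engine URL, credentials, and memory sizes below are hypothetical.
    import time
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',  # placeholder engine
        username='admin@internal',
        password='password',
        insecure=True,
    )
    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search='name=memory_hotplug_test')[0]
    vm_service = vms_service.vm_service(vm.id)

    vm_service.start()                              # step 1: start the VM
    while vms_service.list(search='name=memory_hotplug_test')[0].status != types.VmStatus.UP:
        time.sleep(5)                               # wait until the VM is UP
    vm_service.update(types.Vm(memory=2 * 2**30))   # step 1: hot-plug memory (grow to 2 GiB)
    vm_service.suspend()                            # step 2: suspend (hibernate to disk)
    vm_service.start()                              # step 3: resume from hibernation
    connection.close()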

Actual results:
Failed to start VM


Additional info:
Connection to NFS is ok
[root@master-vds10 ~]# ping yellow-vdsb.qa.lab.tlv.redhat.com
PING yellow-vdsb.qa.lab.tlv.redhat.com (10.35.80.5) 56(84) bytes of data.
64 bytes from yellow-vdsb.qa.lab.tlv.redhat.com (10.35.80.5): icmp_seq=1 ttl=63 time=0.160 ms
64 bytes from yellow-vdsb.qa.lab.tlv.redhat.com (10.35.80.5): icmp_seq=2 ttl=63 time=0.143 ms

Engine log:
2018-05-09 09:38:56,484+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-2) [] add VM '0c85e9db-fbb8-4cad-9d6b-4482024dc762'(memory_hotplug_test) to rerun treatment
2018-05-09 09:38:56,491+03 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (ForkJoinPool-1-worker-2) [] Rerun VM '0c85e9db-fbb8-4cad-9d6b-4482024dc762'. Called from VDS 'host_mixed_1'
2018-05-09 09:38:56,514+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-21315) [] EVENT_ID: USER_INITIATED_RUN_VM_FAILED(151), Failed to run VM memory_hotplug_test on Host host_mixed_1.
...
2018-05-09 09:38:59,552+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-4) [] VM '0c85e9db-fbb8-4cad-9d6b-4482024dc762'(memory_hotplug_test) moved from 'WaitForLaunch' --> 'Down'
2018-05-09 09:38:59,582+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-4) [] EVENT_ID: VM_DOWN_ERROR(119), VM memory_hotplug_test is down with error. Exit message: internal error: process exited while connecting to monitor: 2018-05-09T06:38:58.467969Z qemu-kvm: -drive file=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-he-1_nfs__0/4bf1f6d3-b754-414e-85bc-78584a7f6b5d/images/877990dc-870f-4cff-9e23-3b7d3ed45cbe/2df5d314-f3b9-42fb-8984-4bfe8b22c3b8,format=qcow2,if=none,id=drive-ua-877990dc-870f-4cff-9e23-3b7d3ed45cbe,serial=877990dc-870f-4cff-9e23-3b7d3ed45cbe,cache=none,werror=stop,rerror=stop,aio=threads: Failed to get "write" lock Is another process using the image?.


Vdsm log:
2018-05-09 09:38:54,837+0300 ERROR (vm/0c85e9db) [virt.vm] (vmId='0c85e9db-fbb8-4cad-9d6b-4482024dc762') The vm start process failed (vm:943)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 887, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4067, in _completeIncomingMigration
    self.cont()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1525, in cont
    self._underlyingCont()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4189, in _underlyingCont
    self._dom.resume()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1977, in resume
    if ret == -1: raise libvirtError ('virDomainResume() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'cont': Failed to get "write" lock
2018-05-09 09:38:54,838+0300 INFO  (vm/0c85e9db) [virt.vm] (vmId='0c85e9db-fbb8-4cad-9d6b-4482024dc762') Changed state to Down: internal error: unable to execute QEMU command 'cont': Failed to get "write" lock (code=1) (vm:1683)

Comment 1 Israel Pinto 2018-05-09 08:16:15 UTC
Created attachment 1433607 [details]
host-var/run logs

Comment 2 Israel Pinto 2018-05-09 08:16:50 UTC
Created attachment 1433608 [details]
host-var/log logs

Comment 3 Israel Pinto 2018-05-09 08:17:31 UTC
Created attachment 1433609 [details]
engine log

Comment 4 Israel Pinto 2018-05-09 08:19:46 UTC
Created attachment 1433614 [details]
last engine log

Comment 5 Tal Nisan 2018-05-09 11:56:19 UTC
Israel, can you please send me the mount output for the domain?

Comment 6 Israel Pinto 2018-05-09 12:05:46 UTC
mount output:
[root@master-vds10 mnt]# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,seclabel,size=10145176k,nr_inodes=2536294,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,seclabel,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,cpuacct,cpu)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,hugetlb)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,pids)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,perf_event)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,memory)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,net_prio,net_cls)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,cpuset)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,devices)
configfs on /sys/kernel/config type configfs (rw,relatime)
/dev/mapper/VolGroup01-root on / type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
selinuxfs on /sys/fs/selinux type selinuxfs (rw,relatime)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=27,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=303)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime,seclabel)
hugetlbfs on /dev/hugepages1G type hugetlbfs (rw,relatime,seclabel,pagesize=1G)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,seclabel)
tmpfs on /tmp type tmpfs (rw,seclabel)
/dev/sda1 on /boot type ext3 (rw,relatime,seclabel,data=ordered)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
yellow-vdsb.qa.lab.tlv.redhat.com:/Compute_NFS/alukiano/compute-ge-he-1 on /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1 type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nolock,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,mountaddr=10.35.80.5,mountvers=3,mountport=20048,mountproto=udp,local_lock=all,addr=10.35.80.5)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,seclabel,size=2450972k,mode=700)

Comment 7 Tal Nisan 2018-05-09 15:59:51 UTC
Kevin, I can't understand which lock fails here, could you advise?

Comment 8 Ademar Reis 2018-05-09 16:11:43 UTC
(In reply to Tal Nisan from comment #7)
> Kevin, I can't understand which lock fails here, could you advise?

Changing the NEEDINFO to Fam, who is responsible for the locking code.

Comment 9 Fam Zheng 2018-05-10 14:56:39 UTC
Besides the qemu-kvm command invocation error, I see qemu-img convert error in the logs too:

> error while converting qcow2: Failed to get "write" lock\\nIs another process using the image?\\n\')',)

I'm wondering what exactly happens in the test case. Is it possible that multiple QEMU commands (either qemu-kvm or qemu-img) are executed concurrently? That is the usual trigger of the locking error.

Is it possible to grab output of "ps aux" when the error is raised?
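
To make the trigger concrete, here is a hedged diagnostic sketch (Python 3; it assumes the image path from the engine log above, a qemu-img new enough for the -U/--force-share option, i.e. QEMU 2.10+, and that lsof is installed) that shows the lock refusal and lists the processes holding the image open:

    # Hedged diagnostic sketch: check whether another process holds the qcow2
    # image open, which is what makes QEMU's image locking refuse the lock.
    import subprocess

    # Path taken from the engine log in the description (any in-use image will do).
    image = ('/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:'
             '_Compute__NFS_GE_compute-he-1_nfs__0/4bf1f6d3-b754-414e-85bc-78584a7f6b5d/'
             'images/877990dc-870f-4cff-9e23-3b7d3ed45cbe/2df5d314-f3b9-42fb-8984-4bfe8b22c3b8')

    # Without -U this is expected to fail with a 'Failed to get ... "write" lock'
    # error while a running qemu-kvm (or a concurrent qemu-img convert) has the
    # image open.
    locked = subprocess.run(['qemu-img', 'info', image],
                            capture_output=True, text=True)
    print(locked.returncode, locked.stderr.strip())

    # -U/--force-share (QEMU 2.10+) skips lock acquisition for read-only inspection.
    shared = subprocess.run(['qemu-img', 'info', '-U', image],
                            capture_output=True, text=True)
    print(shared.stdout)

    # List the processes that currently have the image open.
    holders = subprocess.run(['lsof', image], capture_output=True, text=True)
    print(holders.stdout)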

Comment 11 Israel Pinto 2018-05-13 14:00:35 UTC
[root@master-vds10 ~]# ps -aux | grep qemu | grep memory
qemu     17745  1.7  1.3 1890516 325164 ?      Sl   16:46   0:10 /usr/libexec/qemu-kvm -name guest=memory_hotplug_test,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-265-memory_hotplug_test/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off,dump-guest-core=off -cpu Conroe -m size=1048576k,slots=16,maxmem=16777216k -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0,mem=1024 -uuid 3e5c9189-4e59-456f-8277-eb40e47dc231 -smbios type=1,manufacturer=oVirt,product=RHEV Hypervisor,version=7.5-8.el7,serial=55157E90-C28C-11E0-983A-5CF3FC78F458,uuid=3e5c9189-4e59-456f-8277-eb40e47dc231 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-265-memory_hotplug_test/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2018-05-13T13:46:07,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=ua-5686e686-fe5a-427c-bfe4-cf6f9889ecd2,bus=pci.0,addr=0x6 -device virtio-serial-pci,id=ua-1396692c-e614-4f77-abed-43236884e746,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ua-0a6e36d3-853f-41ab-b466-0a2a8ddc2444,readonly=on,werror=report,rerror=report -device ide-cd,bus=ide.1,unit=0,drive=drive-ua-0a6e36d3-853f-41ab-b466-0a2a8ddc2444,id=ua-0a6e36d3-853f-41ab-b466-0a2a8ddc2444 -drive file=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-he-1_nfs__0/4bf1f6d3-b754-414e-85bc-78584a7f6b5d/images/946c884e-8318-4232-ae9a-b13a8cf9bfa4/4b473dd0-4bf2-438e-9fbc-f2602781e1cd,format=qcow2,if=none,id=drive-ua-946c884e-8318-4232-ae9a-b13a8cf9bfa4,serial=946c884e-8318-4232-ae9a-b13a8cf9bfa4,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-946c884e-8318-4232-ae9a-b13a8cf9bfa4,id=ua-946c884e-8318-4232-ae9a-b13a8cf9bfa4,bootindex=1 -netdev tap,fd=33,id=hostua-bb80cd20-b3c9-447a-8087-a7d4831e4cbe,vhost=on,vhostfd=35 -device virtio-net-pci,netdev=hostua-bb80cd20-b3c9-447a-8087-a7d4831e4cbe,id=ua-bb80cd20-b3c9-447a-8087-a7d4831e4cbe,mac=00:1a:4a:16:10:49,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/3e5c9189-4e59-456f-8277-eb40e47dc231.ovirt-guest-agent.0,server,nowait -device virtserialport,bus=ua-1396692c-e614-4f77-abed-43236884e746.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/3e5c9189-4e59-456f-8277-eb40e47dc231.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=ua-1396692c-e614-4f77-abed-43236884e746.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=ua-1396692c-e614-4f77-abed-43236884e746.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5901,tls-port=5902,addr=10.35.92.21,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -device qxl-vga,id=ua-b6bbbf35-55a6-4bc3-ab48-39c0d9343bcf,ram_size=67108864,vram_size=33554432,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device intel-hda,id=ua-bdc1eb5e-98ab-432a-b00e-eb34ecd8f5f5,bus=pci.0,addr=0x4 -device 
hda-duplex,id=ua-bdc1eb5e-98ab-432a-b00e-eb34ecd8f5f5-codec0,bus=ua-bdc1eb5e-98ab-432a-b00e-eb34ecd8f5f5.0,cad=0 -object rng-random,id=objua-38d1b121-6070-449b-9cbc-93bf94daf1b5,filename=/dev/urandom -device virtio-rng-pci,rng=objua-38d1b121-6070-449b-9cbc-93bf94daf1b5,id=ua-38d1b121-6070-449b-9cbc-93bf94daf1b5,bus=pci.0,addr=0x8 -device vmcoreinfo -msg timestamp=on


The VM was paused due to a storage I/O problem:
May 13, 2018, 4:25:12 PM VM memory_hotplug_test is down with error. Exit message: Wake up from hibernation failed:xml in virDomainDefineXML must not be NULL.

Adding logs.

Comment 12 Israel Pinto 2018-05-13 14:06:01 UTC
Created attachment 1435682 [details]
logs_13_5_18

Comment 14 Israel Pinto 2018-05-14 12:37:37 UTC
I rechecked the settings of the NFS storage domain (nfs_0) on the setup and found the following:
1. Retransmissions (retrans) was set to 2
2. Timeout (timeo) was set to 1

I reran the suspend/resume test on a VM with hot-plugged memory.
The VM came up after suspend in both cases (hot-plug and hot-unplug memory).

I think the errors we saw were caused by these settings (which were left over on the setup).
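
For future runs, a small sketch (assuming the /rhev/data-center/mnt/ mount-point convention seen in the mount output in comment 6) that prints the effective timeo/retrans options of the NFS storage domain mounts, so leftovers like timeo=1,retrans=2 stand out against values such as the timeo=600,retrans=6 seen there:

    # Hedged sketch: report the effective NFS timeout/retransmission options
    # for the oVirt storage domain mounts by parsing /proc/mounts.
    with open('/proc/mounts') as mounts:
        for line in mounts:
            device, mountpoint, fstype, options = line.split()[:4]
            if fstype.startswith('nfs') and mountpoint.startswith('/rhev/data-center/mnt/'):
                opts = dict(opt.partition('=')[::2] for opt in options.split(','))
                print(mountpoint)
                print('  timeo=%s retrans=%s' % (opts.get('timeo'), opts.get('retrans')))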

Comment 15 Tal Nisan 2018-05-14 14:46:17 UTC
Seems like it indeed, closing as not a bug

Comment 16 Ademar Reis 2018-05-17 16:49:03 UTC
Clearing old needinfo.