Description of problem:
Failed to start a VM with hot-plugged memory after suspension.

Version-Release number of selected component (if applicable):
Engine Version: 4.2.3.5-0.1.el7
Host:
OS Version: RHEL - 7.5 - 8.el7
Kernel Version: 3.10.0-862.el7.x86_64
KVM Version: 2.10.0-21.el7_5.2
LIBVIRT Version: libvirt-3.9.0-14.el7_5.3
VDSM Version: vdsm-4.20.27.1-1.el7ev

Steps to Reproduce:
1. Start the VM and hot-plug memory into it
2. Suspend the VM
3. Resume the VM

Actual results:
The VM fails to start.

Additional info:
The connection to the NFS server is OK:

[root@master-vds10 ~]# ping yellow-vdsb.qa.lab.tlv.redhat.com
PING yellow-vdsb.qa.lab.tlv.redhat.com (10.35.80.5) 56(84) bytes of data.
64 bytes from yellow-vdsb.qa.lab.tlv.redhat.com (10.35.80.5): icmp_seq=1 ttl=63 time=0.160 ms
64 bytes from yellow-vdsb.qa.lab.tlv.redhat.com (10.35.80.5): icmp_seq=2 ttl=63 time=0.143 ms

Engine log:

2018-05-09 09:38:56,484+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-2) [] add VM '0c85e9db-fbb8-4cad-9d6b-4482024dc762'(memory_hotplug_test) to rerun treatment
2018-05-09 09:38:56,491+03 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (ForkJoinPool-1-worker-2) [] Rerun VM '0c85e9db-fbb8-4cad-9d6b-4482024dc762'. Called from VDS 'host_mixed_1'
2018-05-09 09:38:56,514+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-21315) [] EVENT_ID: USER_INITIATED_RUN_VM_FAILED(151), Failed to run VM memory_hotplug_test on Host host_mixed_1.
...
2018-05-09 09:38:59,552+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-4) [] VM '0c85e9db-fbb8-4cad-9d6b-4482024dc762'(memory_hotplug_test) moved from 'WaitForLaunch' --> 'Down'
2018-05-09 09:38:59,582+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-4) [] EVENT_ID: VM_DOWN_ERROR(119), VM memory_hotplug_test is down with error. Exit message: internal error: process exited while connecting to monitor: 2018-05-09T06:38:58.467969Z qemu-kvm: -drive file=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-he-1_nfs__0/4bf1f6d3-b754-414e-85bc-78584a7f6b5d/images/877990dc-870f-4cff-9e23-3b7d3ed45cbe/2df5d314-f3b9-42fb-8984-4bfe8b22c3b8,format=qcow2,if=none,id=drive-ua-877990dc-870f-4cff-9e23-3b7d3ed45cbe,serial=877990dc-870f-4cff-9e23-3b7d3ed45cbe,cache=none,werror=stop,rerror=stop,aio=threads: Failed to get "write" lock
Is another process using the image?
Vdsm log:

2018-05-09 09:38:54,837+0300 ERROR (vm/0c85e9db) [virt.vm] (vmId='0c85e9db-fbb8-4cad-9d6b-4482024dc762') The vm start process failed (vm:943)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 887, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4067, in _completeIncomingMigration
    self.cont()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1525, in cont
    self._underlyingCont()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4189, in _underlyingCont
    self._dom.resume()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1977, in resume
    if ret == -1: raise libvirtError ('virDomainResume() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'cont': Failed to get "write" lock
2018-05-09 09:38:54,838+0300 INFO (vm/0c85e9db) [virt.vm] (vmId='0c85e9db-fbb8-4cad-9d6b-4482024dc762') Changed state to Down: internal error: unable to execute QEMU command 'cont': Failed to get "write" lock (code=1) (vm:1683)
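For reference: QEMU 2.10 (the version on this host) takes file locks on its disk images, and 'Failed to get "write" lock' normally means some other process still has the image open for writing. A minimal check on the host could look like the following; this is only a diagnostic sketch (lsof may not be installed, and "qemu-img info -U"/--force-share is only available from QEMU 2.10 on), the image path is the one from the engine log above:

IMG=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-he-1_nfs__0/4bf1f6d3-b754-414e-85bc-78584a7f6b5d/images/877990dc-870f-4cff-9e23-3b7d3ed45cbe/2df5d314-f3b9-42fb-8984-4bfe8b22c3b8
# any process that still holds the image open
lsof "$IMG"
# all running qemu processes on the host
ps aux | grep '[q]emu'
# read-only query that bypasses the image lock
qemu-img info -U "$IMG"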
Created attachment 1433607 [details] host-var/run logs
Created attachment 1433608 [details] host-var/log logs
Created attachment 1433609 [details] engine log
Created attachment 1433614 [details] last engine log
Israel, can you please send me the mount output for the domain?
mount output:

[root@master-vds10 mnt]# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,seclabel,size=10145176k,nr_inodes=2536294,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,seclabel,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,cpuacct,cpu)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,hugetlb)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,pids)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,perf_event)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,memory)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,net_prio,net_cls)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,cpuset)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,seclabel,devices)
configfs on /sys/kernel/config type configfs (rw,relatime)
/dev/mapper/VolGroup01-root on / type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
selinuxfs on /sys/fs/selinux type selinuxfs (rw,relatime)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=27,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=303)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime,seclabel)
hugetlbfs on /dev/hugepages1G type hugetlbfs (rw,relatime,seclabel,pagesize=1G)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,seclabel)
tmpfs on /tmp type tmpfs (rw,seclabel)
/dev/sda1 on /boot type ext3 (rw,relatime,seclabel,data=ordered)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
yellow-vdsb.qa.lab.tlv.redhat.com:/Compute_NFS/alukiano/compute-ge-he-1 on /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1 type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,nolock,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,mountaddr=10.35.80.5,mountvers=3,mountport=20048,mountproto=udp,local_lock=all,addr=10.35.80.5)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,seclabel,size=2450972k,mode=700)
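Just as a convenience for anyone looking at this later (nothing here beyond what is already in the output above), the NFS storage-domain mounts and their lock/timeout-related options can be filtered out directly on the host:

# only the NFS mounts and their options
grep ' nfs ' /proc/mounts
# or just the options of interest for this server
mount | grep yellow-vdsb | grep -o 'soft\|nolock\|local_lock=[a-z]*\|timeo=[0-9]*\|retrans=[0-9]*'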
Kevin, I can't understand which lock fails here, could you advise?
(In reply to Tal Nisan from comment #7)
> Kevin, I can't understand which lock fails here, could you advise?

Changing the NEEDINFO to Fam, who is responsible for the locking code.
Besides the qemu-kvm command invocation error, I also see a qemu-img convert error in the logs:

> error while converting qcow2: Failed to get "write" lock\\nIs another process using the image?\\n\')',)

I'm wondering what exactly happens in the test case. Is it possible that multiple QEMU commands (either qemu-kvm or qemu-img) are executed concurrently? That is the usual trigger of this locking error. Is it possible to grab the output of "ps aux" when the error is raised?
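If catching the exact moment of the failure is difficult, a small sampling loop on the host could record which qemu processes are running and which file locks are held around the resume. This is only a suggestion, not something from the original report; lslocks comes from util-linux and the log path is arbitrary:

while true; do
    date
    ps aux | grep '[q]emu'
    lslocks | grep -i qemu
    sleep 1
done >> /tmp/qemu-lock-debug.log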
[root@master-vds10 ~]# ps -aux | grep qemu | grep memory qemu 17745 1.7 1.3 1890516 325164 ? Sl 16:46 0:10 /usr/libexec/qemu-kvm -name guest=memory_hotplug_test,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-265-memory_hotplug_test/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off,dump-guest-core=off -cpu Conroe -m size=1048576k,slots=16,maxmem=16777216k -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0,mem=1024 -uuid 3e5c9189-4e59-456f-8277-eb40e47dc231 -smbios type=1,manufacturer=oVirt,product=RHEV Hypervisor,version=7.5-8.el7,serial=55157E90-C28C-11E0-983A-5CF3FC78F458,uuid=3e5c9189-4e59-456f-8277-eb40e47dc231 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-265-memory_hotplug_test/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2018-05-13T13:46:07,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=ua-5686e686-fe5a-427c-bfe4-cf6f9889ecd2,bus=pci.0,addr=0x6 -device virtio-serial-pci,id=ua-1396692c-e614-4f77-abed-43236884e746,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ua-0a6e36d3-853f-41ab-b466-0a2a8ddc2444,readonly=on,werror=report,rerror=report -device ide-cd,bus=ide.1,unit=0,drive=drive-ua-0a6e36d3-853f-41ab-b466-0a2a8ddc2444,id=ua-0a6e36d3-853f-41ab-b466-0a2a8ddc2444 -drive file=/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_GE_compute-he-1_nfs__0/4bf1f6d3-b754-414e-85bc-78584a7f6b5d/images/946c884e-8318-4232-ae9a-b13a8cf9bfa4/4b473dd0-4bf2-438e-9fbc-f2602781e1cd,format=qcow2,if=none,id=drive-ua-946c884e-8318-4232-ae9a-b13a8cf9bfa4,serial=946c884e-8318-4232-ae9a-b13a8cf9bfa4,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-946c884e-8318-4232-ae9a-b13a8cf9bfa4,id=ua-946c884e-8318-4232-ae9a-b13a8cf9bfa4,bootindex=1 -netdev tap,fd=33,id=hostua-bb80cd20-b3c9-447a-8087-a7d4831e4cbe,vhost=on,vhostfd=35 -device virtio-net-pci,netdev=hostua-bb80cd20-b3c9-447a-8087-a7d4831e4cbe,id=ua-bb80cd20-b3c9-447a-8087-a7d4831e4cbe,mac=00:1a:4a:16:10:49,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/3e5c9189-4e59-456f-8277-eb40e47dc231.ovirt-guest-agent.0,server,nowait -device virtserialport,bus=ua-1396692c-e614-4f77-abed-43236884e746.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/3e5c9189-4e59-456f-8277-eb40e47dc231.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=ua-1396692c-e614-4f77-abed-43236884e746.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=ua-1396692c-e614-4f77-abed-43236884e746.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5901,tls-port=5902,addr=10.35.92.21,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -device qxl-vga,id=ua-b6bbbf35-55a6-4bc3-ab48-39c0d9343bcf,ram_size=67108864,vram_size=33554432,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device 
intel-hda,id=ua-bdc1eb5e-98ab-432a-b00e-eb34ecd8f5f5,bus=pci.0,addr=0x4 -device hda-duplex,id=ua-bdc1eb5e-98ab-432a-b00e-eb34ecd8f5f5-codec0,bus=ua-bdc1eb5e-98ab-432a-b00e-eb34ecd8f5f5.0,cad=0 -object rng-random,id=objua-38d1b121-6070-449b-9cbc-93bf94daf1b5,filename=/dev/urandom -device virtio-rng-pci,rng=objua-38d1b121-6070-449b-9cbc-93bf94daf1b5,id=ua-38d1b121-6070-449b-9cbc-93bf94daf1b5,bus=pci.0,addr=0x8 -device vmcoreinfo -msg timestamp=on

Events from the same run:
VM has been paused due to storage I/O problem (May 13, 2018, 4:25:12 PM)
VM memory_hotplug_test is down with error. Exit message: Wake up from hibernation failed: xml in virDomainDefineXML must not be NULL.

Adding logs.
Created attachment 1435682 [details] logs_13_5_18
I rechecked the settings of the NFS domain (nfs_0) on the setup and found the following:
1. Retransmissions was set to 2
2. Timeout was set to 1

I reran the suspend/resume test with hot-plugged memory. The VM comes up after suspend in both cases (memory hot-plug and hot-unplug).
I think the errors we saw were caused by these settings (which were a leftover on the setup).
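If I read those fields correctly, they map to the NFS mount options retrans= and timeo= (timeo is in tenths of a second), so retrans=2 with timeo=1 means the client gives up after very short retries, which would fit the storage I/O errors above; the working domain in the mount output earlier in this bug uses timeo=600 and retrans=6. A quick way to confirm what a host actually mounted with (illustrative command, not taken from the report):

grep -o 'timeo=[0-9]*,retrans=[0-9]*' /proc/mounts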
Seems like it indeed, closing as not a bug
Clearing old needinfo.