Description of problem:
Suspending a VM with an NVDIMM device attached hangs forever (the VM stays in "saving" state). Eventually the VM cannot be used again and there is no reasonable workaround: any action on the VM such as power off or resume is grayed out, and the host cannot be moved to maintenance.

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.3.8-0.1.el8ev.noarch
vdsm-4.40.35-1.el8ev.x86_64
libvirt-daemon-6.6.0-6.module+el8.3.0+8125+aefcf088.x86_64
qemu-kvm-5.1.0-13.module+el8.3.0+8382+afc3bbea.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run a VM with an NVDIMM device attached.
2. Suspend the VM.

Actual results:
Suspending the VM hangs forever.

Expected results:
Suspending the VM should succeed, or be declined by WebAdmin if suspending a VM with an NVDIMM is not allowed.

Additional info:
Attached vdsm.log and engine.log (VM suspended at:
2020-11-15 14:54:22,371+02 INFO [org.ovirt.engine.core.bll.HibernateVmCommand] (EE-ManagedThreadFactory-engine-Thread-546738) [5c759d51-ca38-478c-a50b-6d33e0b94ec6] Running command: HibernateVmCommand internal: false. Entities affected : ID: a125d2eb-91fa-4399-b931-6c1aea6d9d55 Type: VMAction group HIBERNATE_VM with role type USER)
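For reference, the suspend can also be triggered outside WebAdmin through the oVirt REST API; a minimal sketch, assuming the VM id a125d2eb-91fa-4399-b931-6c1aea6d9d55 from the log above, with the engine FQDN and the admin password as placeholders:

    # Suspend (hibernate) the VM via the REST API instead of WebAdmin;
    # engine.example.com and PASSWORD are placeholders, not values from this report.
    curl -k -u admin@internal:PASSWORD \
      -H "Content-Type: application/xml" \
      -d "<action/>" \
      https://engine.example.com/ovirt-engine/api/vms/a125d2eb-91fa-4399-b931-6c1aea6d9d55/suspend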
Created attachment 1729522 [details] qemu log
Created attachment 1729523 [details] vdsm.log
Created attachment 1729524 [details] engine.log
It works for me, with a much smaller NVDIMM (~5 GB) than yours (~256 GB). Saving state takes about half a minute for my fully emulated device. I can see in the attached logs that in your case it's still saving state after more than an hour; there is no relevant error and no end.

Nisim, what NVDIMM modes did you use on the host and in the guest? And it is a hardware device, right? How long did you actually wait? Would it be possible to retest with fsdax and devdax modes (the latter requires switching SELinux to permissive mode)? I think it would be interesting to see if it happens with those modes too.
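For reference, a rough sketch of how the namespace mode could be switched for such a retest, assuming ndctl is available on the host; the namespace name namespace0.0 is a placeholder for whatever the actual device is:

    # Reconfigure the existing namespace to fsdax (filesystem DAX) mode
    ndctl create-namespace --force --reconfig=namespace0.0 --mode=fsdax

    # ...or to devdax (device DAX) mode, which exposes a /dev/daxX.Y character device
    ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax

    # devdax currently needs SELinux in permissive mode on the host
    setenforce 0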
> Nisim, what NVDIMM modes did you use on the host and in the guest? And it is
> a hardware device, right?

HW device, in devdax mode.

> How long did you actually wait?

More than 2 hours.

> Would it be possible to retest with fsdax and devdax modes (the latter requires
> switching SELinux to permissive mode)? I think it would be interesting to
> see if it happens with those modes too.

Yes, I will update you with the outcome.
Nisim, did you have a chance to test with the different modes already?
(In reply to Milan Zamazal from comment #6)
> Nisim, did you have a chance to test with the different modes already?

Yes, it behaves the same when using fsdax and devdax (with permissive SELinux).
Thanks for testing. A QEMU bug has been filed: https://bugzilla.redhat.com/1902691
Let's disable suspending VMs with NVDIMMs for now, see Bug 1912426. We will handle this bug and enable suspending VMs with NVDIMMs again once a platform fix is available.
Verified with libvirt upstream code version v7.0.0-rc1 & qemu-kvm-5.1.0-17.module+el8.3.1+9213+7ace09c3.x86_64

Start vm with the below xml -

    <memory model='nvdimm' access='shared'>
      <source>
        <path>/dev/dax0.0</path>
        <alignsize unit='KiB'>2048</alignsize>
        <pmem/>
      </source>
      <target>
        <size unit='KiB'>262144000</size>
        <node>0</node>
        <label>
          <size unit='KiB'>128</size>
        </label>
      </target>
      <address type='dimm' slot='0'/>
    </memory>

From the qemu command line, there is no "prealloc" and there is no long waiting time when issuing the "start vm" command:

    -object memory-backend-file,id=memnvdimm0,mem-path=/dev/dax0.0,share=yes,size=268435456000,align=2097152,pmem=yes
    -device nvdimm,node=0,label-size=131072,memdev=memnvdimm0,id=nvdimm0,slot=0
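For anyone re-checking this, a minimal sketch of how the absence of "prealloc" can be confirmed, assuming the domain XML is saved as nvdimm-vm.xml (the file name and VM name below are placeholders):

    # Translate the domain XML to the QEMU command line libvirt would generate
    # and check that no prealloc option shows up (no output expected)
    virsh domxml-to-native qemu-argv --xml nvdimm-vm.xml | grep prealloc

    # For an already running VM, the generated command line can also be
    # inspected in the per-VM QEMU log on the host
    grep prealloc /var/log/libvirt/qemu/<vm-name>.log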
The bot shouldn't change the status just because a bug is mentioned anywhere in the commit message...
Closing, since the platform bugs have not been prioritized for el8. If we upgrade to el9 and the dependent platform bugs are resolved, we should handle this bug then.