Description of problem:
Several issues observed on running VMs with the mdev_type hook after a host upgrade:
1. VM status switched to paused mode.
2. VM failed to resume, with the following engine.log error:

2018-04-16 11:32:36,935+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ResumeBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-24129) [353f0f5b-dd2f-4e7f-ad80-8d4b73dab766] Failed in 'ResumeBrokerVDS' method

3. VM failed to run after powering it off, with the following engine.log error:

2018-04-16 11:35:44,873+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [] EVENT_ID: VM_DOWN_ERROR(119), VM vGPU_RHEL7_03 is down with error. Exit message: Wake up from hibernation failed: Virtual machine already exists.

4. VM state is "running" on the host although it is powered off in Webadmin.
5. NVIDIA instances are still connected to the GPU and are not cleared after powering off the VM.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.3-0.1.el7
libvirt-client-3.9.0-14.el7_5.2.x86_64
sanlock-3.6.0-1.el7.x86_64
vdsm-4.20.25-1.el7ev.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.2.x86_64
NVIDIA-Linux-x86_64-390.21-vgpu-kvm
kernel-3.10.0-851.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run a VM with the mdev_type hook.
2. Restart the vdsm service.

Actual results:
1. VM status changed to paused mode.
2. VM cannot be resumed.
3. VM failed to run after powering it off.
4. VM remains in "running" state on the host.
5. NVIDIA instance is still connected to the GPU.

Expected results:
The VM should continue to run, and if it switched to paused mode it should be possible to resume it.

Additional info:
vdsm.log and engine.log attached
Created attachment 1422389 [details] engine.log
Created attachment 1422390 [details] vdsm.log
(In reply to Nisim Simsolo from comment #0)
> Description of problem:
> Few issues observed on running VMs with mdev_type hook after host upgrade:

What upgrade?

> 1. VMs status switched to pause mode.
> 2. VM failed to resume with the next engine.log

What resume? How is that related to the upgrade or to the reproduction steps below?

> 2018-04-16 11:32:36,935+03 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.ResumeBrokerVDSCommand]
> (EE-ManagedThreadFactory-engine-Thread-24129)
> [353f0f5b-dd2f-4e7f-ad80-8d4b73dab766] Failed in 'ResumeBrokerVDS' method
>
> 3. VM failed to run after powering it off with the next engine.log

Why is the resume here again? Was it powered off or suspended?

> 2018-04-16 11:35:44,873+03 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (ForkJoinPool-1-worker-7) [] EVENT_ID: VM_DOWN_ERROR(119), VM vGPU_RHEL7_03
> is down with error. Exit message: Wake up from hibernation failed:Virtual
> machine already exists.
>
> 4. VM state is running in host although it's powered off in Webadmin.

Checked how exactly?

> 5. Nvidia instances are still connected to GPU and not cleared after
> powering off VM.

Checked how exactly?

> Steps to Reproduce:
> 1. Run VM with mdev_type hook
> 2. Restart vdsm service

The steps do not match what you describe above.
(In reply to Michal Skrivanek from comment #3)
> What upgrade?

I meant the host update from rhv-4.2.2-10 to rhv-4.2.3-1 (VDSM updated from vdsm-4.20.23-1.el7ev.x86_64 to vdsm-4.20.25-1.el7ev.x86_64).

> What resume? How is that related to upgrade or reproduction steps below?

The VM state switched from "up" to "paused" after the update, and it then failed to resume. It is related because the VM cannot be run anymore.

> Why is the resume here again, was it powered off or suspended?

I powered it off because resume failed; running it from the powered-off state failed as well.

> > 4. VM state is running in host although it's powered off in Webadmin.
> Checked how exactly?

# virsh -r list
# ps -aux | grep qemu

> > 5. Nvidia instances are still connected to GPU and not cleared after
> > powering off VM.
> Checked how exactly?

# nvidia-smi

> Steps do not match what you say above

The same behavior is observed in both cases, after a host update and after restarting the vdsm service.
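Besides nvidia-smi, leftover vGPU instances can also be checked by listing the mdev devices registered in sysfs. The following is a minimal sketch: /sys/bus/mdev/devices is the standard kernel sysfs location for mediated devices, and list_mdevs is a hypothetical helper name, not a command from this bug report.

```shell
# List mdev instance UUIDs still registered on the host.
# /sys/bus/mdev/devices is the standard sysfs location for mdev devices;
# an optional argument overrides it (useful for testing).
list_mdevs() {
    base="${1:-/sys/bus/mdev/devices}"
    if [ ! -d "$base" ]; then
        echo "no mdev devices directory" >&2
        return 0
    fi
    ls -1 "$base"
}
```

After a vGPU VM is powered off cleanly this listing should be empty; in the buggy state described above, the instance UUID remains listed even though the VM is down in Webadmin.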
I can confirm this happens, and more specifically it happens due to VDSM trying to re-create a hostdev object from an unknown device (the mdev is only known to the hook, not to VDSM core). The traceback can be seen in the VDSM log:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2777, in _run
    self._devices = self._make_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2627, in _make_devices
    return self._make_devices_from_xml(disk_objs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2715, in _make_devices_from_xml
    self.id, self.domain, self._md_desc, self.log
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/common.py", line 219, in dev_map_from_domain_xml
    dev_obj = dev_class.from_xml_tree(log, dev_elem, dev_meta)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/hostdevice.py", line 381, in from_xml_tree
    dev_name = _get_device_name(dev, dev_type)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/hostdevice.py", line 407, in _get_device_name
    return device_name_from_address(dev_type, src_addr)
  File "/usr/lib/python2.7/site-packages/vdsm/common/hostdev.py", line 552, in device_name_from_address
    _format_address(address_type, device_address)
KeyError: 'mdev_uuid294fb789-98c7-3d30-8c92-4047252328a4'
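The failure pattern in the traceback can be illustrated with a minimal sketch. This is not the actual VDSM code: FORMATTERS, device_name_from_address, and safe_device_name_from_address are simplified stand-in names. The idea is that a per-address-type formatter table is consulted for every hostdev found in the recovered domain XML; an mdev address type known only to the hook has no entry, so the lookup raises KeyError and VM recovery aborts.

```python
# Simplified illustration of the failure: a dispatch table keyed by
# hostdev address type. 'mdev' is injected by the hook, so the core
# code has no formatter registered for it.
FORMATTERS = {
    'pci': lambda a: 'pci_{domain}_{bus}_{slot}_{function}'.format(**a),
    'usb': lambda a: 'usb_{bus}_{device}'.format(**a),
    'scsi': lambda a: 'scsi_{host}_{bus}_{target}_{lun}'.format(**a),
}

def device_name_from_address(address_type, address):
    # Buggy behavior: an unknown type such as 'mdev' raises KeyError,
    # which aborts VM recovery after a vdsm restart.
    return FORMATTERS[address_type](address)

def safe_device_name_from_address(address_type, address):
    # Hedged fix: treat unknown address types as opaque instead of
    # failing hard, so recovery can skip hook-managed devices.
    formatter = FORMATTERS.get(address_type)
    if formatter is None:
        return None  # caller skips devices it cannot identify
    return formatter(address)
```

Under this assumption, the fix amounts to tolerating address types the core does not recognize rather than failing the whole recovery path; the actual change shipped in vdsm-4.20.27 may differ in detail.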
Verified:
rhvm-4.2.3.3-0.1
vdsm-4.20.27-1.el7ev.x86_64
libvirt-client-3.9.0-14.el7_5.2.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.2.x86_64
kernel-3.10.0-799.el7.x86_64
NVIDIA-Linux-x86_64-390.21
GPU type: Tesla M60

Verification scenario:
1. Run some VMs with the mdev_type hook. Open a console on these VMs.
2. Restart the vdsm service. Verify the VMs continue to run, the console remains open, and the VMs remain in "up" state in Webadmin.
3. Reboot the host. In Webadmin, verify the VMs' state switched to "down".
4. After the host has rebooted, verify all VMs are still down. Run the VMs. Verify the VMs run properly with the mdev_type hook.
This bug is included in the oVirt 4.2.3 release, published on May 4th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.3, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.