Description of problem:
On a cluster with an OVS switch, a VM that is using an OVN network changes its state to 'Not Responding'.

Version-Release number of selected component (if applicable):
vdsm-4.20.17-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an OVS cluster and choose ovirt-provider-ovn as the external network provider.
2. Move an existing host from a legacy-switch cluster to the OVS cluster.
3. Reinstall the host (to initiate the OVN host setup).
4. Attach a physical network to the host.
5. Add a VM to the cluster and attach an OVN network that has the physical network attached to it.
6. Start the VM.
7. Stop the VM (see the SDK sketch under Additional info below).

Actual results:
VM becomes 'Not Responding'.

Expected results:
Should work as expected.

Additional info:
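For reference, steps 6-7 can be scripted against the engine REST API. The following is a minimal sketch using the oVirt Python SDK (ovirtsdk4); the engine URL, credentials and the VM name 'test_vm' are placeholders, not taken from this bug:

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder engine endpoint and credentials.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=test_vm')[0]
vm_service = vms_service.vm_service(vm.id)

vm_service.start()  # step 6
# Wait until the VM is actually up before stopping it.
while vms_service.list(search='name=test_vm')[0].status != types.VmStatus.UP:
    time.sleep(5)
vm_service.stop()   # step 7 -- this is where the VM went 'Not Responding'

connection.close()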
vdsm.log shows:

2018-01-28 11:01:07,015+0200 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/3 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f6bfc0d3750> at 0x7f6bfc0d3790> timeout=15, duration=15 at 0x7f6bc257fe10> task#=397 at 0x7f6bfc10c8d0>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 213, in __call__
  self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 578, in __call__
  stats = hostapi.get_stats(self._cif, self._samples.stats())
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 62, in get_stats
  avail, commit = _memUsageInfo(cif)
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 121, in _memUsageInfo
  mem_info = v.memory_info()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 814, in memory_info
  dom_stats = self._dom.memoryStats()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
  ret = attr(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
  ret = f(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
  return func(inst, *args, **kwargs)
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 1510, in memoryStats
  ret = libvirtmod.virDomainMemoryStats(self._o) (executor:363)
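If this reproduces, it may be worth timing the blocked call directly, outside of vdsm, to confirm that virDomainMemoryStats() itself is the slow call. A minimal diagnostic sketch, assuming libvirt-python is installed on the host; the domain name 'test_vm' is a placeholder:

import time
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('test_vm')

start = time.time()
stats = dom.memoryStats()  # the call the blocked vdsm worker is stuck in
print('memoryStats took %.2fs: %s' % (time.time() - start, stats))

conn.close()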
Created attachment 1387135 [details]
vdsm supervdsm engine logs
engine.log shows:

2018-01-28 11:01:53,501+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] Failed in 'DestroyVDS' method
2018-01-28 11:01:53,508+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command DestroyVDS failed: Virtual machine destroy error
2018-01-28 11:01:53,508+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand' return value 'StatusOnlyReturn [status=Status [code=42, message=Virtual machine destroy error]]'
2018-01-28 11:01:53,508+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] HostName = host_mixed_3
2018-01-28 11:01:53,508+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] Command 'DestroyVDSCommand(HostName = host_mixed_3, DestroyVmVDSCommandParameters:{hostId='ac9a9fd3-f267-46a4-a450-b1cdd85ecdc1', vmId='77debc12-c0c3-40c3-8663-29768bd4cdfc', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'})' execution failed: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42
2018-01-28 11:01:53,508+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] FINISH, DestroyVDSCommand, log id: 8992948
2018-01-28 11:01:53,509+02 ERROR [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] Failed to destroy VM '77debc12-c0c3-40c3-8663-29768bd4cdfc' in VDS = 'ac9a9fd3-f267-46a4-a450-b1cdd85ecdc1', error = 'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42'
2018-01-28 11:01:53,509+02 INFO [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] FINISH, DestroyVmVDSCommand, log id: 18abc2fb
2018-01-28 11:01:53,510+02 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] Command 'org.ovirt.engine.core.bll.StopVmCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42 (Failed with error destroyErr and code 42)
2018-01-28 11:01:53,541+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] EVENT_ID: USER_FAILED_STOP_VM(56), Failed to power off VM test_vm (Host: host_mixed_3, User: admin@internal-authz).
I also tried it with a clean RHEL 7.5 host added directly to an OVS cluster; I get the same error.
RHV version: 4.2.1.3-0.1.el7
Nothing that is specific to the OVS switchType is really urgent. If a VM becomes 'Not Responsive', it is most likely a qemu or libvirt bug, rather than anything related to OVS.

Is this limited to OVN networks?
Can you reproduce it with no vNICs at all?
Can you provide the versions of kernel, qemu and libvirt (as in any bug that relates to VM non-responsiveness)?
You are right, apparently it is not specific to the OVS cluster at all. It is reproducible on a VM without vNICs.

Kernel version:
Linux vega04.qa.lab.tlv.redhat.com 3.10.0-830.el7.x86_64 #1 SMP Mon Jan 15 17:16:22 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Release:
Red Hat Enterprise Linux Server 7.5 Beta (Maipo)

qemu-kvm-rhev-2.10.0-18.el7.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-img-rhev-2.10.0-18.el7.x86_64
qemu-kvm-common-rhev-2.10.0-18.el7.x86_64
libvirt-daemon-driver-qemu-3.9.0-7.el7.x86_64
libvirt-3.9.0-7.el7.x86_64
So, it seems to me that regardless of whether there are vNICs or not, when we shut down a VM:
1. the shutdown fails
2. the VM becomes unresponsive

The shutdown fails because the VM becomes unresponsive, see https://bugzilla.redhat.com/show_bug.cgi?id=1532277. The real problem is why the VM becomes unresponsive in the first place.

We had a (bad) bug in virDomainGetMemoryStats(), which had been present in libvirt forever, but which Vdsm only started to use fairly recently. Let's make sure we have this fix: https://gerrit.ovirt.org/#/c/85220/

If it happens again even with the right libvirt-python bindings *and* the error is still virDomainGetMemoryStats() being slow/unresponsive, then it is a libvirt issue, because that API is supposed to fetch data from procfs/sysfs (i.e. no QEMU or shared storage access), and thus should be fast and safe (see the sketch below).

*** This bug has been marked as a duplicate of bug 1532277 ***
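To tell a slow libvirt apart from a vdsm-side problem, the call can be bounded with a timeout in the same spirit as vdsm's executor (timeout=15 in the warning above). This is only a rough sketch, assuming the concurrent.futures backport (python-futures) is available on Python 2.7; 'test_vm' is again a placeholder domain name:

from concurrent.futures import ThreadPoolExecutor, TimeoutError
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('test_vm')

executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(dom.memoryStats)
try:
    print('memoryStats returned: %s' % future.result(timeout=15))
except TimeoutError:
    # A healthy virDomainMemoryStats() reads procfs/sysfs and should never
    # take this long; a timeout here points at libvirt, not at vdsm.
    print('virDomainMemoryStats did not return within 15s')
executor.shutdown(wait=False)  # do not wait for the stuck thread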