Bug 1539360

Summary: VM becomes 'Not Responding' after shutdown
Product: [oVirt] vdsm
Component: Core
Version: 4.20.15
Status: CLOSED DUPLICATE
Severity: high
Priority: medium
Reporter: Mor <mkalfon>
Assignee: Francesco Romani <fromani>
QA Contact: Mor <mkalfon>
CC: bugs, mkalfon
Flags: rule-engine: ovirt-4.2+
Target Milestone: ovirt-4.2.2
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2018-01-30 15:28:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Network
Cloudforms Team: ---

Attachments: vdsm supervdsm engine logs

Description Mor 2018-01-28 09:09:20 UTC
Description of problem:
On a cluster with the OVS switch type, a VM that uses an OVN network changes its state to 'Not Responding' after it is stopped.

Version-Release number of selected component (if applicable):
vdsm-4.20.17-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create OVS cluster and choose ovirt-provider-ovn for external network provider.
2. Change existing host from legacy switch cluster to OVS cluster.
3. Reinstall the host (to initiate OVN host setup).
4. Attach a physical network to the host.
5. Add a VM to the cluster and attach the OVN network (the one with the physical network attached) to the VM.
6. Start VM.
7. Stop VM.
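
For reference, steps 6-7 can also be driven through the REST API. Below is a minimal sketch using the Python ovirt-engine-sdk4; the engine URL, credentials, and the VM name 'test_vm' are placeholders for this environment, not values taken from the attached logs' configuration.

# Minimal sketch, not a verified reproducer; assumes ovirt-engine-sdk4 is
# installed and that the placeholder URL/credentials/VM name are replaced.
import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder
    username='admin@internal',                          # placeholder
    password='password',                                 # placeholder
    insecure=True,                                       # lab setup only
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=test_vm')[0]
vm_service = vms_service.vm_service(vm.id)

vm_service.start()                                       # step 6: start VM
while vms_service.list(search='name=test_vm')[0].status != types.VmStatus.UP:
    time.sleep(5)

vm_service.stop()                                        # step 7: stop VM
# In the failing case the engine reports a DestroyVDS failure and the VM
# turns 'Not Responding' instead of going down.

connection.close()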

Actual results:
VM becomes 'Not Responding'

Expected results:
The VM should stop cleanly and move to the 'Down' state.

Additional info:

Comment 1 Mor 2018-01-28 09:12:54 UTC
vdsm.log shows:
2018-01-28 11:01:07,015+0200 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/3 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f6bfc0d3750> at 0x7f6bfc0d3790> timeout=15, duration=15 at 0x7f6bc257fe10> task#=397 at 0x7f6bfc10c8d0>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 213, in __call__
  self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 578, in __call__
  stats = hostapi.get_stats(self._cif, self._samples.stats())
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 62, in get_stats
  avail, commit = _memUsageInfo(cif)
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 121, in _memUsageInfo
  mem_info = v.memory_info()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 814, in memory_info
  dom_stats = self._dom.memoryStats()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
  ret = attr(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
  ret = f(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
  return func(inst, *args, **kwargs)
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 1510, in memoryStats
  ret = libvirtmod.virDomainMemoryStats(self._o) (executor:363)
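
For context, the blocked frames above end in the plain libvirt-python memory-stats call made by the periodic HostMonitor sampling. A minimal standalone sketch of that same call, assuming a local qemu:///system connection and 'test_vm' as a placeholder domain name:

# Minimal sketch of the call the blocked worker is stuck in; assumes
# libvirt-python is installed and a domain named 'test_vm' is running locally.
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('test_vm')   # placeholder domain name

# Same API as vdsm's Vm.memory_info() -> self._dom.memoryStats(); it returns
# a dict of counters (e.g. 'actual', 'rss') and is normally quick. In this
# bug it hangs, so the executor logs "Worker blocked" after the 15s timeout.
stats = dom.memoryStats()
print(stats)

conn.close()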

Comment 2 Mor 2018-01-28 09:13:35 UTC
Created attachment 1387135 [details]
vdsm supervdsm engine logs

Comment 3 Mor 2018-01-28 09:16:03 UTC
engine.log shows:
2018-01-28 11:01:53,501+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] Failed in 'DestroyVDS' method
2018-01-28 11:01:53,508+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command DestroyVDS failed: Virtual machine destroy error
2018-01-28 11:01:53,508+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand' return value 'StatusOnlyReturn [status=Status [code=42, message=Virtual machine destroy error]]'
2018-01-28 11:01:53,508+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] HostName = host_mixed_3
2018-01-28 11:01:53,508+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] Command 'DestroyVDSCommand(HostName = host_mixed_3, DestroyVmVDSCommandParameters:{hostId='ac9a9fd3-f267-46a4-a450-b1cdd85ecdc1', vmId='77debc12-c0c3-40c3-8663-29768bd4cdfc', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'})' execution failed: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42
2018-01-28 11:01:53,508+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] FINISH, DestroyVDSCommand, log id: 8992948
2018-01-28 11:01:53,509+02 ERROR [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] Failed to destroy VM '77debc12-c0c3-40c3-8663-29768bd4cdfc' in VDS = 'ac9a9fd3-f267-46a4-a450-b1cdd85ecdc1' , error = 'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42'
2018-01-28 11:01:53,509+02 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] FINISH, DestroyVmVDSCommand, log id: 18abc2fb
2018-01-28 11:01:53,510+02 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] Command 'org.ovirt.engine.core.bll.StopVmCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42 (Failed with error destroyErr and code 42)
2018-01-28 11:01:53,541+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-94346) [296ccc8b-3f2e-429c-9d7f-a69a9b1c9397] EVENT_ID: USER_FAILED_STOP_VM(56), Failed to power off VM test_vm (Host: host_mixed_3, User: admin@internal-authz).

Comment 4 Mor 2018-01-28 12:42:58 UTC
I also tried with a clean RHEL 7.5 host added directly to an OVS cluster, and I get the same error message.

Comment 5 Mor 2018-01-28 13:23:48 UTC
RHV version: 4.2.1.3-0.1.el7

Comment 6 Dan Kenigsberg 2018-01-29 15:55:04 UTC
Nothing that is specific to OVS switchType is really urgent.

If a VM becomes 'Not Responsive', it is most likely a qemu or libvirt bug, rather than anything related to OVS.

Is this limited to OVN networks? Can you reproduce with no vNICs at all?

Can you provide versions of kernel, qemu and libvirt (as in any bug that relates to VM non-responsiveness)?

Comment 7 Mor 2018-01-29 16:15:46 UTC
You are right; apparently it is not specific to the OVS cluster at all.
It is reproducible on a VM without vNICs.

Kernel version:
Linux vega04.qa.lab.tlv.redhat.com 3.10.0-830.el7.x86_64 #1 SMP Mon Jan 15 17:16:22 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Release: Red Hat Enterprise Linux Server 7.5 Beta (Maipo)

qemu-kvm-rhev-2.10.0-18.el7.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-img-rhev-2.10.0-18.el7.x86_64
qemu-kvm-common-rhev-2.10.0-18.el7.x86_64
libvirt-daemon-driver-qemu-3.9.0-7.el7.x86_64
libvirt-3.9.0-7.el7.x86_64

Comment 8 Francesco Romani 2018-01-30 15:28:54 UTC
So, it seems to me that, regardless of whether the VM has vNICs, when we shut down a VM:
1. the shutdown fails
2. the VM becomes unresponsive

The shutdown fails because the VM becomes unresponsive, see
https://bugzilla.redhat.com/show_bug.cgi?id=1532277

The problem is why the VM becomes unresponsive in the first place. We had a (bad) bug in virDomainGetMemoryStats(), which has been present in libvirt for a long time but which Vdsm started to use only recently. Let's make sure we have this fix:
https://gerrit.ovirt.org/#/c/85220/

If it happens again even with the right libvirt-python bindings *and* the error is still virDomainGetMemoryStats() being slow/unresponsive, then it is a libvirt issue, because that API is supposed to fetch data from procfs/sysfs (i.e. no QEMU or shared storage access) and thus should be fast and safe.
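
One way to check that point in isolation (outside vdsm) is to time the call directly; a minimal sketch, assuming the fixed libvirt-python bindings are installed and a domain named 'test_vm' (placeholder) is running:

# Minimal sketch for checking the point above in isolation: if memoryStats()
# alone is slow even with fixed python bindings, the problem is in libvirt
# itself rather than in the caller. The domain name is a placeholder.
import time
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('test_vm')

start = time.time()
stats = dom.memoryStats()
elapsed = time.time() - start

# The API is expected to read from procfs/sysfs, so it should return in well
# under a second; anything approaching vdsm's 15s worker timeout points at
# libvirt rather than at vdsm.
print('memoryStats took %.3fs: %r' % (elapsed, stats))

conn.close()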

*** This bug has been marked as a duplicate of bug 1532277 ***