Description of problem:
When a qemu process is not responding (qemu monitor commands time out), the state of the VM is not set to not-responding.

Version-Release number of selected component (if applicable):
vdsm-4.17.26-0.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. kill -19 <qemu_pid>

Actual results:
VM is marked as Up.

Expected results:
VM is marked as not responding.

Additional info:
The responsive state is set in virdomain.Notifying when libvirt returns libvirt.VIR_ERR_OPERATION_TIMEOUT. If a call on the dom object does not raise the timeout exception, the VM is marked as Up again. This is a problem because some calls on the dom object are non-blocking, for instance dom.controlInfo() or dom.metadata(). This information is requested quite often, so the VM is always marked as responsive when these calls are triggered; controlInfo is called roughly every 2 seconds.
(In reply to Roman Hodain from comment #0)
> Additional info:
> The responsive state is set in virdomain.Notifying when libvirt returns
> libvirt.VIR_ERR_OPERATION_TIMEOUT. If the call to the dom object does not
> return the time out exception, the VM is marked as in UP state again. This
> is problem as some calls to the dom object are non blocking.
> dom.controlInfo() or dom.metadata() for instance. This information is
> requested pretty often and the VM is always marked as responding when these
> calls are triggered. Control infor is called ~2 sec.

This is correct, but we must make sure that libvirt itself times out, because if it doesn't, there is little Vdsm can do. Vdsm cannot be too smart here, because that could create false positives; the periodic framework (which calls controlInfo, albeit indirectly, as you reported) already has aggressive timeouts, and this is as smart as we can get. So the first step is to make sure that libvirt behaves as we expect.
There is a bug in the responsiveness reporting using bulk stats. The other timeout reporting, which depends on the return value of VM calls, seems OK. Working on a fix.
Please check https://gerrit.ovirt.org/#/c/61310/3//COMMIT_MSG to learn about the bug I found, and how the fix will address it.
(In reply to Francesco Romani from comment #5)
> Please check https://gerrit.ovirt.org/#/c/61310/3//COMMIT_MSG to learn about
> the bug I found, and how the fix will address it.

I do not think that this addresses the bug. The problem is that even if you fix the issue with bulk stats, we still call dom.controlInfo() and dom.metadata(). These two calls periodically set the VM to Up state even if it is completely unresponsive. This is caused by [1].

The Vm class reports the monitor responsiveness in two ways:
    1. if a libvirt operation reports a timeout (a standard libvirt exception)
    2. if one sampling operation hangs for too long

[1] virdomain.py:
...
class Notifying(object):
    ...
    def __getattr__(self, name):
        attr = getattr(self._dom, name)
        if not callable(attr):
            return attr

        def f(*args, **kwargs):
            try:
                ret = attr(*args, **kwargs)
                self._cb(False)
                return ret
            except libvirt.libvirtError as e:
                if e.get_error_code() == libvirt.VIR_ERR_OPERATION_TIMEOUT:
                    self._cb(True)
                    toe = TimeoutError(e.get_error_message())
                    toe.err = e.err
                    raise toe
                raise
        return f

Here we call self._cb(True) if we get libvirt.VIR_ERR_OPERATION_TIMEOUT. _cb sets self._monitorResponse = -1 when passed True. The problem is that we also set it back to responsive after any successful call. And here is the problem: dom.controlInfo() and dom.metadata() are always successful, because libvirt does not need to contact qemu in order to return the result of these calls.

I tested the patches and I see the following:

# virsh -r domcontrol test3
occupied (1295.299s)

So something has been hanging on the monitor for a long time, but:

# vdsClient -s 0 getVmStats 59e35577-57f9-4e89-8c2f-e61b7d98fff6 | grep 'Status\|vmName'
        Status = Up
        vmName = test3

The engine also shows the VM in Up state.
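To make the flapping mechanism above concrete, here is a minimal, self-contained sketch (NOT the actual vdsm code; FakeDom, FakeTimeout and the call names besides controlInfo/metadata are made up for illustration) of why the wrapper keeps flipping back to "responsive": any successful call resets the flag, and calls answered from libvirt's own state always succeed even when the qemu monitor is stuck.

```python
class FakeTimeout(Exception):
    pass

class FakeDom(object):
    """Stands in for a libvirt domain whose qemu monitor is stuck."""
    def metadata(self):
        return "<vm/>"          # answered from libvirt's own state
    def controlInfo(self):
        return [2, 0, 1295299]  # "occupied" - also answered locally
    def fsFreeze(self):
        raise FakeTimeout()     # a call that really needs the monitor

class Notifying(object):
    def __init__(self, dom):
        self._dom = dom
        self.responsive = True
    def __getattr__(self, name):
        attr = getattr(self._dom, name)
        def f(*args, **kwargs):
            try:
                ret = attr(*args, **kwargs)
                self.responsive = True   # <-- reset on ANY success
                return ret
            except FakeTimeout:
                self.responsive = False
                raise
        return f

dom = Notifying(FakeDom())
try:
    dom.fsFreeze()               # monitor call times out...
except FakeTimeout:
    pass
print(dom.responsive)            # False - correctly marked unresponsive
dom.controlInfo()                # ...but the next periodic call succeeds
print(dom.responsive)            # True - flapped back to "responsive"
```

This is exactly the pattern in the report: the timeout marks the VM unresponsive, and the next controlInfo()/metadata() cycle (every ~2 seconds) marks it Up again.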
There is one more interesting thing. libvirt shows occupied (1295.299s), but I would expect libvirt to time out after 30 sec and raise the timeout exception. That actually did not happen.

# grep 'periodic/0' /var/log/vdsm/vdsm.log
MainThread::DEBUG::2016-07-27 14:23:05,981::executor::161::Executor::(__init__) Starting worker periodic/0
periodic/0::DEBUG::2016-07-27 14:23:05,982::executor::176::Executor::(_run) Worker started
vdsm.Scheduler::DEBUG::2016-07-27 14:42:29,142::executor::216::Executor::(_discard) Worker discarded: <Worker name=periodic/0 running <Operation action=<VMBulkSampler at 0x1b4cc50> at 0x1b4cf50> discarded at 0x1b43d10>
vdsm.Scheduler::DEBUG::2016-07-27 14:42:29,142::executor::116::Executor::(_worker_discarded) executor state: count=5 workers=set([<Worker name=periodic/4 waiting at 0x18f2790>, <Worker name=periodic/0 running <Operation action=<VMBulkSampler at 0x1b4cc50> at 0x1b4cf50> discarded at 0x1b43d10>, <Worker name=periodic/2 waiting at 0x1b4c190>, <Worker name=periodic/3 waiting at 0x1b4c550>, <Worker name=periodic/1 waiting at 0x1b43f10>])

So the worker actually never finished. It was just marked as discarded, but it never timed out. I suppose this is a libvirt bug.

# grep toe /var/log/vdsm/vdsm.log | wc -l
0
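The "Worker discarded ... running" log line is consistent with how thread-based executors have to behave (a sketch of the assumed mechanism, not the real vdsm executor): Python threads cannot be killed, so when a worker is stuck inside a blocking libvirt call, the executor can only stop waiting for it and replace it; the discarded thread keeps running until the blocked call itself returns.

```python
import threading

stuck = threading.Event()

def worker():
    stuck.wait()          # stands in for a libvirt call blocked on qemu

t = threading.Thread(target=worker)
t.daemon = True
t.start()

t.join(timeout=0.1)       # the executor's "discard": give up waiting
discarded = t.is_alive()  # the thread is still there, just abandoned
print(discarded)          # True

stuck.set()               # only the callee returning can free the thread
t.join()
print(t.is_alive())       # False
```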
OK, taking back comment 7, as this seems to be expected. The bulk stats call will never time out; only other calls which are waiting for the monitor lock will time out. But if that is right and the logic is correct, then we should never run any other call against the monitor, because we check whether the call is runnable. So we should never get the timeout exception. Am I right?
(In reply to Roman Hodain from comment #6)

Roman, thanks for the followup and for the testing.

> (In reply to Francesco Romani from comment #5)
> > Please check https://gerrit.ovirt.org/#/c/61310/3//COMMIT_MSG to learn about
> > the bug I found, and how the fix will address it.
>
> I do not think that this address the bug. The problem is that even if you
> fix the issue with bulk stats. we still call dom.controlInfo() and
> dom.metadata(). These two call periodically sets the VM to up state event if
> it is completely non-responding.
[...]

This is true, but it is a different story, much harder to fix, and it dates back much earlier than bulk stats. In general, there are calls to libvirt that may not touch the QEMU monitor (metadata and controlInfo are indeed fine examples, but it's not just those), or that behave differently depending on the storage type. This is actually an implementation detail of libvirt that we should not really consider: we should not bother to learn whether some libvirt call touches the QEMU monitor or not; that would be a layering violation. In this regard, our usage of controlInfo is already borderline, and we do that in the bulk stats flow out of practicality, because if we didn't, we could lose monitoring entirely.

But here is where the theory ends; it is also true that in the current form we report incorrect information, as you reported. So let's see what we can do.

> This is caused by [1]
>
> The Vm class reports the monitor responsiveness in two ways:
>     1. if a libvirt operation reports a timeout (a standard libvirt
> exception)
>     2. if one sampling operation hangs for too long
>
> virdomain.py:
> ...
> class Notifying(object):
>     ...
>     def __getattr__(self, name):
>         attr = getattr(self._dom, name)
>         if not callable(attr):
>             return attr
>
>         def f(*args, **kwargs):
>             try:
>                 ret = attr(*args, **kwargs)
>                 self._cb(False)
>                 return ret
>             except libvirt.libvirtError as e:
>                 if e.get_error_code() == libvirt.VIR_ERR_OPERATION_TIMEOUT:
>                     self._cb(True)
>                     toe = TimeoutError(e.get_error_message())
>                     toe.err = e.err
>                     raise toe
>                 raise
>         return f
>
> Here we call self._cb(True) if we get libvirt.VIR_ERR_OPERATION_TIMEOUT.
> __cb sets self._monitorResponse = -1 if passing True. The problem is that we
> also set that to false after any seccssful call. And here is the problem as
> dom.controlInfo() and dom.metadata() are always successful. Libvirt does not
> need to contact qemu in order the return the result for these calls.

Yes, because we don't distinguish between calls, for the reasons I explained above.

> I tested the patches and I see the following:
>
> # virsh -r domcontrol test3
> occupied (1295.299s)
>
> So there something hanging on the monitor for long time but
>
> # vdsClient -s 0 getVmStats 59e35577-57f9-4e89-8c2f-e61b7d98fff6| grep
> 'Status\|vmName'
>         Status = Up
>         vmName = test3
>
> The engine also show the VM in up state.

So the issue is that a libvirt call which doesn't touch the monitor completes, and Vdsm is not capable of distinguishing that - it doesn't even try, and it shouldn't. There are only two possible solutions:
1. distinguish between monitor-affecting calls, and set the responsiveness fields accordingly
2. use only controlInfo() to detect monitor responsiveness

Now: #1 is too fragile and wrong. #2 is the correct usage of the API, which exists for this very purpose, but it requires potentially invasive changes to Vdsm; the backportability of those changes is really questionable, especially for 3.6.9.

Will work on a new series of patches, with backportability in mind, and we'll see.

(In reply to Roman Hodain from comment #8)
> Ok taking back Comment 7 . As this seems to be expected.
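Solution #2 could look roughly like the sketch below: derive monitor responsiveness only from the controlInfo() result. The constants mirror libvirt's virDomainControlState values (defined locally here so the sketch is self-contained); the threshold name and value are hypothetical, not anything from vdsm.

```python
# virDomainControlState values, as documented by libvirt:
VIR_DOMAIN_CONTROL_OK = 0        # monitor free
VIR_DOMAIN_CONTROL_JOB = 1       # busy with an API job
VIR_DOMAIN_CONTROL_OCCUPIED = 2  # busy with a monitor command
VIR_DOMAIN_CONTROL_ERROR = 3     # monitor connection broken

OCCUPIED_TIMEOUT_MS = 30 * 1000  # hypothetical tunable

def is_responsive(control_info):
    """control_info is the (state, details, stateTime) triple that
    dom.controlInfo() returns; stateTime is in milliseconds."""
    state, _details, state_time = control_info
    if state == VIR_DOMAIN_CONTROL_ERROR:
        return False
    if state == VIR_DOMAIN_CONTROL_OCCUPIED:
        # occupied is fine, unless it has been stuck for too long -
        # "occupied (1295.299s)" in the report is far past any timeout.
        return state_time < OCCUPIED_TIMEOUT_MS
    return True

print(is_responsive((VIR_DOMAIN_CONTROL_OK, 0, 0)))              # True
print(is_responsive((VIR_DOMAIN_CONTROL_OCCUPIED, 0, 1295299)))  # False
```

This sidesteps the layering problem entirely: instead of guessing which calls touch the monitor, it asks libvirt the one question the API was designed to answer.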
> The bulkStats will
> never timeout. Only another calls which are waiting for the monitor lock
> will timeout. But if that is right and the logic is correct then we should
> not run any other call to the monitor as we check if the call is runnable.
> So we should never get the timeout exception. Am I right?

The problem here is that the flows are asynchronous with respect to each other (and they need to be, to deliver decent performance), and both NotifyingVirDomain and sampling are used to report responsiveness.
This BZ is for 4.1; all patches merged to master -> MODIFIED.
Verified with rhevm-4.1.0.2-0.2.el7.noarch according to steps of reproduction in description. The vm stays in 'not responding' state and doesn't change at any point.