Bug 1357798 - VMs are not reported as non-responding even though the qemu process does not respond.
Summary: VMs are not reported as non-responding even though the qemu process does not respond
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.6.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-4.1.0-alpha
Target Release: ---
Assignee: Francesco Romani
QA Contact: sefi litmanovich
URL:
Whiteboard:
Depends On:
Blocks: 1360986 1360990
 
Reported: 2016-07-19 08:13 UTC by Roman Hodain
Modified: 2020-07-16 08:53 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
This update fixes an error in the monitoring code that caused VDSM to incorrectly report a QEMU process as recovered and responsive after a short period of unavailability, even though it was still unresponsive.
Clone Of:
: 1360986 1360990 (view as bug list)
Environment:
Last Closed: 2017-04-25 00:47:46 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:0998 0 normal SHIPPED_LIVE VDSM bug fix and enhancement update 4.1 GA 2017-04-18 20:11:39 UTC
oVirt gerrit 61309 0 master MERGED tests: sampling: add FakeClock helper 2020-08-03 23:30:46 UTC
oVirt gerrit 61310 0 master MERGED vm: periodic: fix stats age reporting 2020-08-03 23:30:45 UTC
oVirt gerrit 61420 0 master MERGED virt: sampling: add is_empty() method to StatsSample 2020-08-03 23:30:45 UTC
oVirt gerrit 61592 0 master ABANDONED virt: periodic health check - WIP 2020-08-03 23:30:45 UTC
oVirt gerrit 61820 0 ovirt-4.0 MERGED tests: sampling: add FakeClock helper 2020-08-03 23:30:45 UTC
oVirt gerrit 61821 0 ovirt-4.0 MERGED virt: sampling: add is_empty() method to StatsSample 2020-08-03 23:30:45 UTC
oVirt gerrit 61822 0 ovirt-4.0 MERGED vm: periodic: fix stats age reporting 2020-08-03 23:30:45 UTC

Description Roman Hodain 2016-07-19 08:13:14 UTC
Description of problem:
     When a qemu process is not responding (qemu monitor commands time out), the state of the VM is not set to non-responding.

Version-Release number of selected component (if applicable):
     vdsm-4.17.26-0.el7ev.noarch

How reproducible:
     100%

Steps to Reproduce:
     1. kill -19 <qemu_pid>   (signal 19 is SIGSTOP; it suspends the qemu process, so its monitor stops responding)

Actual results:
     VM is marked as UP

Expected results:
     VM is marked as non-responding

Additional info:
     The responsive state is set in virdomain.Notifying when libvirt returns libvirt.VIR_ERR_OPERATION_TIMEOUT. If a call to the dom object does not raise the timeout exception, the VM is marked as being in the UP state again. This is a problem, as some calls to the dom object are non-blocking, for instance dom.controlInfo() or dom.metadata(). This information is requested quite often, and the VM is always marked as responding when these calls are triggered. controlInfo is called roughly every 2 seconds.
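
For illustration only, here is a condensed sketch of the behaviour described above (this is not the actual vdsm code; the real wrapper, virdomain.Notifying, is quoted in full in comment 6). Any wrapped call that returns without libvirt.VIR_ERR_OPERATION_TIMEOUT clears the unresponsive flag, including calls that libvirt can answer without ever touching the QEMU monitor:

import libvirt

def wrap_call(call, report_unresponsive):
    # report_unresponsive plays the role of the Notifying callback (_cb)
    def wrapped(*args, **kwargs):
        try:
            ret = call(*args, **kwargs)
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_OPERATION_TIMEOUT:
                report_unresponsive(True)
            raise
        # controlInfo()/metadata() always end up here, even while qemu
        # is frozen, so the VM flips back to "Up".
        report_unresponsive(False)
        return ret
    return wrapped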

Comment 2 Francesco Romani 2016-07-20 06:43:14 UTC
(In reply to Roman Hodain from comment #0)
> Additional info:
>      The responsive state is set in virdomain.Notifying when libvirt returns
> libvirt.VIR_ERR_OPERATION_TIMEOUT. If a call to the dom object does not raise
> the timeout exception, the VM is marked as being in the UP state again. This
> is a problem, as some calls to the dom object are non-blocking, for instance
> dom.controlInfo() or dom.metadata(). This information is requested quite
> often, and the VM is always marked as responding when these calls are
> triggered. controlInfo is called roughly every 2 seconds.

This is correct, but we must make sure that libvirt itself times out, because if it doesn't, there is little Vdsm can do.
Vdsm cannot be too smart here, because this can create false positives; the periodic framework (which, albeit indirectly, calls controlInfo as you reported) already has aggressive timeouts, and this is as smart as we can get.

So the first step is to make sure that libvirt behaves as we expect.
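
For illustration, a generic sketch of the aggressive-timeout pattern mentioned above (not vdsm's actual executor code): a periodic operation gets a deadline, and a worker that exceeds it is given up on rather than waited for indefinitely:

from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

pool = ThreadPoolExecutor(max_workers=5)  # stands in for the periodic/N workers

def run_with_deadline(operation, deadline):
    # operation: a callable performing one periodic sampling step
    # deadline:  seconds the caller is willing to wait for it
    future = pool.submit(operation)
    try:
        return future.result(timeout=deadline)
    except FutureTimeout:
        # The worker thread cannot be interrupted; it keeps running, but the
        # caller stops waiting and can flag it as stuck/discarded (compare
        # the "Worker discarded" log lines in comment 7 below).
        return None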

Comment 4 Francesco Romani 2016-07-25 09:19:28 UTC
There is a bug in the responsiveness reporting using bulk stats.
The other timeout reporting, which depends on the return value of the VM calls, seems OK.

Working on a fix.

Comment 5 Francesco Romani 2016-07-26 12:31:10 UTC
Please check https://gerrit.ovirt.org/#/c/61310/3//COMMIT_MSG to learn about the bug I found, and how the fix will address it.
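
Broadly, the bulk-stats responsiveness check works off the age of the newest sample collected for each VM: if that sample is older than a threshold, the monitor is presumed stuck. A minimal sketch of that general idea follows (a simplification, not the actual patch; STATS_MAX_AGE is a hypothetical name, not a real vdsm constant):

import time

STATS_MAX_AGE = 40.0  # seconds, hypothetical threshold

def monitor_responsive(last_sample_timestamp, now=None):
    # last_sample_timestamp: monotonic time at which the newest bulk-stats
    # sample for this VM was collected
    if now is None:
        now = time.monotonic()
    return (now - last_sample_timestamp) <= STATS_MAX_AGE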

Comment 6 Roman Hodain 2016-07-27 13:17:02 UTC
(In reply to Francesco Romani from comment #5)
> Please check https://gerrit.ovirt.org/#/c/61310/3//COMMIT_MSG to learn about
> the bug I found, and how the fix will address it.

I do not think that this addresses the bug. The problem is that even if you fix the issue with bulk stats, we still call dom.controlInfo() and dom.metadata(). These two calls periodically set the VM to the up state even if it is completely non-responding.

This is caused by [1]

The Vm class reports the monitor responsiveness in two ways:
1. if a libvirt operation reports a timeout (a standard libvirt
exception)
2. if one sampling operation hangs for too long

virdomain.py:
...
class Notifying(object):
...
    def __getattr__(self, name):
        attr = getattr(self._dom, name)
        if not callable(attr):
            return attr

        def f(*args, **kwargs):
            try:
                ret = attr(*args, **kwargs)
                self._cb(False)
                return ret
            except libvirt.libvirtError as e:
                if e.get_error_code() == libvirt.VIR_ERR_OPERATION_TIMEOUT:
                    self._cb(True)
                    toe = TimeoutError(e.get_error_message())
                    toe.err = e.err
                    raise toe
                raise
        return f

Here we call self._cb(True) if we get libvirt.VIR_ERR_OPERATION_TIMEOUT. _cb sets self._monitorResponse = -1 when passed True. The problem is that we also call self._cb(False) after any successful call, and dom.controlInfo() and dom.metadata() are always successful; libvirt does not need to contact qemu in order to return the result of these calls.
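
For context, the callback mentioned above maps that boolean onto the reported monitor state, roughly like this (simplified; the exact vdsm code may differ slightly):

class Vm(object):
    def __init__(self):
        self._monitorResponse = 0  # 0 = responsive, -1 = not responding

    def _timeoutExperienced(self, timeout):
        # passed to virdomain.Notifying as self._cb
        self._monitorResponse = -1 if timeout else 0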

I tested the patches and I see the following:

    # virsh -r domcontrol test3
      occupied (1295.299s)
So something has been hanging on the monitor for a long time, but

    # vdsClient -s 0 getVmStats 59e35577-57f9-4e89-8c2f-e61b7d98fff6| grep 'Status\|vmName'
	Status = Up
	vmName = test3

The engine also shows the VM in the up state.

Comment 7 Roman Hodain 2016-07-27 13:28:46 UTC
There is one more interesting thing: libvirt shows occupied (1295.299s), but I expected libvirt to time out after 30 sec and raise the timeout exception. That actually did not happen.

# grep 'periodic/0' /var/log/vdsm/vdsm.log 
MainThread::DEBUG::2016-07-27 14:23:05,981::executor::161::Executor::(__init__) Starting worker periodic/0
periodic/0::DEBUG::2016-07-27 14:23:05,982::executor::176::Executor::(_run) Worker started
vdsm.Scheduler::DEBUG::2016-07-27 14:42:29,142::executor::216::Executor::(_discard) Worker discarded: <Worker name=periodic/0 running <Operation action=<VMBulkSampler at 0x1b4cc50> at 0x1b4cf50> discarded at 0x1b43d10>
vdsm.Scheduler::DEBUG::2016-07-27 14:42:29,142::executor::116::Executor::(_worker_discarded) executor state: count=5 workers=set([<Worker name=periodic/4 waiting at 0x18f2790>, <Worker name=periodic/0 running <Operation action=<VMBulkSampler at 0x1b4cc50> at 0x1b4cf50> discarded at 0x1b43d10>, <Worker name=periodic/2 waiting at 0x1b4c190>, <Worker name=periodic/3 waiting at 0x1b4c550>, <Worker name=periodic/1 waiting at 0x1b43f10>])

So the worker actually never finished. It was just marked as discarded, but it never timed out. I suppose that this is a bug in libvirt.

# grep toe /var/log/vdsm/vdsm.log |wc -l
0

Comment 8 Roman Hodain 2016-07-27 13:56:16 UTC
Ok, taking back comment 7, as this seems to be expected. The bulkStats call will never time out; only other calls which are waiting for the monitor lock will time out. But if that is right and the logic is correct, then we should never run any other call to the monitor, as we check whether the call is runnable. So we should never get the timeout exception. Am I right?

Comment 9 Francesco Romani 2016-07-27 14:08:53 UTC
(In reply to Roman Hodain from comment #6)

Roman, thanks for the followup and for the testing.

> (In reply to Francesco Romani from comment #5)
> > Please check https://gerrit.ovirt.org/#/c/61310/3//COMMIT_MSG to learn about
> > the bug I found, and how the fix will address it.
> 
> I do not think that this addresses the bug. The problem is that even if you
> fix the issue with bulk stats, we still call dom.controlInfo() and
> dom.metadata(). These two calls periodically set the VM to the up state even
> if it is completely non-responding. [...]

This is true, but it is a different story, much harder to fix, and it dates back much earlier than bulk stats.
In general, there are calls to libvirt that may not touch the QEMU monitor (metadata and controlInfo are indeed fine examples, but they are not the only ones), or that can behave differently depending on the storage type.

And this is actually an implementation detail of libvirt that we should not really take into consideration.
We should not bother to learn whether or not a given libvirt call touches the QEMU monitor; that would be a layering violation.

In this regard, our usage of controlInfo is already borderline, and we do that in the bulk stats flow out of practicality, because if we don't we could lose monitoring entirely.

But here is where the theory ends; it is also true that in the current form we report incorrect information, as you have shown. So let's see what we can do.

> This is caused by [1]
> 
> The Vm class reports the monitor responsiveness in two ways:
> 1. if a libvirt operation reports a timeout (a standard libvirt
> exception)
> 2. if one sampling operation hangs for too long
> 
> virdomain.py:
> ...
> class Notifying(object):
> ...
>     def __getattr__(self, name):
>         attr = getattr(self._dom, name)
>         if not callable(attr):
>             return attr
> 
>         def f(*args, **kwargs):
>             try:
>                 ret = attr(*args, **kwargs)
>                 self._cb(False)
>                 return ret
>             except libvirt.libvirtError as e:
>                 if e.get_error_code() == libvirt.VIR_ERR_OPERATION_TIMEOUT:
>                     self._cb(True)
>                     toe = TimeoutError(e.get_error_message())
>                     toe.err = e.err
>                     raise toe
>                 raise
>         return f
> 
> Here we call self._cb(True) if we get libvirt.VIR_ERR_OPERATION_TIMEOUT.
> _cb sets self._monitorResponse = -1 when passed True. The problem is that we
> also call self._cb(False) after any successful call, and dom.controlInfo()
> and dom.metadata() are always successful; libvirt does not need to contact
> qemu in order to return the result of these calls.

Yes, because we don't distinguish between calls for the reasons I explained above.

> I tested the patches and I see the following:
> 
>     # virsh -r domcontrol test3
>       occupied (1295.299s)
> So something has been hanging on the monitor for a long time, but
> 
>     # vdsClient -s 0 getVmStats 59e35577-57f9-4e89-8c2f-e61b7d98fff6| grep
> 'Status\|vmName'
> 	Status = Up
> 	vmName = test3
> 
> The engine also shows the VM in the up state.

So the issue is that a libvirt call which doesn't touch the monitor completes, and Vdsm is not capable of distinguishing that - it doesn't even try, and it shouldn't.
There are only two possible solutions:
1. distinguish between monitor-affecting calls, set responsive fields accordingly
2. use only controlInfo() to detect monitor responsiveness

Now: #1 is too fragile and wrong. #2 is the correct usage of the API, which exists for this very purpose, but it requires potentially invasive changes to Vdsm;
the backportability of those changes is really questionable, especially for 3.6.9.
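
To make option #2 concrete, here is a minimal sketch under stated assumptions: dom.controlInfo() returns [state, details, stateTime], stateTime is in milliseconds, and MAX_OCCUPIED_TIME is a hypothetical threshold, not an existing vdsm constant:

import libvirt

MAX_OCCUPIED_TIME = 30 * 1000  # milliseconds, hypothetical threshold

def monitor_responsive(dom):
    state, _details, state_time = dom.controlInfo()
    if state == libvirt.VIR_DOMAIN_CONTROL_ERROR:
        return False
    if state == libvirt.VIR_DOMAIN_CONTROL_OCCUPIED:
        # "occupied (1295.299s)" in the virsh output above corresponds to a
        # monitor stuck for over 20 minutes, well past any sane threshold.
        return state_time <= MAX_OCCUPIED_TIME
    return True

Where such a check would be plugged in, and how the threshold would be chosen, is exactly the potentially invasive part.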

Will work on a new series of patches, with backportability in mind, and we'll see.

(In reply to Roman Hodain from comment #8)
> Ok taking back Comment 7 . As this seems to be expected. The bulkStats will
> never timeout. Only another calls which are waiting for the monitor lock
> will timeout. But if that is right and the logic is correct then we should
> not run any other call to the monitor as we check if the call is runnable.
> So we should never get the timeout exception. Am I right?

The problem here is that the flows are asynchronous to each other (and they need to be in order to deliver decent performance), and both NotifyingVirDomain and sampling are used to report responsiveness.

Comment 12 Francesco Romani 2016-08-12 07:25:57 UTC
This BZ is for 4.1; all patches are merged to master -> MODIFIED.

Comment 15 sefi litmanovich 2017-01-26 16:21:42 UTC
Verified with rhevm-4.1.0.2-0.2.el7.noarch according to the reproduction steps in the description.
The VM stays in the 'not responding' state and doesn't change at any point.

