1309884 – In RHEL7, VDSM is no longer calling _destroyVmForceful() if SIGTERM fails

Bug 1309884 - In RHEL7, VDSM is no longer calling _destroyVmForceful() if SIGTERM fails

Summary: In RHEL7, VDSM is no longer calling _destroyVmForceful() if SIGTERM fails

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	3.5.7
Hardware:	Unspecified
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	ovirt-4.0.0-alpha
Target Release:	4.0.0
Assignee:	Francesco Romani
QA Contact:	sefi litmanovich
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1330557
TreeView+	depends on / blocked

Reported:	2016-02-18 21:45 UTC by Gordon Watson
Modified:	2019-11-14 07:27 UTC (History)
CC List:	14 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1330557 (view as bug list)
Environment:
Last Closed:	2016-08-23 20:18:48 UTC
oVirt Team:	Virt
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2016:1671	normal	SHIPPED_LIVE	VDSM 4.0 GA bug fix and enhancement update	2016-09-02 21:32:03 UTC
oVirt gerrit	53747	'None'	MERGED	vm: destroy: try harder destroying a Vm	2020-10-28 02:09:05 UTC
oVirt gerrit	53931	'None'	MERGED	virt: clean and modernize the destroy() path	2020-10-28 02:09:20 UTC
oVirt gerrit	53932	'None'	MERGED	virt: extract destroyVm helper	2020-10-28 02:09:05 UTC
oVirt gerrit	55224	'None'	MERGED	vm: destroy: retry to gracefully destroy	2020-10-28 02:09:05 UTC
oVirt gerrit	55533	'None'	MERGED	vm: use response module in the destroy path	2020-10-28 02:09:20 UTC
oVirt gerrit	55534	'None'	MERGED	virt: clean and modernize the destroy() path	2020-10-28 02:09:05 UTC
oVirt gerrit	55535	'None'	MERGED	virt: extract destroyVm helper	2020-10-28 02:09:06 UTC
oVirt gerrit	55536	'None'	MERGED	vm: destroy: try harder destroying a Vm	2020-10-28 02:09:06 UTC
oVirt gerrit	55537	'None'	ABANDONED	vm: destroy: retry to gracefully destroy	2020-10-28 02:09:06 UTC

Description Gordon Watson 2016-02-18 21:45:48 UTC

Description of problem:

When powering off a VM running on a RHEL7 host via the RHEV GUI, if SIGTERM fails, then VDSM is no longer calling _destroyVmForceful(), as it does in RHEL6.

There are conditions where a VM cannot be killed by SIGTERM. VDSM attempts to handle this by checking for a specific error code, VIR_ERR_OPERATION_FAILED, returned by libvirt and then calling _destroyVmForceful() in vm.py. 

In RHEL6, libvirt returns VIR_ERR_OPERATION_FAILED, but in RHEL7 it isn't, and therefore VDSM never calls _destroyVmForceful().


Version-Release number of selected component (if applicable):

- RHEV 3.5.7
- RHEL 7.2 host;
    vdsm-4.16.32-1.el7
    libvirt-1.2.17-13.el7_2.3


How reproducible:

If a VM can be placed in a state where SIGTERM fails, then it should fail every time.


Steps to Reproduce:

1. Force libvirt to not send a SIGTERM to a guest process ( I used 'gdb' and changed the signal to something else, plus a few other things).
2. In RHEL6, VDSM will report that SIGTERM failed and then it will call _destroyVmForceful() and the guest process will get killed.
3. In RHEL7, VDSM will report that SIGTERM failed and leave the guest process as is, requiring it to be killed manually. 

Actual results:

VM needs to be killed manually.


Expected results:

VM should be killed regardless.


Additional info:

Comment 1 Gordon Watson 2016-02-18 21:53:21 UTC

-  The following are extracts from qemuDomainDestroyFlags() in libvirt's qemu_driver.c;

a) RHEL6;

    if (flags & VIR_DOMAIN_DESTROY_GRACEFUL) {
        if (qemuProcessKill(driver, vm, 0) < 0) {
            virReportError(VIR_ERR_OPERATION_FAILED, "%s",
                           _("failed to kill qemu process with SIGTERM"));
            priv->beingDestroyed = false;
            goto cleanup;
        }



b) RHEL7;

    if (flags & VIR_DOMAIN_DESTROY_GRACEFUL) {
        if (qemuProcessKill(vm, 0) < 0) {
            priv->beingDestroyed = false;
            goto cleanup;
        }


I'm not sure why this changed, I have been able to find a bug in which this was done. 



-  VDSM does the following in 'vm.py';

    def _destroyVmGraceful(self):
        try:
            self._dom.destroyFlags(libvirt.VIR_DOMAIN_DESTROY_GRACEFUL)
        except libvirt.libvirtError as e:
            # after succesfull migraions
            if (self.lastStatus == vmstatus.DOWN and
               e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN):
                self.log.info("VM '%s' already down and destroyed",
                              self.conf['vmId'])
            else:
                self.log.warning("Failed to destroy VM '%s' gracefully",
                                 self.conf['vmId'], exc_info=True)
                if e.get_error_code() == libvirt.VIR_ERR_OPERATION_FAILED:
                    return self._destroyVmForceful()
                return errCode['destroyErr']
        return {'status': doneCode}

    def _destroyVmForceful(self):
        try:
            self._dom.destroy()
        except libvirt.libvirtError:
            self.log.warning("Failed to destroy VM '%s'",
                             self.conf['vmId'], exc_info=True)
            return errCode['destroyErr']
        return {'status': doneCode}


In RHEL6 it reacts to the VIR_ERR_OPERATION_FAILED returned by libvirt and calls _destroyVmForceful().

In RHEL7, in my testing, I see it return VIR_ERR_SYSTEM_ERROR instead, and so _destroyVmForceful() doesn't get called.

Comment 4 Michal Skrivanek 2016-02-19 10:45:20 UTC

Gordon, thank you for the detailed and concise bug description

Comment 5 Francesco Romani 2016-02-19 11:11:41 UTC

(In reply to Gordon Watson from comment #0)
> Description of problem:
> 
> When powering off a VM running on a RHEL7 host via the RHEV GUI, if SIGTERM
> fails, then VDSM is no longer calling _destroyVmForceful(), as it does in
> RHEL6.
> 
> There are conditions where a VM cannot be killed by SIGTERM. VDSM attempts
> to handle this by checking for a specific error code,
> VIR_ERR_OPERATION_FAILED, returned by libvirt and then calling
> _destroyVmForceful() in vm.py. 
> 
> In RHEL6, libvirt returns VIR_ERR_OPERATION_FAILED, but in RHEL7 it isn't,
> and therefore VDSM never calls _destroyVmForceful().
> 
> 
> Version-Release number of selected component (if applicable):
> 
> - RHEV 3.5.7
> - RHEL 7.2 host;
>     vdsm-4.16.32-1.el7
>     libvirt-1.2.17-13.el7_2.3
> 
> 
> How reproducible:
> 
> If a VM can be placed in a state where SIGTERM fails, then it should fail
> every time.
> 
> 
> Steps to Reproduce:
> 
> 1. Force libvirt to not send a SIGTERM to a guest process ( I used 'gdb' and
> changed the signal to something else, plus a few other things).
> 2. In RHEL6, VDSM will report that SIGTERM failed and then it will call
> _destroyVmForceful() and the guest process will get killed.
> 3. In RHEL7, VDSM will report that SIGTERM failed and leave the guest
> process as is, requiring it to be killed manually. 
> 
> Actual results:
> 
> VM needs to be killed manually.
> 
> 
> Expected results:
> 
> VM should be killed regardless.
> 
> 
> Additional info:

Thanks for the very good bug report. Looking at the code, I can't recall a good reason for doing logic on return value. Vdsm should try harder to destroy a Vm, after all is what Engine requested, instead of catching up with libvirt return codes. Let's start getting rid of this logic in the destroy path.

Comment 6 Dan Kenigsberg 2016-03-01 09:07:00 UTC

Bug 795656 was opened since we used to kill VMs brutally, before even trying to notify the guest. I'd like not to repeat it, and I am afraid that VIR_ERR_SYSTEM_ERROR is a too general error code. If received from libvirt, I am not sure whether a graceful attempt has taken place.

Jiri, can we make sure that there was an attempt to notify the guest of the imminent shutdown?

Comment 7 Jiri Denemark 2016-03-03 12:02:55 UTC

Yes, VIR_ERR_SYSTEM_ERROR is a pretty good indication that libvirt tried to kill the process.

However, I'm still not sure what semantics you want to implement. The current semantics of _destroyVmGraceful doesn't seem to be very clear. IMHO it should either try to kill a vm gracefully and return an error if it fails, or it should always kill forcefully if graceful attempt fails. Having an API which tries to kill a vm gracefully and sometimes kills it forcefully if the graceful attempt failed looks pretty strange to me.

Comment 8 Francesco Romani 2016-03-22 15:55:03 UTC

(In reply to Jiri Denemark from comment #7)
> Yes, VIR_ERR_SYSTEM_ERROR is a pretty good indication that libvirt tried to
> kill the process.

Fine! I was looking again at this issue, and I come to think that trying again in presence of such error could perhaps only cure the symptoms, not the real disease.

I mean: if we asked libvirt to kill a VM gently, and this failed for VIR_ERR_SYSTEM_ERROR, that most likely means that ultimately kill(2) failed;
In turn, the most likely cause for such a basic syscall to fail is that the process was stuck in the kernel, perhapse in Z state.

If so, would another destroy attempt be beneficial? What does a well-behaving libvirt client do in this case? Retrying harder (virDomainDestroy() - without the VIR_DOMAIN_DESTROY_GRACEFUL) is indeed all what we can do?

Jiri, any advice for this case?
 
> However, I'm still not sure what semantics you want to implement. The
> current semantics of _destroyVmGraceful doesn't seem to be very clear. IMHO
> it should either try to kill a vm gracefully and return an error if it
> fails, or it should always kill forcefully if graceful attempt fails. Having
> an API which tries to kill a vm gracefully and sometimes kills it forcefully
> if the graceful attempt failed looks pretty strange to me.

Indeed! This has be improved in related patches.

Comment 9 Jiri Denemark 2016-03-24 15:36:50 UTC

(In reply to Francesco Romani from comment #8)
> I mean: if we asked libvirt to kill a VM gently, and this failed for
> VIR_ERR_SYSTEM_ERROR, that most likely means that ultimately kill(2) failed;

Well if kill failed and you try again without VIR_DOMAIN_DESTROY_GRACEFUL, it will fail again (unless the reason for the failure is gone). If the failure was caused by a timeout, chances are the next kill will succeed (or not if the process is stuck in D state, for example).

> If so, would another destroy attempt be beneficial? What does a
> well-behaving libvirt client do in this case? Retrying harder
> (virDomainDestroy() - without the VIR_DOMAIN_DESTROY_GRACEFUL) is indeed all
> what we can do?

It depends on what such client wants to do. If it wants to kill the domain while still trying to be nice, it will probably do virDomainDestroy(VIR_DOMAIN_DESTROY_GRACEFUL) followed by virDomainDestroy(). If the client wants to avoid any possible data loss or corruptions, it should never call virDomainDestroy without GRACEFUL flag.

Comment 10 Francesco Romani 2016-03-24 16:55:38 UTC

Thanks Jiri, will work in this direction (https://gerrit.ovirt.org/#/c/55224/)

Comment 11 Mike McCune 2016-03-28 22:37:22 UTC

This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 12 Francesco Romani 2016-04-01 09:43:50 UTC

at this moment, only patch https://gerrit.ovirt.org/#/c/55537/ is not merged.
However, https://gerrit.ovirt.org/#/c/55537/ is not needed to fix this bug (but it is a related improvement), and it is scheduled for merge anyway.

So it is fair to move this bug as MODIFIED.

Comment 15 sefi litmanovich 2016-06-07 14:05:26 UTC

Verified with:
ENGINE: rhevm-4.0.0.2-0.1.el7ev.noarch
HOST: vdsm-4.18.999-31.gitb0bc4c6.el7.centos.x86_64

Followed Gordon's steps on comment 3 using gdb to invoke '_destroyVmGraceful' failure with VIR_ERR_SYSTEM_ERROR.
Verified that failure invoked the call to '_destroyVmForceful' and eventually destroyed the vm.

Comment 17 errata-xmlrpc 2016-08-23 20:18:48 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1671.html

Note You need to log in before you can comment on or make changes to this bug.