Bug 1309884

Summary: In RHEL7, VDSM is no longer calling _destroyVmForceful() if SIGTERM fails
Product: Red Hat Enterprise Virtualization Manager Reporter: Gordon Watson <gwatson>
Component: vdsmAssignee: Francesco Romani <fromani>
Status: CLOSED ERRATA QA Contact: sefi litmanovich <slitmano>
Severity: high Docs Contact:
Priority: high    
Version: 3.5.7CC: aandrade, audgiri, bazulay, danken, gklein, jdenemar, lsurette, mgoldboi, michal.skrivanek, rgroten, srevivo, tomas.vanderka, ycui, ykaul
Target Milestone: ovirt-4.0.0-alphaKeywords: ZStream
Target Release: 4.0.0   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1330557 (view as bug list) Environment:
Last Closed: 2016-08-23 20:18:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1330557    

Description Gordon Watson 2016-02-18 21:45:48 UTC
Description of problem:

When powering off a VM running on a RHEL7 host via the RHEV GUI, if SIGTERM fails, then VDSM is no longer calling _destroyVmForceful(), as it does in RHEL6.

There are conditions where a VM cannot be killed by SIGTERM. VDSM attempts to handle this by checking for a specific error code, VIR_ERR_OPERATION_FAILED, returned by libvirt and then calling _destroyVmForceful() in vm.py. 

In RHEL6, libvirt returns VIR_ERR_OPERATION_FAILED, but in RHEL7 it isn't, and therefore VDSM never calls _destroyVmForceful().


Version-Release number of selected component (if applicable):

- RHEV 3.5.7
- RHEL 7.2 host;
    vdsm-4.16.32-1.el7
    libvirt-1.2.17-13.el7_2.3


How reproducible:

If a VM can be placed in a state where SIGTERM fails, then it should fail every time.


Steps to Reproduce:

1. Force libvirt to not send a SIGTERM to a guest process ( I used 'gdb' and changed the signal to something else, plus a few other things).
2. In RHEL6, VDSM will report that SIGTERM failed and then it will call _destroyVmForceful() and the guest process will get killed.
3. In RHEL7, VDSM will report that SIGTERM failed and leave the guest process as is, requiring it to be killed manually. 

Actual results:

VM needs to be killed manually.


Expected results:

VM should be killed regardless.


Additional info:

Comment 1 Gordon Watson 2016-02-18 21:53:21 UTC
-  The following are extracts from qemuDomainDestroyFlags() in libvirt's qemu_driver.c;

a) RHEL6;

    if (flags & VIR_DOMAIN_DESTROY_GRACEFUL) {
        if (qemuProcessKill(driver, vm, 0) < 0) {
            virReportError(VIR_ERR_OPERATION_FAILED, "%s",
                           _("failed to kill qemu process with SIGTERM"));
            priv->beingDestroyed = false;
            goto cleanup;
        }



b) RHEL7;

    if (flags & VIR_DOMAIN_DESTROY_GRACEFUL) {
        if (qemuProcessKill(vm, 0) < 0) {
            priv->beingDestroyed = false;
            goto cleanup;
        }


I'm not sure why this changed, I have been able to find a bug in which this was done. 



-  VDSM does the following in 'vm.py';

    def _destroyVmGraceful(self):
        try:
            self._dom.destroyFlags(libvirt.VIR_DOMAIN_DESTROY_GRACEFUL)
        except libvirt.libvirtError as e:
            # after succesfull migraions
            if (self.lastStatus == vmstatus.DOWN and
               e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN):
                self.log.info("VM '%s' already down and destroyed",
                              self.conf['vmId'])
            else:
                self.log.warning("Failed to destroy VM '%s' gracefully",
                                 self.conf['vmId'], exc_info=True)
                if e.get_error_code() == libvirt.VIR_ERR_OPERATION_FAILED:
                    return self._destroyVmForceful()
                return errCode['destroyErr']
        return {'status': doneCode}

    def _destroyVmForceful(self):
        try:
            self._dom.destroy()
        except libvirt.libvirtError:
            self.log.warning("Failed to destroy VM '%s'",
                             self.conf['vmId'], exc_info=True)
            return errCode['destroyErr']
        return {'status': doneCode}


In RHEL6 it reacts to the VIR_ERR_OPERATION_FAILED returned by libvirt and calls _destroyVmForceful().

In RHEL7, in my testing, I see it return VIR_ERR_SYSTEM_ERROR instead, and so _destroyVmForceful() doesn't get called.

Comment 4 Michal Skrivanek 2016-02-19 10:45:20 UTC
Gordon, thank you for the detailed and concise bug description

Comment 5 Francesco Romani 2016-02-19 11:11:41 UTC
(In reply to Gordon Watson from comment #0)
> Description of problem:
> 
> When powering off a VM running on a RHEL7 host via the RHEV GUI, if SIGTERM
> fails, then VDSM is no longer calling _destroyVmForceful(), as it does in
> RHEL6.
> 
> There are conditions where a VM cannot be killed by SIGTERM. VDSM attempts
> to handle this by checking for a specific error code,
> VIR_ERR_OPERATION_FAILED, returned by libvirt and then calling
> _destroyVmForceful() in vm.py. 
> 
> In RHEL6, libvirt returns VIR_ERR_OPERATION_FAILED, but in RHEL7 it isn't,
> and therefore VDSM never calls _destroyVmForceful().
> 
> 
> Version-Release number of selected component (if applicable):
> 
> - RHEV 3.5.7
> - RHEL 7.2 host;
>     vdsm-4.16.32-1.el7
>     libvirt-1.2.17-13.el7_2.3
> 
> 
> How reproducible:
> 
> If a VM can be placed in a state where SIGTERM fails, then it should fail
> every time.
> 
> 
> Steps to Reproduce:
> 
> 1. Force libvirt to not send a SIGTERM to a guest process ( I used 'gdb' and
> changed the signal to something else, plus a few other things).
> 2. In RHEL6, VDSM will report that SIGTERM failed and then it will call
> _destroyVmForceful() and the guest process will get killed.
> 3. In RHEL7, VDSM will report that SIGTERM failed and leave the guest
> process as is, requiring it to be killed manually. 
> 
> Actual results:
> 
> VM needs to be killed manually.
> 
> 
> Expected results:
> 
> VM should be killed regardless.
> 
> 
> Additional info:

Thanks for the very good bug report. Looking at the code, I can't recall a good reason for doing logic on return value. Vdsm should try harder to destroy a Vm, after all is what Engine requested, instead of catching up with libvirt return codes. Let's start getting rid of this logic in the destroy path.

Comment 6 Dan Kenigsberg 2016-03-01 09:07:00 UTC
Bug 795656 was opened since we used to kill VMs brutally, before even trying to notify the guest. I'd like not to repeat it, and I am afraid that VIR_ERR_SYSTEM_ERROR is a too general error code. If received from libvirt, I am not sure whether a graceful attempt has taken place.

Jiri, can we make sure that there was an attempt to notify the guest of the imminent shutdown?

Comment 7 Jiri Denemark 2016-03-03 12:02:55 UTC
Yes, VIR_ERR_SYSTEM_ERROR is a pretty good indication that libvirt tried to kill the process.

However, I'm still not sure what semantics you want to implement. The current semantics of _destroyVmGraceful doesn't seem to be very clear. IMHO it should either try to kill a vm gracefully and return an error if it fails, or it should always kill forcefully if graceful attempt fails. Having an API which tries to kill a vm gracefully and sometimes kills it forcefully if the graceful attempt failed looks pretty strange to me.

Comment 8 Francesco Romani 2016-03-22 15:55:03 UTC
(In reply to Jiri Denemark from comment #7)
> Yes, VIR_ERR_SYSTEM_ERROR is a pretty good indication that libvirt tried to
> kill the process.

Fine! I was looking again at this issue, and I come to think that trying again in presence of such error could perhaps only cure the symptoms, not the real disease.

I mean: if we asked libvirt to kill a VM gently, and this failed for VIR_ERR_SYSTEM_ERROR, that most likely means that ultimately kill(2) failed;
In turn, the most likely cause for such a basic syscall to fail is that the process was stuck in the kernel, perhapse in Z state.

If so, would another destroy attempt be beneficial? What does a well-behaving libvirt client do in this case? Retrying harder (virDomainDestroy() - without the VIR_DOMAIN_DESTROY_GRACEFUL) is indeed all what we can do?

Jiri, any advice for this case?
 
> However, I'm still not sure what semantics you want to implement. The
> current semantics of _destroyVmGraceful doesn't seem to be very clear. IMHO
> it should either try to kill a vm gracefully and return an error if it
> fails, or it should always kill forcefully if graceful attempt fails. Having
> an API which tries to kill a vm gracefully and sometimes kills it forcefully
> if the graceful attempt failed looks pretty strange to me.

Indeed! This has be improved in related patches.

Comment 9 Jiri Denemark 2016-03-24 15:36:50 UTC
(In reply to Francesco Romani from comment #8)
> I mean: if we asked libvirt to kill a VM gently, and this failed for
> VIR_ERR_SYSTEM_ERROR, that most likely means that ultimately kill(2) failed;

Well if kill failed and you try again without VIR_DOMAIN_DESTROY_GRACEFUL, it will fail again (unless the reason for the failure is gone). If the failure was caused by a timeout, chances are the next kill will succeed (or not if the process is stuck in D state, for example).

> If so, would another destroy attempt be beneficial? What does a
> well-behaving libvirt client do in this case? Retrying harder
> (virDomainDestroy() - without the VIR_DOMAIN_DESTROY_GRACEFUL) is indeed all
> what we can do?

It depends on what such client wants to do. If it wants to kill the domain while still trying to be nice, it will probably do virDomainDestroy(VIR_DOMAIN_DESTROY_GRACEFUL) followed by virDomainDestroy(). If the client wants to avoid any possible data loss or corruptions, it should never call virDomainDestroy without GRACEFUL flag.

Comment 10 Francesco Romani 2016-03-24 16:55:38 UTC
Thanks Jiri, will work in this direction (https://gerrit.ovirt.org/#/c/55224/)

Comment 11 Mike McCune 2016-03-28 22:37:22 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 12 Francesco Romani 2016-04-01 09:43:50 UTC
at this moment, only patch https://gerrit.ovirt.org/#/c/55537/ is not merged.
However, https://gerrit.ovirt.org/#/c/55537/ is not needed to fix this bug (but it is a related improvement), and it is scheduled for merge anyway.

So it is fair to move this bug as MODIFIED.

Comment 15 sefi litmanovich 2016-06-07 14:05:26 UTC
Verified with:
ENGINE: rhevm-4.0.0.2-0.1.el7ev.noarch
HOST: vdsm-4.18.999-31.gitb0bc4c6.el7.centos.x86_64

Followed Gordon's steps on comment 3 using gdb to invoke '_destroyVmGraceful' failure with VIR_ERR_SYSTEM_ERROR.
Verified that failure invoked the call to '_destroyVmForceful' and eventually destroyed the vm.

Comment 17 errata-xmlrpc 2016-08-23 20:18:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1671.html