Description of problem: When powering off a VM running on a RHEL7 host via the RHEV GUI, if SIGTERM fails, then VDSM is no longer calling _destroyVmForceful(), as it does in RHEL6. There are conditions where a VM cannot be killed by SIGTERM. VDSM attempts to handle this by checking for a specific error code, VIR_ERR_OPERATION_FAILED, returned by libvirt and then calling _destroyVmForceful() in vm.py. In RHEL6, libvirt returns VIR_ERR_OPERATION_FAILED, but in RHEL7 it isn't, and therefore VDSM never calls _destroyVmForceful(). Version-Release number of selected component (if applicable): - RHEV 3.5.7 - RHEL 7.2 host; vdsm-4.16.32-1.el7 libvirt-1.2.17-13.el7_2.3 How reproducible: If a VM can be placed in a state where SIGTERM fails, then it should fail every time. Steps to Reproduce: 1. Force libvirt to not send a SIGTERM to a guest process ( I used 'gdb' and changed the signal to something else, plus a few other things). 2. In RHEL6, VDSM will report that SIGTERM failed and then it will call _destroyVmForceful() and the guest process will get killed. 3. In RHEL7, VDSM will report that SIGTERM failed and leave the guest process as is, requiring it to be killed manually. Actual results: VM needs to be killed manually. Expected results: VM should be killed regardless. Additional info:
- The following are extracts from qemuDomainDestroyFlags() in libvirt's qemu_driver.c; a) RHEL6; if (flags & VIR_DOMAIN_DESTROY_GRACEFUL) { if (qemuProcessKill(driver, vm, 0) < 0) { virReportError(VIR_ERR_OPERATION_FAILED, "%s", _("failed to kill qemu process with SIGTERM")); priv->beingDestroyed = false; goto cleanup; } b) RHEL7; if (flags & VIR_DOMAIN_DESTROY_GRACEFUL) { if (qemuProcessKill(vm, 0) < 0) { priv->beingDestroyed = false; goto cleanup; } I'm not sure why this changed, I have been able to find a bug in which this was done. - VDSM does the following in 'vm.py'; def _destroyVmGraceful(self): try: self._dom.destroyFlags(libvirt.VIR_DOMAIN_DESTROY_GRACEFUL) except libvirt.libvirtError as e: # after succesfull migraions if (self.lastStatus == vmstatus.DOWN and e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN): self.log.info("VM '%s' already down and destroyed", self.conf['vmId']) else: self.log.warning("Failed to destroy VM '%s' gracefully", self.conf['vmId'], exc_info=True) if e.get_error_code() == libvirt.VIR_ERR_OPERATION_FAILED: return self._destroyVmForceful() return errCode['destroyErr'] return {'status': doneCode} def _destroyVmForceful(self): try: self._dom.destroy() except libvirt.libvirtError: self.log.warning("Failed to destroy VM '%s'", self.conf['vmId'], exc_info=True) return errCode['destroyErr'] return {'status': doneCode} In RHEL6 it reacts to the VIR_ERR_OPERATION_FAILED returned by libvirt and calls _destroyVmForceful(). In RHEL7, in my testing, I see it return VIR_ERR_SYSTEM_ERROR instead, and so _destroyVmForceful() doesn't get called.
Gordon, thank you for the detailed and concise bug description
(In reply to Gordon Watson from comment #0) > Description of problem: > > When powering off a VM running on a RHEL7 host via the RHEV GUI, if SIGTERM > fails, then VDSM is no longer calling _destroyVmForceful(), as it does in > RHEL6. > > There are conditions where a VM cannot be killed by SIGTERM. VDSM attempts > to handle this by checking for a specific error code, > VIR_ERR_OPERATION_FAILED, returned by libvirt and then calling > _destroyVmForceful() in vm.py. > > In RHEL6, libvirt returns VIR_ERR_OPERATION_FAILED, but in RHEL7 it isn't, > and therefore VDSM never calls _destroyVmForceful(). > > > Version-Release number of selected component (if applicable): > > - RHEV 3.5.7 > - RHEL 7.2 host; > vdsm-4.16.32-1.el7 > libvirt-1.2.17-13.el7_2.3 > > > How reproducible: > > If a VM can be placed in a state where SIGTERM fails, then it should fail > every time. > > > Steps to Reproduce: > > 1. Force libvirt to not send a SIGTERM to a guest process ( I used 'gdb' and > changed the signal to something else, plus a few other things). > 2. In RHEL6, VDSM will report that SIGTERM failed and then it will call > _destroyVmForceful() and the guest process will get killed. > 3. In RHEL7, VDSM will report that SIGTERM failed and leave the guest > process as is, requiring it to be killed manually. > > Actual results: > > VM needs to be killed manually. > > > Expected results: > > VM should be killed regardless. > > > Additional info: Thanks for the very good bug report. Looking at the code, I can't recall a good reason for doing logic on return value. Vdsm should try harder to destroy a Vm, after all is what Engine requested, instead of catching up with libvirt return codes. Let's start getting rid of this logic in the destroy path.
Bug 795656 was opened since we used to kill VMs brutally, before even trying to notify the guest. I'd like not to repeat it, and I am afraid that VIR_ERR_SYSTEM_ERROR is a too general error code. If received from libvirt, I am not sure whether a graceful attempt has taken place. Jiri, can we make sure that there was an attempt to notify the guest of the imminent shutdown?
Yes, VIR_ERR_SYSTEM_ERROR is a pretty good indication that libvirt tried to kill the process. However, I'm still not sure what semantics you want to implement. The current semantics of _destroyVmGraceful doesn't seem to be very clear. IMHO it should either try to kill a vm gracefully and return an error if it fails, or it should always kill forcefully if graceful attempt fails. Having an API which tries to kill a vm gracefully and sometimes kills it forcefully if the graceful attempt failed looks pretty strange to me.
(In reply to Jiri Denemark from comment #7) > Yes, VIR_ERR_SYSTEM_ERROR is a pretty good indication that libvirt tried to > kill the process. Fine! I was looking again at this issue, and I come to think that trying again in presence of such error could perhaps only cure the symptoms, not the real disease. I mean: if we asked libvirt to kill a VM gently, and this failed for VIR_ERR_SYSTEM_ERROR, that most likely means that ultimately kill(2) failed; In turn, the most likely cause for such a basic syscall to fail is that the process was stuck in the kernel, perhapse in Z state. If so, would another destroy attempt be beneficial? What does a well-behaving libvirt client do in this case? Retrying harder (virDomainDestroy() - without the VIR_DOMAIN_DESTROY_GRACEFUL) is indeed all what we can do? Jiri, any advice for this case? > However, I'm still not sure what semantics you want to implement. The > current semantics of _destroyVmGraceful doesn't seem to be very clear. IMHO > it should either try to kill a vm gracefully and return an error if it > fails, or it should always kill forcefully if graceful attempt fails. Having > an API which tries to kill a vm gracefully and sometimes kills it forcefully > if the graceful attempt failed looks pretty strange to me. Indeed! This has be improved in related patches.
(In reply to Francesco Romani from comment #8) > I mean: if we asked libvirt to kill a VM gently, and this failed for > VIR_ERR_SYSTEM_ERROR, that most likely means that ultimately kill(2) failed; Well if kill failed and you try again without VIR_DOMAIN_DESTROY_GRACEFUL, it will fail again (unless the reason for the failure is gone). If the failure was caused by a timeout, chances are the next kill will succeed (or not if the process is stuck in D state, for example). > If so, would another destroy attempt be beneficial? What does a > well-behaving libvirt client do in this case? Retrying harder > (virDomainDestroy() - without the VIR_DOMAIN_DESTROY_GRACEFUL) is indeed all > what we can do? It depends on what such client wants to do. If it wants to kill the domain while still trying to be nice, it will probably do virDomainDestroy(VIR_DOMAIN_DESTROY_GRACEFUL) followed by virDomainDestroy(). If the client wants to avoid any possible data loss or corruptions, it should never call virDomainDestroy without GRACEFUL flag.
Thanks Jiri, will work in this direction (https://gerrit.ovirt.org/#/c/55224/)
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions
at this moment, only patch https://gerrit.ovirt.org/#/c/55537/ is not merged. However, https://gerrit.ovirt.org/#/c/55537/ is not needed to fix this bug (but it is a related improvement), and it is scheduled for merge anyway. So it is fair to move this bug as MODIFIED.
Verified with: ENGINE: rhevm-4.0.0.2-0.1.el7ev.noarch HOST: vdsm-4.18.999-31.gitb0bc4c6.el7.centos.x86_64 Followed Gordon's steps on comment 3 using gdb to invoke '_destroyVmGraceful' failure with VIR_ERR_SYSTEM_ERROR. Verified that failure invoked the call to '_destroyVmForceful' and eventually destroyed the vm.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1671.html