Description of problem:
When issuing multiple Nova deletes, some guests go into ShutOff state instead of being deleted.

Version-Release number of selected component (if applicable):
[root@pcloud16 ~]# libvirtd --version
libvirtd (libvirt) 0.10.2
[root@pcloud16 ~]# nova --version
2.15.0
[root@pcloud16 ~]# rpm -qa | grep nova
openstack-nova-compute-2013.2.2-2.el6ost.noarch
python-novaclient-2.15.0-2.el6ost.noarch
python-nova-2013.2.2-2.el6ost.noarch
openstack-nova-common-2013.2.2-2.el6ost.noarch
[root@pcloud16 ~]#

How reproducible:
Depends on the volume of deletes happening concurrently.

Steps to Reproduce:
0. RHEL OSP env: 1x controller, 1x neutron server, 1x compute node
1. Launch multiple guests (I see this at 40 guests on a single compute node)
2. Connect via SSH to the guests (I am running netperf between two guests)
3. Delete all guests

Actual results:
Guest gets put into ShutOff state.

libvirtd logs:
30610: warning : qemuProcessKill:4174 : Timed out waiting after SIGTERM to process 26111, sending SIGKILL
2014-03-05 07:31:42.466+0000: 30610: warning : qemuProcessKill:4206 : Timed out waiting after SIGKILL to process 26111
2014-03-05 07:31:42.466+0000: 30610: error : qemuDomainDestroyFlags:2098 : operation failed: failed to kill qemu process with SIGTERM

Nova log:
2855 ERROR nova.virt.libvirt.driver [req-0e54e859-0a4e-42e9-a75b-4850efe4e6d1 313a479fb18348f980cc6870e5051104 fabc2431f2e2412b96459c918f749f54] [instance: cc4f67e0-fc8e-495b-a507-649699d46ed7] Error from libvirt during destroy. Code=9 Error=operation failed: failed to kill qemu process with SIGTERM

Expected results:
Nova should delete the guest, or keep trying to remove it.

Additional info:
This doesn't look like a bug in qemu-kvm. It looks more like the host is so busy that processes don't die quickly enough, even after SIGKILL. I believe deleting a guest in Nova means destroying the domain and then undefining it. It seems that Nova calls virDomainDestroyFlags() and, since that call fails (due to the timeout), skips the rest of the deletion. Once the qemu-kvm process actually dies, Nova then reports the guest as ShutOff. A possible solution would be to move the rest of the deletion process into a VIR_DOMAIN_EVENT_STOPPED handler, which would ensure the domain is deleted once it stops. Another option would be to introduce a new flag for virDomainDestroyFlags() that makes it wait until the process dies instead of giving up after 15 seconds.
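The event-handler idea above can be sketched roughly as follows. This is a minimal illustration only, not Nova's actual driver code: `DeletionTracker`, `request_delete`, and `on_lifecycle_event` are made-up names, and a plain integer stands in for libvirt's real VIR_DOMAIN_EVENT_STOPPED constant. The point is just to show cleanup being deferred until the domain is observed to stop, instead of being skipped when destroy times out.

```python
# Sketch: defer the undefine/cleanup step to a lifecycle-event handler
# instead of doing it inline after virDomainDestroyFlags() returns.
# All names here are illustrative, not taken from Nova or libvirt-python.

VIR_DOMAIN_EVENT_STOPPED = 5  # stand-in for the real libvirt constant


class DeletionTracker:
    """Remembers which guests are pending deletion so that an event
    handler can finish the job once the qemu process finally dies."""

    def __init__(self):
        self._pending = set()

    def request_delete(self, domain_name, destroy):
        """Attempt the destroy; on failure, leave the guest queued."""
        self._pending.add(domain_name)
        try:
            destroy()  # may time out while the host is overloaded
        except RuntimeError:
            # Do NOT give up: the guest stays in _pending and the
            # lifecycle-event handler completes the deletion later.
            pass

    def on_lifecycle_event(self, domain_name, event, undefine):
        """Called from the event loop when a domain changes state."""
        if event == VIR_DOMAIN_EVENT_STOPPED and domain_name in self._pending:
            undefine()
            self._pending.discard(domain_name)

    def is_pending(self, domain_name):
        return domain_name in self._pending
```

In the failing scenario from the logs, `destroy()` raises because SIGTERM/SIGKILL timed out; the guest stays queued, and when the STOPPED event eventually arrives the `undefine` callback runs and the deletion completes rather than leaving the guest in ShutOff.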
Dan, can you take a look at this and see if nova's libvirt driver should be changed per Jiri's comment #4?
It seems Nova could simply retry the destroy operation, and/or undefine the guest regardless of the destroy result.
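The retry suggestion can be sketched like this. `destroy_with_retry` is a hypothetical helper, not a Nova API; real code would pass a callable that invokes `destroyFlags()` on a libvirt domain object and would catch libvirt's own error type rather than `RuntimeError`.

```python
import time


def destroy_with_retry(destroy, attempts=5, delay=1.0, sleep=time.sleep):
    """Retry a destroy callable that may fail while the host is too
    loaded for the qemu process to die within libvirt's timeout.

    Re-raises the last error if every attempt fails, so the caller
    can still surface the failure instead of silently giving up.
    """
    for attempt in range(1, attempts + 1):
        try:
            destroy()
            return True
        except RuntimeError:
            if attempt == attempts:
                raise
            sleep(delay * attempt)  # simple linear backoff between tries
    return False
```

On an overloaded host the first attempt or two may hit the 15-second timeout, but once the process reaper catches up a later attempt succeeds and deletion can proceed normally.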
By email we discussed whether or not this was still an issue with a newer version of Nova. I don't think I heard back on that. If it is, I think this is something that should just be tracked upstream. Please report a bug on launchpad if you're able to reproduce. Thanks!
Eduard, we haven't received any further information from you about the customer's issue in response to Dan's comment. As a result I am closing this bug.
A bug has been opened with the upstream OpenStack community: https://bugs.launchpad.net/nova/+bug/1387950